Message boards :
ATLAS application :
Download failures
Author | Message |
---|---|
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
I think I know what's causing this. Recently many ATLAS WUs were cancelled on the ATLAS side, and although the input files were removed, it looks like the WUs were not properly cancelled on the BOINC server. I will check why this is not working properly and fix it. |
Send message Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,757 |
The server status shows the number of available new tasks dropping from 3k to 600 within a few hours! 17:50 UTC |
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
On my side, out of the 4 files that are downloaded for each WU, all download OK except the big one (200 MB or more). We are the product of random evolution. |
Send message Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 62,755 |
This morning one of my clients got stuck with the following message: Do 06 Dez 2018 06:39:44 CET | LHC@home | Temporarily failed download of jf_25987fd8f8e73d45af1bb48d3c3e33a9: transient HTTP error A filename without any extension seems unusual for an ATLAS task. Reference: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=104421083 I had to cancel the download manually. Other volunteers should be aware that the same WU is still active and could lead to a download lock. |
Send message Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,757 |
+1, also cancelled. All later tasks are downloading normally. |
Send message Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 62,755 |
@David Cameron It could be a good idea to query the server DB for those malformed filenames and cancel the WUs before they are sent out again and again. Thanks in advance. |
Send message Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,757 |
Since 18:50 UTC, some download errors are back! |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
We got another burst of cancelled tasks - I'm cancelling them on the BOINC server now. |
Send message Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,757 |
David, our ATLAS stats still show download errors from June and July 2018 (WorkUnit not found), for example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=199995337 Before those months come around again in 2019, would it be a lot of work to remove them from the database? Thank you. |
Send message Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,757 |
Today there are occasional download errors, with both Windows and native Linux: WU download error: couldn't get input files: <file_xfer_error> <file_name>515KDmBJBfunlyackoJh5iwnABFKDmABFKDmnSSXDmABFKDmwBaSwm_EVNT.16926746._000684.pool.root.1</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error> |
Send message Joined: 7 Aug 14 Posts: 27 Credit: 10,000,233 RAC: 2,828 |
Seeing a lot of these download errors today. As a side note, it says the max number of errors is 3, but it always seems to send out a 4th attempt, which also fails. Shouldn't it stop and error out when it hits 3 errors? Example here: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=111108234 |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
Seeing a lot of these download errors today. I just attached a machine again to get native ATLAS. The first one gave a download error, and then the next four were OK. https://lhcathome.cern.ch/lhcathome/result.php?resultid=221919833 The failed WU errored out on all four machines that got it. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=111105143 EDIT: At around the same time, I got four Native Theory failures also, but with a different error message. I don't know if there is any relationship or not. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10594192 |
Send message Joined: 28 Sep 04 Posts: 728 Credit: 48,810,609 RAC: 22,134 |
I'm getting some download errors for tasks that had a timeout on the first host they were sent to. So the files disappeared from the server after one week. |
Send message Joined: 7 May 17 Posts: 10 Credit: 6,952,848 RAC: 0 |
I'm getting some download errors for tasks that had a timeout on the first host they were sent to. So the files disappeared from the server after one week. I am seeing this too. All my download failures are tasks from WUs where an earlier task failed with "Timed out - no response". |
Send message Joined: 7 Aug 14 Posts: 27 Credit: 10,000,233 RAC: 2,828 |
As well as these "Not started by deadline - canceled" |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Ok, I found the problem. Our system for submitting tasks to BOINC had a timeout of 7 days on running tasks, after which it would cancel them and remove the input files. The BOINC deadline for ATLAS tasks is supposed to be 7 days but I see that they are actually timed out after 8 days. So this means that any task timing out would have the input files removed and hence the problem with downloads for retries. I've fixed this by increasing the timeout to 9 days on the ATLAS systems. When a WU times out it goes back to unsent and this timeout resets, so the problem should disappear. |
Send message Joined: 28 Sep 04 Posts: 728 Credit: 48,810,609 RAC: 22,134 |
Ok, I found the problem. Our system for submitting tasks to BOINC had a timeout of 7 days on running tasks, after which it would cancel them and remove the input files. The BOINC deadline for ATLAS tasks is supposed to be 7 days but I see that they are actually timed out after 8 days. So this means that any task timing out would have the input files removed and hence the problem with downloads for retries. I've fixed this by increasing the timeout to 9 days on the ATLAS systems. When a WU times out it goes back to unsent and this timeout resets, so the problem should disappear. That's good news. Could this explain the sudden spike in the number of running jobs in the graphs? I hope so. |
Send message Joined: 7 May 17 Posts: 10 Credit: 6,952,848 RAC: 0 |
Harri Liljeroos wrote: Could this explain the sudden spike in the number of running jobs in the graphs? I hope so. Do you mean an increase in tasks in progress? That is due to a Formula Boinc sprint. |
Send message Joined: 28 Sep 04 Posts: 728 Credit: 48,810,609 RAC: 22,134 |
Harri Liljeroos wrote: Could this explain the sudden spike in the number of running jobs in the graphs? I hope so. OK, that's exactly what I meant. |
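
For anyone who wants to see why the 7-day versus 8-day mismatch described above broke retries, here is a minimal sketch of the timing logic. It is not the actual ATLAS or BOINC server code: the function name `retry_can_download`, its parameters, and the start date are all made up for illustration. It only assumes what the thread states, namely that the submission system removes the input files once its own timeout expires and that BOINC re-issues a task once its effective deadline passes.

```python
from datetime import datetime, timedelta


def retry_can_download(sent_at, cleanup_timeout, boinc_timeout):
    """Return True if a retry issued after a BOINC timeout can still fetch its input files.

    Hypothetical model: the input files are removed at sent_at + cleanup_timeout,
    and the retry is issued at sent_at + boinc_timeout.
    """
    files_removed_at = sent_at + cleanup_timeout
    retry_issued_at = sent_at + boinc_timeout
    # The retry only works if it is issued before the files are removed.
    return retry_issued_at < files_removed_at


sent = datetime(2019, 1, 1)  # arbitrary example date, not taken from the thread

# Old setting: files removed after 7 days, BOINC actually times out after 8 days.
old_ok = retry_can_download(sent, timedelta(days=7), timedelta(days=8))

# New setting: cleanup timeout raised to 9 days, BOINC timeout still 8 days.
new_ok = retry_can_download(sent, timedelta(days=9), timedelta(days=8))

print(old_ok, new_ok)  # False True
```

With the old values the files are already gone by the time the retry goes out, which matches the download errors reported on WUs whose first task timed out. With the 9-day timeout the retry is issued first, and according to the post above the cleanup timer then resets when the WU goes back to unsent.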