Download failures

Author	Message
David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 37547 - Posted: 5 Dec 2018, 15:06:13 UTC - in response to Message 37544. I think I know what's causing this. Recently many ATLAS WUs were cancelled on the ATLAS side, so even though the input files were removed, it looks like the WUs were not properly cancelled on the BOINC server. I will check why this is not working properly and fix it.

maeax Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837	Message 37551 - Posted: 5 Dec 2018, 18:33:14 UTC The server status shows the number of available new tasks dropping from 3k to 600 within a few hours (as of 17:50 UTC)!

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 37554 - Posted: 5 Dec 2018, 18:58:23 UTC On my side, of the 4 files that are downloaded for each WU, all download OK except the big one (200 MB or more).

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2413 Credit: 226,576,741 RAC: 130,913	Message 37555 - Posted: 6 Dec 2018, 7:20:32 UTC This morning one of my clients got stuck with the following message: Do 06 Dez 2018 06:39:44 CET | LHC@home | Temporarily failed download of jf_25987fd8f8e73d45af1bb48d3c3e33a9: transient HTTP error A filename without any extension seems unusual for an ATLAS task. Reference: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=104421083 I had to cancel the download manually. Other volunteers should be aware that the same WU is still active and could cause a download lock.

maeax Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837	Message 37556 - Posted: 6 Dec 2018, 7:33:15 UTC - in response to Message 37555. +1, also cancelled. All later tasks are downloading normally.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2413 Credit: 226,576,741 RAC: 130,913	Message 37557 - Posted: 6 Dec 2018, 7:51:54 UTC - in response to Message 37555. @David Cameron It could be a good idea to query the server DB for those malformed filenames and cancel the WUs before they are sent out again and again. Thanks in advance.
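The filename check suggested above could be sketched roughly as follows. This is only an illustration: the pattern (`jf_` followed by 32 hexadecimal characters and no extension) is a guess based on the single filename reported in this thread, the function names are hypothetical, and the list of names is assumed to come from whatever query the server admins would actually run.

```python
import re

# Assumption: suspect names look like the one example from this thread,
# i.e. "jf_" plus 32 hex characters and no file extension.
SUSPECT_NAME = re.compile(r"jf_[0-9a-f]{32}")


def is_suspect_input_name(name: str) -> bool:
    """Return True if the whole name matches the assumed suspect pattern."""
    return SUSPECT_NAME.fullmatch(name) is not None


def suspect_names(names):
    """Keep only the names that match the assumed suspect pattern."""
    return [n for n in names if is_suspect_input_name(n)]


if __name__ == "__main__":
    candidates = [
        "jf_25987fd8f8e73d45af1bb48d3c3e33a9",
        "EVNT.16926746._000684.pool.root.1",
    ]
    for name in suspect_names(candidates):
        print("would review WU using input file:", name)
```

A real cleanup would still need a human to confirm that the matching WUs are actually broken before cancelling them.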

maeax Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837	Message 37559 - Posted: 6 Dec 2018, 20:06:54 UTC Since 18:50 UTC, some download errors are back!

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 37560 - Posted: 6 Dec 2018, 21:11:42 UTC - in response to Message 37559. We got another burst of cancelled tasks - I'm cancelling them on the BOINC server now.

maeax Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837	Message 38484 - Posted: 29 Mar 2019, 9:49:48 UTC - in response to Message 37560. David, our ATLAS stats still show download errors from June and July 2018 ("WorkUnit not found"), for example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=199995337 Before those months come around again in 2019, would it be a lot of work to remove them from the database? Thank you.

maeax Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837	Message 38664 - Posted: 29 Apr 2019, 15:38:34 UTC Last modified: 29 Apr 2019, 15:39:57 UTC Today there are occasional download errors, on both Windows and native Linux: WU download error: couldn't get input files: <file_xfer_error> <file_name>515KDmBJBfunlyackoJh5iwnABFKDmABFKDmnSSXDmABFKDmwBaSwm_EVNT.16926746._000684.pool.root.1</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error>
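For anyone collecting these errors from task output, the `<file_xfer_error>` block quoted above can be pulled apart with a few lines of Python. This is a minimal sketch using only the standard library; the function name and the returned dictionary layout are made up for illustration, and it assumes the fragment is well-formed XML exactly as pasted in the post.

```python
import re
import xml.etree.ElementTree as ET


def parse_file_xfer_error(fragment: str) -> dict:
    """Extract file name, numeric error code and message from one fragment."""
    root = ET.fromstring(fragment)  # root element is <file_xfer_error>
    raw_code = root.findtext("error_code", default="").strip()
    # The code is written like "-224 (permanent HTTP error)"; keep the number.
    match = re.match(r"-?\d+", raw_code)
    return {
        "file_name": root.findtext("file_name", default="").strip(),
        "error_code": int(match.group()) if match else None,
        "error_message": root.findtext("error_message", default="").strip(),
    }


if __name__ == "__main__":
    sample = (
        "<file_xfer_error> "
        "<file_name>515KDmBJBfunlyackoJh5iwnABFKDmABFKDmnSSXDmABFKDmwBaSwm"
        "_EVNT.16926746._000684.pool.root.1</file_name> "
        "<error_code>-224 (permanent HTTP error)</error_code> "
        "<error_message>permanent HTTP error</error_message> "
        "</file_xfer_error>"
    )
    print(parse_file_xfer_error(sample))
```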

PDW Joined: 7 Aug 14 Posts: 14 Credit: 7,512,660 RAC: 1,574	Message 38665 - Posted: 29 Apr 2019, 18:39:59 UTC Seeing a lot of these download errors today. As a side note, it says the max number of errors is 3, but it always seems to send out a 4th attempt, which also fails. Shouldn't it stop and bug out when it hits 3 errors? Example here: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=111108234

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 38668 - Posted: 29 Apr 2019, 21:10:37 UTC - in response to Message 38665. Last modified: 29 Apr 2019, 21:25:19 UTC PDW wrote: "Seeing a lot of these download errors today." I just attached a machine again to get native ATLAS. The first one gave a download error, and then the next four were OK. https://lhcathome.cern.ch/lhcathome/result.php?resultid=221919833 The failed WU errored out on all four machines that got it. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=111105143 EDIT: At around the same time, I also got four Native Theory failures, but with a different error message. I don't know whether there is any relationship. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10594192

maeax Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837	Message 39704 - Posted: 23 Aug 2019, 14:34:59 UTC - in response to Message 31784. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=122718096

Harri Liljeroos Joined: 28 Sep 04 Posts: 675 Credit: 43,667,377 RAC: 15,861	Message 40015 - Posted: 24 Sep 2019, 9:41:16 UTC I'm getting some download errors for tasks that had a timeout on the first host they were sent to. So the files disappeared from the server after one week.

xii5ku Joined: 7 May 17 Posts: 10 Credit: 6,952,848 RAC: 0	Message 40030 - Posted: 26 Sep 2019, 20:36:24 UTC - in response to Message 40015. Harri Liljeroos wrote: "I'm getting some download errors for tasks that had a timeout on the first host they were sent to. So the files disappeared from the server after one week." I am seeing this too. All my download failures are tasks of WUs of which an earlier task failed with "Timed out - no response".

PDW Joined: 7 Aug 14 Posts: 14 Credit: 7,512,660 RAC: 1,574	Message 40033 - Posted: 27 Sep 2019, 8:58:39 UTC - in response to Message 40030. As well as these "Not started by deadline - canceled"

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 40035 - Posted: 27 Sep 2019, 10:12:06 UTC - in response to Message 40033. Ok, I found the problem. Our system for submitting tasks to BOINC had a timeout of 7 days on running tasks, after which it would cancel them and remove the input files. The BOINC deadline for ATLAS tasks is supposed to be 7 days but I see that they are actually timed out after 8 days. So this means that any task timing out would have the input files removed and hence the problem with downloads for retries. I've fixed this by increasing the timeout to 9 days on the ATLAS systems. When a WU times out it goes back to unsent and this timeout resets, so the problem should disappear.
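The timing mismatch described above can be written down as a tiny check. This is a deliberately simplified model, not the actual ATLAS or BOINC logic: the function name and parameters are hypothetical, and it only compares the two timeouts mentioned in the post (7 days before the fix, 9 days after, against an effective BOINC timeout of 8 days).

```python
from datetime import timedelta


def input_files_survive_retry(submit_timeout: timedelta,
                              boinc_timeout: timedelta) -> bool:
    """Return True if input files still exist when BOINC issues a retry.

    Simplified assumption: the submission system removes the input files
    once submit_timeout has passed, and BOINC only creates a retry once
    boinc_timeout has passed. The retry can download its files only if
    the files outlive the BOINC timeout.
    """
    return submit_timeout > boinc_timeout


if __name__ == "__main__":
    boinc_effective = timedelta(days=8)  # observed timeout per the post
    for days in (7, 9):                  # before and after the fix
        ok = input_files_survive_retry(timedelta(days=days), boinc_effective)
        print(f"submission timeout {days} days -> retry can download: {ok}")
```

With the old 7-day value the files are gone a day before the retry is created, which matches the download errors reported for timed-out tasks; with 9 days the files are still present.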

Harri Liljeroos Joined: 28 Sep 04 Posts: 675 Credit: 43,667,377 RAC: 15,861	Message 40036 - Posted: 27 Sep 2019, 12:25:15 UTC - in response to Message 40035. David Cameron wrote: "Ok, I found the problem. Our system for submitting tasks to BOINC had a timeout of 7 days on running tasks, after which it would cancel them and remove the input files. The BOINC deadline for ATLAS tasks is supposed to be 7 days but I see that they are actually timed out after 8 days. So this means that any task timing out would have the input files removed and hence the problem with downloads for retries. I've fixed this by increasing the timeout to 9 days on the ATLAS systems. When a WU times out it goes back to unsent and this timeout resets, so the problem should disappear." That's good news. Could this explain the sudden spike in the number of running jobs in the graphs? I hope so.

xii5ku Joined: 7 May 17 Posts: 10 Credit: 6,952,848 RAC: 0	Message 40043 - Posted: 29 Sep 2019, 11:28:21 UTC - in response to Message 40036. Harri Liljeroos wrote: "Could this explain the sudden spike in the number of running jobs in the graphs? I hope so." Do you mean an increase of tasks in progress? This is due to a Formula Boinc sprint.

Harri Liljeroos Joined: 28 Sep 04 Posts: 675 Credit: 43,667,377 RAC: 15,861	Message 40051 - Posted: 30 Sep 2019, 12:19:15 UTC - in response to Message 40043. xii5ku wrote: "Do you mean an increase of tasks in progress? This is due to a Formula Boinc sprint." OK, that's exactly what I meant.
