Message boards : ATLAS application : Download failures

David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 37547 - Posted: 5 Dec 2018, 15:06:13 UTC - in response to Message 37544.  

I think I know what's causing this. Recently many ATLAS WUs were cancelled on the ATLAS side, and although their input files were removed, it looks like the WUs were not properly cancelled on the BOINC server. I will check why this is not working properly and fix it.
ID: 37547
maeax

Joined: 2 May 07
Posts: 2240
Credit: 173,894,884
RAC: 3,757
Message 37551 - Posted: 5 Dec 2018, 18:33:14 UTC

The server status shows the number of available new tasks dropping from 3k to 600 within a few hours (as of 17:50 UTC)!
ID: 37551
HerveUAE

Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 37554 - Posted: 5 Dec 2018, 18:58:23 UTC

On my side, out of the 4 files that are downloaded for each WU, all download OK except the big one (200 MB or more).
We are the product of random evolution.
ID: 37554
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2528
Credit: 253,722,201
RAC: 62,755
Message 37555 - Posted: 6 Dec 2018, 7:20:32 UTC

This morning one of my clients got stuck with the following message:
Do 06 Dez 2018 06:39:44 CET | LHC@home | Temporarily failed download of jf_25987fd8f8e73d45af1bb48d3c3e33a9: transient HTTP error

A filename without any extension seems unusual for an ATLAS task.

Reference:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=104421083

I had to cancel the download manually.
Other volunteers should be aware that the same WU is still active and could cause a download lock.
ID: 37555
maeax

Joined: 2 May 07
Posts: 2240
Credit: 173,894,884
RAC: 3,757
Message 37556 - Posted: 6 Dec 2018, 7:33:15 UTC - in response to Message 37555.  

+1
Also cancelled. All later tasks are downloading normally.
ID: 37556
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2528
Credit: 253,722,201
RAC: 62,755
Message 37557 - Posted: 6 Dec 2018, 7:51:54 UTC - in response to Message 37555.  

@David Cameron

It could be a good idea to query the server DB for those malformed filenames and cancel the WUs before they are sent out again and again.

Thanks in advance.
ID: 37557
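
The filename check suggested above could be sketched as follows. This is only an illustration, not the project's actual tooling: it assumes the (WU id, input filename) pairs have already been exported from the server DB by some other means, and the pattern (`jf_` plus 32 hex digits, no extension) is inferred from the single filename quoted in this thread. The function name and the second sample row are hypothetical.

```python
import re

# Assumed pattern: "jf_" followed by exactly 32 hex digits and no extension,
# modelled on the one filename quoted in this thread.
BARE_JF_NAME = re.compile(r"^jf_[0-9a-f]{32}$")


def find_suspect_workunits(rows):
    """Return the sorted WU ids whose input filename matches the bare jf_ pattern.

    rows: iterable of (wuid, filename) pairs, e.g. exported from the server DB.
    """
    return sorted({wuid for wuid, filename in rows if BARE_JF_NAME.match(filename)})


rows = [
    (104421083, "jf_25987fd8f8e73d45af1bb48d3c3e33a9"),  # WU and file from this thread
    (1, "example_EVNT.pool.root.1"),                      # hypothetical well-formed name
]
print(find_suspect_workunits(rows))  # [104421083]
```

The resulting list of WU ids would then be handed to whatever cancellation procedure the server administrators normally use.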
maeax

Joined: 2 May 07
Posts: 2240
Credit: 173,894,884
RAC: 3,757
Message 37559 - Posted: 6 Dec 2018, 20:06:54 UTC

Since 18:50 UTC, some download errors are back!
ID: 37559
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 37560 - Posted: 6 Dec 2018, 21:11:42 UTC - in response to Message 37559.  

We got another burst of cancelled tasks - I'm cancelling them on the BOINC server now.
ID: 37560
maeax

Joined: 2 May 07
Posts: 2240
Credit: 173,894,884
RAC: 3,757
Message 38484 - Posted: 29 Mar 2019, 9:49:48 UTC - in response to Message 37560.  

David,
we still have download errors from June and July 2018 in our ATLAS stats
("WorkUnit not found"), for example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=199995337
Before those months come around again in 2019,
would it be a lot of work to remove them from the database?
Thank you.
ID: 38484
maeax

Joined: 2 May 07
Posts: 2240
Credit: 173,894,884
RAC: 3,757
Message 38664 - Posted: 29 Apr 2019, 15:38:34 UTC
Last modified: 29 Apr 2019, 15:39:57 UTC

Today there are occasional download errors, with both Windows and native Linux:
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>515KDmBJBfunlyackoJh5iwnABFKDmABFKDmnSSXDmABFKDmwBaSwm_EVNT.16926746._000684.pool.root.1</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
ID: 38664
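
For anyone collecting such errors from task output, here is a minimal Python sketch that pulls the file name and numeric error code out of a block like the one above. It assumes the block is available as a standalone, well-formed XML fragment; the function name `parse_file_xfer_error` is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def parse_file_xfer_error(fragment):
    """Extract file name, numeric error code and message from one <file_xfer_error> block."""
    root = ET.fromstring(fragment.strip())
    code_text = root.findtext("error_code", default="").strip()
    # The code is followed by a description, e.g. "-224 (permanent HTTP error)".
    match = re.match(r"-?\d+", code_text)
    return {
        "file_name": root.findtext("file_name", default="").strip(),
        "error_code": int(match.group()) if match else None,
        "error_message": root.findtext("error_message", default="").strip(),
    }


sample = """
<file_xfer_error>
<file_name>515KDmBJBfunlyackoJh5iwnABFKDmABFKDmnSSXDmABFKDmwBaSwm_EVNT.16926746._000684.pool.root.1</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
"""

info = parse_file_xfer_error(sample)
print(info["error_code"], info["error_message"])  # -224 permanent HTTP error
```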
PDW

Joined: 7 Aug 14
Posts: 27
Credit: 10,000,233
RAC: 2,828
Message 38665 - Posted: 29 Apr 2019, 18:39:59 UTC

Seeing a lot of these download errors today.

As a side note, it says the max number of errors is 3, but it always seems to send out a 4th attempt, which also fails.
Shouldn't it stop and bug out when it hits 3 errors?

Example here: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=111108234
ID: 38665
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38668 - Posted: 29 Apr 2019, 21:10:37 UTC - in response to Message 38665.  
Last modified: 29 Apr 2019, 21:25:19 UTC

PDW wrote:
Seeing a lot of these download errors today.

I just attached a machine again to get native ATLAS. The first one gave a download error, and then the next four were OK.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221919833

The failed WU errored out on all four machines that got it.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=111105143

EDIT: At around the same time, I got four Native Theory failures also, but with a different error message. I don't know if there is any relationship or not.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10594192
ID: 38668
maeax

Joined: 2 May 07
Posts: 2240
Credit: 173,894,884
RAC: 3,757
Message 39704 - Posted: 23 Aug 2019, 14:34:59 UTC - in response to Message 31784.  

ID: 39704
Harri Liljeroos

Joined: 28 Sep 04
Posts: 728
Credit: 48,810,609
RAC: 22,134
Message 40015 - Posted: 24 Sep 2019, 9:41:16 UTC

I'm getting some download errors for tasks that had a timeout on the first host they were sent to. So it seems the files disappeared from the server after one week.
ID: 40015
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 40030 - Posted: 26 Sep 2019, 20:36:24 UTC - in response to Message 40015.  

Harri Liljeroos wrote:
I'm getting some download errors for tasks that had a timeout on the first Host they were sent to. So files disappeared from server after one week.

I am seeing this too.
All my download failures are tasks from WUs in which an earlier task failed with "Timed out - no response".
ID: 40030
PDW

Joined: 7 Aug 14
Posts: 27
Credit: 10,000,233
RAC: 2,828
Message 40033 - Posted: 27 Sep 2019, 8:58:39 UTC - in response to Message 40030.  

The same goes for those marked "Not started by deadline - canceled".
ID: 40033
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 40035 - Posted: 27 Sep 2019, 10:12:06 UTC - in response to Message 40033.  

Ok, I found the problem. Our system for submitting tasks to BOINC had a timeout of 7 days on running tasks, after which it would cancel them and remove the input files. The BOINC deadline for ATLAS tasks is supposed to be 7 days but I see that they are actually timed out after 8 days. So this means that any task timing out would have the input files removed and hence the problem with downloads for retries. I've fixed this by increasing the timeout to 9 days on the ATLAS systems. When a WU times out it goes back to unsent and this timeout resets, so the problem should disappear.
ID: 40035
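
The timing mismatch described above can be illustrated with a small sketch. It is a simplified model that assumes both timers start when the task is sent; the function and parameter names are hypothetical, the send date is invented, and this is not the actual ATLAS or BOINC code.

```python
from datetime import datetime, timedelta


def inputs_available_for_retry(sent_at, cleanup_timeout_days, boinc_timeout_days):
    """Return True if the input files still exist when BOINC times the task out.

    cleanup_timeout_days: ATLAS-side timeout after which input files are removed.
    boinc_timeout_days:   time after which BOINC actually times out the task
                          and makes the WU available for a retry.
    """
    files_removed_at = sent_at + timedelta(days=cleanup_timeout_days)
    retry_issued_at = sent_at + timedelta(days=boinc_timeout_days)
    return retry_issued_at < files_removed_at


sent = datetime(2019, 9, 16)  # hypothetical send date
print(inputs_available_for_retry(sent, cleanup_timeout_days=7, boinc_timeout_days=8))  # False
print(inputs_available_for_retry(sent, cleanup_timeout_days=9, boinc_timeout_days=8))  # True
```

With the old 7-day cleanup timeout the retry is issued after the input files are gone, so its download fails; with 9 days the files outlive the 8-day BOINC timeout.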
Harri Liljeroos

Joined: 28 Sep 04
Posts: 728
Credit: 48,810,609
RAC: 22,134
Message 40036 - Posted: 27 Sep 2019, 12:25:15 UTC - in response to Message 40035.  

David Cameron wrote:
Ok, I found the problem. Our system for submitting tasks to BOINC had a timeout of 7 days on running tasks, after which it would cancel them and remove the input files. The BOINC deadline for ATLAS tasks is supposed to be 7 days but I see that they are actually timed out after 8 days. So this means that any task timing out would have the input files removed and hence the problem with downloads for retries. I've fixed this by increasing the timeout to 9 days on the ATLAS systems. When a WU times out it goes back to unsent and this timeout resets, so the problem should disappear.

That's good news. Could this explain the sudden spike in the number of running jobs in the graphs? I hope so.
ID: 40036
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 40043 - Posted: 29 Sep 2019, 11:28:21 UTC - in response to Message 40036.  

Harri Liljeroos wrote:
Could this explain the sudden spike on number of running jobs in the graphs? I hope so.

Do you mean an increase of tasks in progress?
This is due to a Formula Boinc sprint.
ID: 40043
Harri Liljeroos

Joined: 28 Sep 04
Posts: 728
Credit: 48,810,609
RAC: 22,134
Message 40051 - Posted: 30 Sep 2019, 12:19:15 UTC - in response to Message 40043.  

Harri Liljeroos wrote:
Could this explain the sudden spike on number of running jobs in the graphs? I hope so.

Do you mean an increase of tasks in progress?
This is due to a Formula Boinc sprint.

OK, that's exactly what I meant.
ID: 40051


©2024 CERN