Message boards : ATLAS application : Result upload failure

pls

Joined: 22 Oct 07
Posts: 27
Credit: 808,821
RAC: 0
Message 33321 - Posted: 13 Dec 2017, 22:10:04 UTC

I have a continuing failure to upload a result from the ATLAS simulation. It ran for 32 hours, so I'd really like to get credit for it. I suspect part of the problem is that the result file is 175 MB, larger than other result files I've seen. The upload goes well until about 40% complete, then slows down. Somewhere between 50% and 66% the upload ends and restarts after a backoff. Meanwhile, uploads for other tasks are going through without problems.

Here is a log extract for the upload in question:

2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16207 bytes
2017-12-13 14:56:38 | LHC@home | [http] HTTP error: Failure when receiving data from the peer
2017-12-13 14:56:39 | LHC@home | [file_xfer] http op done; retval -184 (transient HTTP error)
2017-12-13 14:56:39 | LHC@home | [file_xfer] file transfer status -184 (transient HTTP error)
2017-12-13 14:56:39 | LHC@home | Temporarily failed upload of mUsKDmvKVirnSu7Ccp2YYBZmABFKDmABFKDmtYMKDmABFKDmh4sYxm_0_r276664903_ATLAS_result: transient HTTP error
2017-12-13 14:56:39 | LHC@home | Backing off 04:20:48 on upload of mUsKDmvKVirnSu7Ccp2YYBZmABFKDmABFKDmtYMKDmABFKDmh4sYxm_0_r276664903_ATLAS_result
2017-12-13 14:56:40 | | Project communication failed: attempting access to reference site
2017-12-13 14:56:40 | | [http] HTTP_OP::init_get(): https://www.google.com/

The access to Google was successful.
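
For anyone who wants to pull the relevant facts out of a longer event log, here is a minimal Python sketch. It assumes the `timestamp | project | message` layout seen in the extract above; the regular expressions and the `summarize` helper are illustrative names made up for this example, not part of BOINC.

```python
import re
from datetime import timedelta

# Lines copied verbatim from the log extract above.
LOG = """\
2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16207 bytes
2017-12-13 14:56:38 | LHC@home | [http] HTTP error: Failure when receiving data from the peer
2017-12-13 14:56:39 | LHC@home | [file_xfer] http op done; retval -184 (transient HTTP error)
2017-12-13 14:56:39 | LHC@home | Backing off 04:20:48 on upload of mUsKDmvKVirnSu7Ccp2YYBZmABFKDmABFKDmtYMKDmABFKDmh4sYxm_0_r276664903_ATLAS_result
"""

# Patterns are assumptions based only on the message wording shown above.
BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")
RETVAL = re.compile(r"retval (-?\d+) \(([^)]*)\)")


def summarize(log_text):
    """Return (list of (code, description), {file name: back-off duration})."""
    errors, backoffs = [], {}
    for line in log_text.splitlines():
        # Split into timestamp, project and message; skip malformed lines.
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue
        message = parts[2]
        m = RETVAL.search(message)
        if m:
            errors.append((int(m.group(1)), m.group(2)))
        m = BACKOFF.search(message)
        if m:
            h, mi, s = (int(g) for g in m.group(1, 2, 3))
            backoffs[m.group(4)] = timedelta(hours=h, minutes=mi, seconds=s)
    return errors, backoffs


errors, backoffs = summarize(LOG)
for name, delay in backoffs.items():
    print(f"{name}: next automatic retry in {delay}")
```

The sketch only reads text that is already in the log; it does not talk to the client or the server.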

If you need any other information, please let me know.
ID: 33321
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 33322 - Posted: 13 Dec 2017, 22:31:05 UTC - in response to Message 33321.  

One of the file servers crashed overnight.

Look here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4536
ID: 33322
pls

Joined: 22 Oct 07
Posts: 27
Credit: 808,821
RAC: 0
Message 33325 - Posted: 14 Dec 2017, 3:49:34 UTC - in response to Message 33322.  

Thanks for the note.
ID: 33325
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,336,416
RAC: 102,116
Message 33328 - Posted: 14 Dec 2017, 6:37:31 UTC

There are more postings about this problem in the ATLAS-thread.

Although the problem has now existed for 30 hours, it is still unsolved :-(
I have numerous finished tasks waiting for upload, and hope that this will finally work before these tasks' deadline. It would be a shame if dozens of hours of computation time were for nothing :-(
ID: 33328
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 242
Credit: 5,800,306
RAC: 0
Message 33330 - Posted: 14 Dec 2017, 7:23:11 UTC - in response to Message 33321.  

My own ATLAS tasks backed off a couple of times and uploaded successfully after midnight.

We have added another file server and have 3 upload servers in a load-balanced configuration. We are working on improvements to our setup to avoid this kind of timeout.
ID: 33330
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,336,416
RAC: 102,116
Message 33331 - Posted: 14 Dec 2017, 8:22:19 UTC - in response to Message 33330.  

My own ATLAS tasks backed off a couple of times and uploaded successfully after midnight. ...
What I notice is a change in the behaviour of the tasks that are still waiting to be uploaded.
Whereas yesterday the upload was progressing, then stopped at 100%, then was reset and later on began again from scratch, today the following happens:

I click on "retry now", in the status column it shows "Upload active" for a time span of a few seconds to half a minute (while progress stays at 0.00%), and then it jumps back to "Upload retry in ... " (usually several hours).
Any explanation for this? Are these tasks definitely "dead", and should I delete them (which I would hate to do with these huge tasks that ran for some 19 hours on 2-core)?
The particular problem with them now is the retry intervals of 5 hours and above. This would mean that I would actually have to sit there and start the upload manually by pushing "retry now".
ID: 33331
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 242
Credit: 5,800,306
RAC: 0
Message 33333 - Posted: 14 Dec 2017, 8:48:10 UTC - in response to Message 33331.  

Please retry again later, or let the BOINC client retry later by itself.

We are working on the fileserver setup now to improve performance. We also try to identify bad hosts that hammer our upload server and make things worse.

Please be patient, and sorry for this.
ID: 33333
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33335 - Posted: 14 Dec 2017, 9:17:04 UTC - in response to Message 33333.  

We also try to identify bad hosts that hammer our upload server and make things worse.


I.e., are you confirming that manually hitting "Retry Now" can result in a penalty?

Please make sure not to blacklist eager volunteers. My system has been trying to get rid of tasks for "literally" days now.
ID: 33335
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 242
Credit: 5,800,306
RAC: 0
Message 33337 - Posted: 14 Dec 2017, 9:25:40 UTC - in response to Message 33335.  

No. Hitting retry now is not a problem. (I was referring to some badly configured hosts, not those that crunch correctly.)

And we have more file server capacity now, so upload should work.
ID: 33337
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,336,416
RAC: 102,116
Message 33338 - Posted: 14 Dec 2017, 9:30:58 UTC - in response to Message 33333.  

Please retry again later, or let the BOINC client retry later by itself.

We are working on the fileserver setup now to improve performance. We also try to identify bad hosts that hammer our upload server and make things worse.

Please be patient, and sorry for this.
Nils, I'd like to be patient. However, the deadline of the tasks in question is Dec. 16 (i.e. in 2 days). So if I let BOINC retry with its intervals of more than 5 hours, I am afraid these tasks won't make it in time, and dozens of hours of crunching time (2-core) will have been for nothing.
How can I prevent this from happening?
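
To put a rough number on that concern, here is a small Python sketch that counts how many automatic retries fit before the deadline. It rests on two loud assumptions: the deadline is taken as 16 Dec 2017 00:00 UTC (the exact time of day is not stated above), and the retry interval is treated as a fixed 5 hours, whereas the client's real back-off varies. `retries_before` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def retries_before(deadline, now, interval):
    """Count retries that fit between now and deadline at a fixed interval."""
    if deadline <= now:
        return 0
    # Dividing one timedelta by another with // yields a whole number.
    return (deadline - now) // interval


now = datetime(2017, 12, 14, 9, 30, 58)      # time of this post (UTC)
deadline = datetime(2017, 12, 16, 0, 0, 0)   # ASSUMED time of day
interval = timedelta(hours=5)                # ASSUMED fixed interval

print(retries_before(deadline, now, interval))
```

Under these assumptions the function returns 7, so only a handful of automatic attempts would remain before the deadline.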
ID: 33338
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,336,416
RAC: 102,116
Message 33342 - Posted: 14 Dec 2017, 11:02:31 UTC - in response to Message 33337.  

No. Hitting retry now is not a problem. (I was referring to some badly configured hosts, not those that crunch correctly.)

And we have more file server capacity now, so upload should work.

Since your posting above, I have clicked on "retry now" many times, but these tasks always show the same behaviour: under "status" it changes to "Upload active" for a few seconds, and then it reverts to "Upload: retry in ... (value mostly 5 hours+)".

From what I noticed, 2 newer tasks were uploaded this morning. They were most probably started after the server problem came up yesterday.
The "problematic" tasks that don't want to upload were started before yesterday's server problem. Could this fact play any role?
ID: 33342
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33343 - Posted: 14 Dec 2017, 11:08:04 UTC - in response to Message 33342.  
Last modified: 14 Dec 2017, 11:08:29 UTC

More recent information is in the ATLAS subforum. They are investigating further issues with old WUs.

Let's close this thread and continue at https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4536
ID: 33343



©2024 CERN