Message boards : ATLAS application : Result upload failure
Joined: 22 Oct 07 Posts: 27 Credit: 808,821 RAC: 0
I have a continuing failure to upload a result from the ATLAS simulation. It ran for 32 hours, so I'd really like to get credit for it. I suspect part of the problem is that the result file is 175 MB, larger than other result files I've seen. The upload goes well until about 40% complete, then slows down. Somewhere between 50% and 66% the upload ends and restarts after a backoff. Meanwhile, uploads for other tasks are going through without problem. Here is a log extract for the upload in question:

    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16207 bytes
    2017-12-13 14:56:38 | LHC@home | [http] HTTP error: Failure when receiving data from the peer
    2017-12-13 14:56:39 | LHC@home | [file_xfer] http op done; retval -184 (transient HTTP error)
    2017-12-13 14:56:39 | LHC@home | [file_xfer] file transfer status -184 (transient HTTP error)
    2017-12-13 14:56:39 | LHC@home | Temporarily failed upload of mUsKDmvKVirnSu7Ccp2YYBZmABFKDmABFKDmtYMKDmABFKDmh4sYxm_0_r276664903_ATLAS_result: transient HTTP error
    2017-12-13 14:56:39 | LHC@home | Backing off 04:20:48 on upload of mUsKDmvKVirnSu7Ccp2YYBZmABFKDmABFKDmtYMKDmABFKDmh4sYxm_0_r276664903_ATLAS_result
    2017-12-13 14:56:40 | | Project communication failed: attempting access to reference site
    2017-12-13 14:56:40 | | [http] HTTP_OP::init_get(): https://www.google.com/

The access to google was successful. If you need any other information, please let me know.
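The "Backing off 04:20:48" line reflects the BOINC client's behaviour on a transient transfer error: rather than retrying immediately, it waits for a delay that grows with each consecutive failure. As a rough illustration only (the real BOINC client has its own constants and policy; the names and numbers below are assumptions for the sketch), a retry loop with exponential backoff and jitter might look like this:

```python
import random

def backoff_seconds(failures, base=60, cap=4 * 3600):
    """Illustrative exponential backoff with jitter, capped at `cap` seconds.
    Not BOINC's actual algorithm; base/cap are made-up example values."""
    delay = min(cap, base * (2 ** failures))
    # Randomize so many clients don't retry in lockstep after a server outage
    return delay * random.uniform(0.5, 1.0)

def upload_with_retries(try_upload, max_failures=10):
    """Call try_upload() until it succeeds or the failure budget is spent.
    try_upload returns True on success, False on a transient error."""
    for failures in range(max_failures):
        if try_upload():
            return True
        wait = backoff_seconds(failures)
        # A real client would sleep here; we just report the plan.
        print(f"transient error; backing off {wait:.0f} s")
    return False
```

The jitter is why the observed backoff intervals in the client look irregular: spreading retries out in time avoids a thundering herd when an overloaded upload server comes back.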
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0
One of the file servers crashed overnight; look here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4536
Joined: 22 Oct 07 Posts: 27 Credit: 808,821 RAC: 0
Thanks for the note.
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
There are more postings about this problem in the ATLAS thread. Although the problem has now existed for 30 hours, it is still unsolved :-( I have numerous finished tasks waiting for upload, and I hope the upload will finally work before these tasks' deadlines. It would be a shame if dozens of hours of computation time were for nothing :-(
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0
My own ATLAS tasks backed off a couple of times and uploaded successfully after midnight. We have added another file server and have 3 upload servers in a load-balanced configuration. We are working on improvements to our setup to avoid this kind of timeout.
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
"My own ATLAS tasks backed off a couple of times and uploaded successfully after midnight."

What I notice is the change in behaviour of the tasks that are still waiting to be uploaded. Whereas yesterday the upload was progressing, then stopped at 100%, was reset, and later on began again from scratch, today the following happens: I click on "retry now", the status column shows "Upload active" for a few seconds to half a minute (while progress stays at 0.00%), and then it jumps back to "Upload retry in ..." (usually several hours). Any explanation for this? Are these tasks definitely "dead", and should I delete them (which I would hate to do with these huge tasks that ran for some 19 hours on 2 cores)? The particular problem with them now is the retry intervals of 5 hours and above. This would mean that I would actually have to sit there and initiate each upload manually by clicking "retry now".
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0
Please retry again later, or let the BOINC client retry later by itself. We are working on the file server setup now to improve performance. We also try to identify bad hosts that hammer our upload server and make things worse. Please be patient, and sorry for this.
Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0
"We also try to identify bad hosts that hammer our upload server and make things worse."

I.e., you confirm that manually hitting "Retry Now" can result in a penalty? Please make sure not to blacklist eager volunteers. My system has been trying to get rid of tasks for "literally" days now.
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0
No. Hitting retry now is not a problem. (I was referring to some badly configured hosts, not those that crunch correctly.) And we have more file server capacity now, so uploads should work.
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
"Please retry again later, or let the BOINC client retry later by itself."

Nils, I'd like to be patient - however, the deadline of the tasks in question is Dec. 16 (i.e. in 2 days). So if I let BOINC retry with its intervals of more than 5 hours, I am afraid these tasks won't make it in time, and dozens of hours of crunching time (2-core) will have been for nothing. How can I prevent this?
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
"No. Hitting retry now is not a problem. (I was referring to some badly configured hosts, not those that crunch correctly.)"

Since your posting above, I have now clicked on "retry now" many times - but these tasks always behave the same way: under "Status" it changes to "Upload active" for a few seconds, and then it reverts back to "Upload: retry in ..." (mostly 5 hours or more). From what I noticed, 2 newer tasks were uploaded this morning. They most probably got started after the server problem came up yesterday. The "problematic" tasks, which don't want to upload, were started before yesterday's server problem. Could this fact play any role?
Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0
There is more recent information in the ATLAS sub-forum; they are investigating further issues with old WUs. Let's close this thread and continue at https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4536
©2024 CERN