Message boards : ATLAS application : Result upload failure
Joined: 22 Oct 07 Posts: 27 Credit: 808,821 RAC: 0
I have a continuing failure to upload a result from the ATLAS simulation. It ran for 32 hours, so I'd really like to get credit for it. I suspect part of the problem is that the result file is 175 MB, larger than other result files I've seen. The upload goes well until about 40% complete, then slows down. Somewhere between 50% and 66% the upload ends and restarts after a backoff. Meanwhile, uploads for other tasks are going through without problem. Here is a log extract for the upload in question:

    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16384 bytes
    2017-12-13 14:55:01 | | [http_xfer] [ID#123] HTTP: wrote 16207 bytes
    2017-12-13 14:56:38 | LHC@home | [http] HTTP error: Failure when receiving data from the peer
    2017-12-13 14:56:39 | LHC@home | [file_xfer] http op done; retval -184 (transient HTTP error)
    2017-12-13 14:56:39 | LHC@home | [file_xfer] file transfer status -184 (transient HTTP error)
    2017-12-13 14:56:39 | LHC@home | Temporarily failed upload of mUsKDmvKVirnSu7Ccp2YYBZmABFKDmABFKDmtYMKDmABFKDmh4sYxm_0_r276664903_ATLAS_result: transient HTTP error
    2017-12-13 14:56:39 | LHC@home | Backing off 04:20:48 on upload of mUsKDmvKVirnSu7Ccp2YYBZmABFKDmABFKDmtYMKDmABFKDmh4sYxm_0_r276664903_ATLAS_result
    2017-12-13 14:56:40 | | Project communication failed: attempting access to reference site
    2017-12-13 14:56:40 | | [http] HTTP_OP::init_get(): https://www.google.com/

The access to google was successful. If you need any other information, please let me know.
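The "Backing off 04:20:48" line reflects the BOINC client's behaviour on a transient transfer error: rather than retrying immediately, it waits for a delay that grows with each consecutive failure. As a rough illustration only (the real BOINC client has its own constants and policy; the names and numbers below are assumptions for the sketch), a retry loop with exponential backoff and jitter might look like this:

```python
import random

def backoff_seconds(failures, base=60, cap=4 * 3600):
    """Illustrative exponential backoff with jitter, capped at `cap` seconds.
    Not BOINC's actual algorithm; base/cap are made-up example values."""
    delay = min(cap, base * (2 ** failures))
    # Randomize so many clients don't retry in lockstep after a server outage
    return delay * random.uniform(0.5, 1.0)

def upload_with_retries(try_upload, max_failures=10):
    """Call try_upload() until it succeeds or the failure budget is spent.
    try_upload returns True on success, False on a transient error."""
    for failures in range(max_failures):
        if try_upload():
            return True
        wait = backoff_seconds(failures)
        # A real client would sleep here; we just report the plan.
        print(f"transient error; backing off {wait:.0f} s")
    return False
```

The jitter is why the observed backoff intervals in the client look irregular: spreading retries out in time avoids a thundering herd when an overloaded upload server comes back.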
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0
One of the file servers crashed overnight; look here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4536
Joined: 22 Oct 07 Posts: 27 Credit: 808,821 RAC: 0
Thanks for the note.
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
There are more postings about this problem in the ATLAS thread. Although the problem has now existed for 30 hours, it is still unsolved :-( I have numerous finished tasks waiting for upload, and I hope the upload will finally work before these tasks' deadlines. It would be a shame if dozens of hours of computation time were for nothing :-(
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0
My own ATLAS tasks backed off a couple of times and uploaded successfully after midnight. We have added another file server and have 3 upload servers in a load-balanced configuration. We are working on improvements to our setup to avoid this kind of timeout.
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
"My own ATLAS tasks backed off a couple of times and uploaded successfully after midnight."

What I notice is the change in behaviour of the tasks that are still waiting to be uploaded. Whereas yesterday the upload was progressing, then stopped at 100%, was reset, and later on began again from scratch, today the following happens: I click on "retry now", the status column shows "Upload active" for a few seconds to half a minute (while progress stays at 0.00%), and then it jumps back to "Upload retry in ..." (usually several hours). Any explanation for this? Are these tasks definitely "dead", and should I delete them (which I would hate to do with these huge tasks that ran for some 19 hours on 2 cores)? The particular problem with them now is the retry intervals of 5 hours and above. This would mean that I would actually have to sit there and initiate each upload manually by clicking "retry now".
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0
Please retry again later, or let the BOINC client retry later by itself. We are working on the file server setup now to improve performance. We also try to identify bad hosts that hammer our upload server and make things worse. Please be patient, and sorry for this.
Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0
"We also try to identify bad hosts that hammer our upload server and make things worse."

I.e., you confirm that manually hitting "Retry Now" can result in a penalty? Please make sure not to blacklist eager volunteers. My system has been trying to get rid of tasks for "literally" days now.
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0
No. Hitting retry now is not a problem. (I was referring to some badly configured hosts, not those that crunch correctly.) And we have more file server capacity now, so uploads should work.
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
"Please retry again later, or let the BOINC client retry later by itself."

Nils, I'd like to be patient - however, the deadline of the tasks in question is Dec. 16 (i.e. in 2 days). So if I let BOINC retry with its intervals of more than 5 hours, I am afraid these tasks won't make it in time, and dozens of hours of crunching time (2-core) will have been for nothing. How can I prevent this?
Joined: 18 Dec 15 Posts: 1688 Credit: 103,436,542 RAC: 117,286
"No. Hitting retry now is not a problem. (I was referring to some badly configured hosts, not those that crunch correctly.)"

Since your posting above, I have now clicked on "retry now" many times - but these tasks always behave the same way: under "Status" it changes to "Upload active" for a few seconds, and then it reverts back to "Upload: retry in ..." (mostly 5 hours or more). From what I noticed, 2 newer tasks were uploaded this morning. They most probably got started after the server problem came up yesterday. The "problematic" tasks, which don't want to upload, were started before yesterday's server problem. Could this fact play any role?
Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0
There is more recent information in the ATLAS sub-forum; they are investigating further issues with old WUs. Let's close this thread and continue at https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4536
©2024 CERN