Uploads of finished tasks not possible since last night

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 33481 - Posted: 23 Dec 2017, 15:20:05 UTC - in response to Message 33464. ... Last retry went to 100% but still failed with transient HTTP error... This is the same sort of problem we experienced last week. So maybe the root cause of the problem is back :-( ID: 33481

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 33485 - Posted: 23 Dec 2017, 18:16:06 UTC - in response to Message 33481. I had one that had been stuck since yesterday, but it just uploaded successfully after a manual retry. I am setting no new ATLAS work until next year. ID: 33485

PHILIPPE Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0	Message 33486 - Posted: 23 Dec 2017, 18:20:52 UTC - in response to Message 33481. You are probably right, Erich. The solution found (deleting partial uploads with a script every 6 hours) is temporary, until the new file system for the NFS server is in use. But maybe there is another way to bridge the time until that update. Processes and daemons inside the BOINC server run with different priorities. Under heavy load, partial uploads occur when the upload handler stops an upload because another process with a higher or equal priority is running, creating a conflict that disturbs the upload and finally stops it before its normal end. (I am not talking about ISP failures or client computer crashes, which are external causes.) To mitigate the problem, it might be worth giving the highest priority to the upload handler (to produce fewer partial files), a higher priority to the deleter than to the transitioner, which is the most CPU-intensive (to clean up and free more space), and a lower priority to the feeder and, why not, also to the scheduler. Under heavy load, priority has to be given to the output streams from the clients and not to the input ones, so the BOINC server would suffer less in the long term. I can't say whether this is possible, how feasible it is, or whether the result would be better; it is just an idea. A better setting of these priorities could give the server a more comfortable operating range under permanent overload. The downside is that clients would get fewer work units while the server is busy, but each client would finish its upload within the deadline more reliably. This is another way to think about it (more or less efficient, I don't know; it certainly depends on circumstances). Having different options for this particular situation could provide more tools to fix the issue... ID: 33486

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1449 Credit: 9,730,720 RAC: 411	Message 33529 - Posted: 27 Dec 2017, 14:55:42 UTC Same issue again, but not because the server disk is full. Meanwhile there have been 6 upload retries of 132 MB, each loading up to 100%:
LHC@home 5PMNDmzJJornDDn7oo6G73TpABFKDmABFKDmSWJKDmABFKDmPiD4km_0_r672436546_ATLAS_result Progress 76.347% Size 135214,91 K Speed 1485,77 Kbps Uploading
and then:
LHC@home 27 Dec 15:49:01 Temporarily failed upload of 5PMNDmzJJornDDn7oo6G73TpABFKDmABFKDmSWJKDmABFKDmPiD4km_0_r672436546_ATLAS_result: transient upload error
ID: 33529

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 33531 - Posted: 27 Dec 2017, 15:44:50 UTC This is the message BOINC gives me when I (re)try to upload finished ATLAS tasks: 27/12/2017 16:40:57 | LHC@home | [error] Error reported by file upload server: Server is out of disk space ID: 33531

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 33535 - Posted: 27 Dec 2017, 17:43:40 UTC - in response to Message 33531. And now I got the following error message: 27/12/2017 18:38:52 | LHC@home | [error] Error reported by file upload server: [eo6KDm3Q2nrnSu7Ccp2YYBZmABFKDmABFKDmWWIKDmABFKDmIIGYKo_0_r2063197282_ATLAS_result] locked by file_upload_handler PID=-1 Seems like the server can't decide what its problem is :-) ID: 33535

obele Joined: 27 Aug 17 Posts: 1 Credit: 165,534 RAC: 47	Message 33536 - Posted: 27 Dec 2017, 19:50:04 UTC I think I have the same problem. Several times my ATLAS run tried to upload, got to 100% ... and restarted. On a manual start I saw a slow start at 350kbs and then an immediate jump to 20% done. It did reach 100% (161MB) in the end, but that also ended in a restart in n hours. That seems strange, in my opinion. I think it's not a network problem, but more a matter of the server accepting and acknowledging the completed task. Best regards ID: 33536

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1449 Credit: 9,730,720 RAC: 411	Message 33537 - Posted: 27 Dec 2017, 20:30:24 UTC After several more retries (not manual; I let BOINC do what it should), the upload succeeded. Before it succeeded I also got the message: LHC@home 27 Dec 17:04:55 [error] Error reported by file upload server: Server is out of disk space ID: 33537

Nils Volunteer moderator Project administrator Project developer Project tester Joined: 15 Jul 05 Posts: 250 Credit: 5,974,599 RAC: 0	Message 33538 - Posted: 27 Dec 2017, 20:39:31 UTC - in response to Message 33537. Our storage space for uploads has been increased, but as there are many tasks queued, there might be temporary issues again. Sorry for this, and thanks for your contributions! ID: 33538

greg_be Joined: 28 Dec 08 Posts: 341 Credit: 5,153,707 RAC: 132	Message 33539 - Posted: 28 Dec 2017, 1:06:20 UTC - in response to Message 33538. Still jammed up. I got an upload to 100%, then it stalled and restarted, and now it can't upload. Shutting down for the night; I'll see what changes in 7 hrs. ID: 33539

nairb Joined: 1 May 07 Posts: 29 Credit: 2,437,543 RAC: 69	Message 33546 - Posted: 28 Dec 2017, 11:44:01 UTC Seems to be stuck again.. 28/12/2017 11:42:11 | LHC@home | [error] Error reported by file upload server: [0ZbMDmof9nrnDDn7oo6G73TpABFKDmABFKDmxLFKDmABFKDmtodCCn_0_r1908771280_ATLAS_result] locked by file_upload_handler PID=-1 ID: 33546

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 33611 - Posted: 1 Jan 2018, 7:49:07 UTC - in response to Message 33546. "Seems to be stuck again.. 28/12/2017 11:42:11 | LHC@home | [error] Error reported by file upload server: [0ZbMDmof9nrnDDn7oo6G73TpABFKDmABFKDmxLFKDmABFKDmtodCCn_0_r1908771280_ATLAS_result] locked by file_upload_handler PID=-1"
Same thing here: a task which finished several days ago can't upload ("locked by file_upload_handler PID=-1"), while another task which finished last night was uploaded right away. About 2-3 weeks ago, when there were these big problems caused by too many ATLAS tasks in the mills (straining the infrastructure there too much), David Cameron put into effect a tool intended to clean up partial uploads every 6 hours; hence, I am surprised that now, with a considerably lower number of tasks in the mills (only about one third compared to before), there is still the "locked by file_upload_handler" problem. I am wondering if there is another problem now :-( ID: 33611
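
For readers wondering what a periodic cleanup of partial uploads could look like, here is a minimal sketch. It is not the tool the project actually runs, and nothing in this thread describes that tool's internals. It assumes a hypothetical directory that holds only in-progress upload files and treats any file untouched for more than 6 hours as an abandoned partial upload. The function names, the directory layout and the age-based rule are all assumptions made for illustration.

```python
import os
import time
from pathlib import Path


def find_stale_partials(upload_dir, max_age_hours=6.0, now=None):
    """List files under upload_dir that were last modified more than
    max_age_hours ago. ASSUMPTION: the directory contains only partial
    uploads, so age alone is enough to call a file abandoned."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for path in Path(upload_dir).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def delete_stale_partials(upload_dir, max_age_hours=6.0, now=None, dry_run=True):
    """Delete stale files and return the list of affected paths.
    With dry_run=True (the default) nothing is removed; the paths
    that would be deleted are only reported."""
    removed = []
    for path in find_stale_partials(upload_dir, max_age_hours, now):
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed
```

Such a function could be called from a scheduled job every 6 hours. On a real server the upload area may also hold complete results that are still waiting to be processed, so an age-only rule like this one would need an additional check before anything is deleted.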

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 33616 - Posted: 1 Jan 2018, 14:33:20 UTC - in response to Message 33611. "same thing here - a task which got finished several days ago can't upload: 'locked by file_upload_handler PID=-1' another task which got finished during last night was uploaded right away."
I would just like to report that this task was finally uploaded :-) ID: 33616

AuxRx Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 33624 - Posted: 2 Jan 2018, 11:24:58 UTC - in response to Message 33616. I cancelled two hung uploads yesterday-ish. Very short run time, not much lost. I'd like to think it helped you return your results. :) ID: 33624

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 33626 - Posted: 2 Jan 2018, 11:40:21 UTC - in response to Message 33624. "I'd like to think it helped you return your results. :)"
Haha, many thanks :-) ID: 33626

gyllic Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0	Message 33847 - Posted: 14 Jan 2018, 13:25:33 UTC Server problems again? Yesterday I had the "locked by file_upload_handler PID=-1" error (the results are uploaded by now) and today I have the "transient HTTP error":
14.01.2018 11:41:42 | LHC@home | Starting task AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0
14.01.2018 14:13:45 | LHC@home | Computation for task AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0 finished
14.01.2018 14:13:48 | LHC@home | Started upload of AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0_r1909380871_ATLAS_result
14.01.2018 14:15:04 | | Project communication failed: attempting access to reference site
14.01.2018 14:15:04 | LHC@home | Temporarily failed upload of AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0_r1909380871_ATLAS_result: transient HTTP error
14.01.2018 14:15:04 | LHC@home | Backing off 00:02:34 on upload of AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0_r1909380871_ATLAS_result
14.01.2018 14:15:08 | | Internet access OK - project servers may be temporarily down.
ID: 33847

Michael H.W. Weber Joined: 18 Sep 04 Posts: 30 Credit: 5,100,929 RAC: 0	Message 33877 - Posted: 16 Jan 2018, 9:29:49 UTC Last modified: 16 Jan 2018, 9:46:40 UTC I have an ATLAS task that has not been uploading for many, many days:
16.01.2018 08:55:48 | LHC@home | Started upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result
16.01.2018 09:00:55 | LHC@home | Temporarily failed upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result: transient HTTP error
16.01.2018 09:00:55 | LHC@home | Backing off 04:15:14 on upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result
16.01.2018 10:12:25 | LHC@home | Started upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result
16.01.2018 10:12:47 | LHC@home | Temporarily failed upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result: connect() failed
16.01.2018 10:12:47 | LHC@home | Backing off 03:54:55 on upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result
Strangely, the ATLAS task data listed in my account is not consistent with the data displayed in my client: while the task date is identical, the download and due dates differ. The task was neither delivered by the server on 15 January (but many days earlier), nor does it have to be completed by 23 January (but by the 22nd). Do you have a database problem? Michael. ID: 33877
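
Event-log excerpts like the ones posted in this thread are easier to compare once the failures and back-off intervals are pulled out. The following is a minimal sketch under stated assumptions: it expects lines shaped like the ones pasted here (timestamp, project, message, separated by pipes, optionally backslash-escaped), and only the two date layouts seen in this thread. The function names are hypothetical and this is not part of BOINC.

```python
import re
from datetime import datetime, timedelta

# Field separator: a pipe, optionally preceded by a backslash, with
# surrounding whitespace (covers both "|" and "\|" as pasted in forums).
FIELD_SEP = re.compile(r"\s*\\?\|\s*")
BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")
FAILED = re.compile(r"Temporarily failed upload of (\S+): (.+)")
# ASSUMPTION: only these two timestamp layouts occur.
STAMP_FORMATS = ("%d.%m.%Y %H:%M:%S", "%d/%m/%Y %H:%M:%S")


def parse_stamp(stamp):
    """Return a datetime for a known timestamp layout, else None."""
    for fmt in STAMP_FORMATS:
        try:
            return datetime.strptime(stamp, fmt)
        except ValueError:
            continue
    return None


def summarize_uploads(lines):
    """Collect upload failures and back-off events from log lines.

    Returns tuples of either
      ("failed", time, file_name, reason) or
      ("backoff", time, file_name, earliest_automatic_retry_time).
    Lines that do not match the expected shape are skipped."""
    events = []
    for line in lines:
        parts = FIELD_SEP.split(line.strip(), maxsplit=2)
        if len(parts) != 3:
            continue
        when = parse_stamp(parts[0])
        if when is None:
            continue
        text = parts[2]
        match = FAILED.search(text)
        if match:
            events.append(("failed", when, match.group(1), match.group(2)))
            continue
        match = BACKOFF.search(text)
        if match:
            hours, minutes, seconds = (int(x) for x in match.group(1, 2, 3))
            delay = timedelta(hours=hours, minutes=minutes, seconds=seconds)
            events.append(("backoff", when, match.group(4), when + delay))
    return events
```

Feeding it the lines of a saved event log (for example `summarize_uploads(open("log.txt"))`, with a file name of your choosing) gives one tuple per failure or back-off, which makes it simple to count retries per result file.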

Harri Liljeroos Joined: 28 Sep 04 Posts: 763 Credit: 56,478,114 RAC: 30,240	Message 33879 - Posted: 16 Jan 2018, 10:29:43 UTC - in response to Message 33877. The due dates (deadlines) differ by one day for LHC tasks. BOINC Manager shows a due date that is one day earlier than the one on the server. I have never seen an explanation why, but it has been like this for years. ID: 33879

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 34015 - Posted: 21 Jan 2018, 15:01:35 UTC For quite a while now, ATLAS uploads have been failing with "server out of disk space". ID: 34015

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,378,724 RAC: 45,927	Message 34020 - Posted: 22 Jan 2018, 6:27:26 UTC - in response to Message 34015. Last modified: 22 Jan 2018, 6:30:20 UTC The error notices seem to change from time to time: since last night, it always says "locked by upload handler" and "transient upload error", the same as what we had most of the time from mid-December on. Meanwhile, the number of "unsent" ATLAS tasks on the Project Status Page is "0", which is the best they can do anyway. I think it does not make any sense to send out ATLAS tasks for crunching as long as all these severe file transfer (and other) problems persist. ID: 34020
