Message boards : ATLAS application : Uploads of finished tasks not possible since last night
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 33481 - Posted: 23 Dec 2017, 15:20:05 UTC - in response to Message 33464.  

... Last retry went to 100% but still failed with transient HTTP error...
This is the same sort of problem we experienced last week, so maybe the root cause is back :-(
ID: 33481
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 33485 - Posted: 23 Dec 2017, 18:16:06 UTC - in response to Message 33481.  

I had one that had been stuck since yesterday, but it just uploaded successfully after a manual retry.
I am setting no new ATLAS work until next year.
ID: 33485
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 33486 - Posted: 23 Dec 2017, 18:20:52 UTC - in response to Message 33481.  

You are probably right, Erich.
The solution found (deleting partial uploads with a script every 6 hours) is temporary, until the new file system is in use on the NFS server.
But "maybe" there is another way to bridge the time until this update.
Processes and daemons inside the BOINC server have different priorities for their execution.
Under heavy load, partial uploads occur when the upload handler stops an upload because another process with a higher or equal priority is running. This creates a conflict that disturbs the upload and finally stops it before its normal end.
(I am not talking about ISP failures or client computer crashes, which are external causes.)
"Maybe", to mitigate the problem, it would be worth giving
    the highest priority to the upload handler (in order to produce fewer partial files),
    a higher priority to the deleter than to the transitioner (the most CPU-intensive), in order to clean up and free more space,
    and a lower priority to the feeder and, why not, also to the scheduler.


Under heavy load, priority has to be given to the output streams from the clients, not to the input ones, so the BOINC server would suffer less in the long term.
I can't say whether it's possible, how feasible it is, or whether the result would be better; this is just an idea.
A better setting of this parameter could give the server a more comfortable operating range under permanent overload. The downside is that clients would get fewer work units while the server is busy, but each client would finish its upload within the deadline more reliably.
This is another way of thinking (more or less efficient, I don't know; it certainly depends on the circumstances).
Having different options for this particular situation could provide more tools to fix the issue...
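
To make the priority idea above concrete, here is a minimal Python sketch. It only builds a plan of which server process would get which nice value; it changes nothing on the system. The nice values are made up for illustration, and whether reprioritising would actually reduce partial uploads is untested speculation, as the post itself says. The scheduler is left out because the post only mentions it as an afterthought.

```python
# Hypothetical nice values: on Linux a lower number means higher priority.
# The ordering follows the post above; the numbers themselves are assumptions.
NICE_BY_DAEMON = {
    "file_upload_handler": -5,  # highest priority: finish uploads first
    "file_deleter": 0,          # ahead of the transitioner, to free disk space
    "transitioner": 5,
    "feeder": 10,               # hands out new work, so it can wait
}


def plan_renice(processes, nice_by_name=NICE_BY_DAEMON):
    """Return (pid, nice) pairs for every process whose name is in the table.

    `processes` is an iterable of (pid, name) tuples, e.g. parsed from `ps`.
    Nothing is changed on the system; the caller decides whether to apply it.
    """
    plan = []
    for pid, name in processes:
        if name in nice_by_name:
            plan.append((pid, nice_by_name[name]))
    # Most favoured process first, so the plan reads in priority order.
    return sorted(plan, key=lambda item: item[1])
```

On a Unix host each pair could be applied with `os.setpriority(os.PRIO_PROCESS, pid, nice)`; negative values normally require root. As far as I know the upload handler runs as a CGI program started per request by the web server, so in practice its priority would have to be set where it is launched rather than on a long-lived process.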

ID: 33486
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1418
Credit: 9,470,586
RAC: 3,147
Message 33529 - Posted: 27 Dec 2017, 14:55:42 UTC

Same issue again, but not because the server disk is full. Meanwhile, 6 upload retries of 132 MB each went up to 100%:

LHC@home 5PMNDmzJJornDDn7oo6G73TpABFKDmABFKDmSWJKDmABFKDmPiD4km_0_r672436546_ATLAS_result Progress 76.347% Size 135214,91 K Speed 1485,77 Kbps Uploading

and then:

LHC@home 27 Dec 15:49:01 Temporarily failed upload of 5PMNDmzJJornDDn7oo6G73TpABFKDmABFKDmSWJKDmABFKDmPiD4km_0_r672436546_ATLAS_result: transient upload error
ID: 33529
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 33531 - Posted: 27 Dec 2017, 15:44:50 UTC

this is the message BOINC gives me when I (re)try to upload finished ATLAS tasks:

27/12/2017 16:40:57 | LHC@home | [error] Error reported by file upload server: Server is out of disk space
ID: 33531
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 33535 - Posted: 27 Dec 2017, 17:43:40 UTC - in response to Message 33531.  

and now got the following error message:

27/12/2017 18:38:52 | LHC@home | [error] Error reported by file upload server: [eo6KDm3Q2nrnSu7Ccp2YYBZmABFKDmABFKDmWWIKDmABFKDmIIGYKo_0_r2063197282_ATLAS_result] locked by file_upload_handler PID=-1

Seems like the server can't decide what its problem is :-)
ID: 33535
obele

Joined: 27 Aug 17
Posts: 1
Credit: 156,031
RAC: 0
Message 33536 - Posted: 27 Dec 2017, 19:50:04 UTC

I think I have the same problem. Several times my ATLAS task tried to upload, got to 100% ... and restarted.
On a manual start I saw a slow start at 350kbs and then an immediate jump to 20% done. It did reach 100% (161 MB) in the end, but it also ended in a restart scheduled in n hours.
That sounds strange in my opinion. I think it's not a network problem but more a matter of accepting and acknowledging the completed task.
Best regards
ID: 33536
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1418
Credit: 9,470,586
RAC: 3,147
Message 33537 - Posted: 27 Dec 2017, 20:30:24 UTC

After several more retries (not manual; I just let BOINC do what it should), the upload succeeded.

Before it succeeded, I also got the message: LHC@home 27 Dec 17:04:55 [error] Error reported by file upload server: Server is out of disk space
ID: 33537
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 248
Credit: 5,974,599
RAC: 0
Message 33538 - Posted: 27 Dec 2017, 20:39:31 UTC - in response to Message 33537.  

Our storage space for uploads has been increased, but as there are many tasks queued, there might be temporary issues again. Sorry for this, and thanks for your contributions!
ID: 33538
greg_be

Joined: 28 Dec 08
Posts: 339
Credit: 4,863,589
RAC: 282
Message 33539 - Posted: 28 Dec 2017, 1:06:20 UTC - in response to Message 33538.  

Still jammed up. I got an upload to 100%, then it stalled and restarted, and now it can't upload.
Shutting down for the night, see what changes in 7 hrs.
ID: 33539
nairb

Joined: 1 May 07
Posts: 27
Credit: 2,336,992
RAC: 1
Message 33546 - Posted: 28 Dec 2017, 11:44:01 UTC

Seems to be stuck again..

28/12/2017 11:42:11 | LHC@home | [error] Error reported by file upload server: [0ZbMDmof9nrnDDn7oo6G73TpABFKDmABFKDmxLFKDmABFKDmtodCCn_0_r1908771280_ATLAS_result] locked by file_upload_handler PID=-1
ID: 33546
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 33611 - Posted: 1 Jan 2018, 7:49:07 UTC - in response to Message 33546.  

Seems to be stuck again..

28/12/2017 11:42:11 | LHC@home | [error] Error reported by file upload server: [0ZbMDmof9nrnDDn7oo6G73TpABFKDmABFKDmxLFKDmABFKDmtodCCn_0_r1908771280_ATLAS_result] locked by file_upload_handler PID=-1

same thing here - a task which got finished several days ago can't upload: " locked by file_upload_handler PID=-1"
another task which got finished during last night was uploaded right away.

About 2-3 weeks ago, when there were these big problems caused by too many ATLAS tasks in the mills (which put too much strain on the infrastructure there), David Cameron put into effect a tool intended to clean up partial uploads every 6 hours. Hence, I am surprised that now, with a considerably lower number of tasks in the mills (only about one third compared to before), there is still the "locked by file_upload_handler" problem.

I am wondering if there is another problem now :-(
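
For readers wondering what a periodic cleanup of partial uploads might look like, here is a minimal Python sketch. It is not the tool David Cameron deployed; that script is not shown anywhere in this thread. It assumes, purely for illustration, that a file in the upload directory counts as an abandoned partial upload once it has not been modified for six hours. A real server would need a safer test, since a complete upload that is still waiting to be processed would match as well.

```python
import time
from pathlib import Path

SIX_HOURS = 6 * 60 * 60  # matches the cleanup interval mentioned in the thread


def find_stale_files(upload_dir, max_age_seconds=SIX_HOURS, now=None):
    """List regular files under upload_dir not modified for max_age_seconds.

    Assumption: an old modification time is used as a stand-in for
    "abandoned partial upload"; the real criterion is unknown.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(upload_dir).rglob("*"):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            stale.append(path)
    return sorted(stale)


def delete_stale_files(upload_dir, max_age_seconds=SIX_HOURS, dry_run=True):
    """Delete stale files; with dry_run=True only report what would go."""
    stale = find_stale_files(upload_dir, max_age_seconds)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

The dry run is the default on purpose, so a first run only lists candidates. Run from cron every six hours, something like this would free space, but it would not by itself explain the "locked by file_upload_handler" message.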
ID: 33611
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 33616 - Posted: 1 Jan 2018, 14:33:20 UTC - in response to Message 33611.  

same thing here - a task which got finished several days ago can't upload: " locked by file_upload_handler PID=-1"
another task which got finished during last night was uploaded right away.
I would just like to report that this task was finally uploaded :-)
ID: 33616
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33624 - Posted: 2 Jan 2018, 11:24:58 UTC - in response to Message 33616.  

I cancelled two hung uploads yesterday-ish. Very short run time, not much lost. I'd like to think it helped you return your results. :)
ID: 33624
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 33626 - Posted: 2 Jan 2018, 11:40:21 UTC - in response to Message 33624.  

I'd like to think it helped you return your results. :)
haha, many thanks :-)
ID: 33626
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 33847 - Posted: 14 Jan 2018, 13:25:33 UTC

Server problems again?
Yesterday I had the "locked by file_upload_handler PID=-1" error (the results are uploaded by now) and today I have the "transient HTTP error":
14.01.2018 11:41:42 | LHC@home | Starting task AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0
14.01.2018 14:13:45 | LHC@home | Computation for task AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0 finished
14.01.2018 14:13:48 | LHC@home | Started upload of AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0_r1909380871_ATLAS_result
14.01.2018 14:15:04 |  | Project communication failed: attempting access to reference site
14.01.2018 14:15:04 | LHC@home | Temporarily failed upload of AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0_r1909380871_ATLAS_result: transient HTTP error
14.01.2018 14:15:04 | LHC@home | Backing off 00:02:34 on upload of AIKLDm7iuurnDDn7oo6G73TpABFKDmABFKDmZPHKDmABFKDm6jOAWn_0_r1909380871_ATLAS_result
14.01.2018 14:15:08 |  | Internet access OK - project servers may be temporarily down.
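
When comparing logs like the one above across several hosts, it can help to pull the failure reasons and backoff times out automatically. The following Python sketch is an assumption-based helper, not part of BOINC: the two regular expressions are derived only from the event log lines quoted in this thread and may not match other client versions or languages.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted in this thread.
FAILED = re.compile(r"Temporarily failed upload of (?P<file>\S+): (?P<reason>.+)$")
BACKOFF = re.compile(
    r"Backing off (?P<h>\d+):(?P<m>\d\d):(?P<s>\d\d) on upload of (?P<file>\S+)"
)


def summarize(lines):
    """Return (failure reasons counted, last backoff in seconds per file)."""
    reasons = Counter()
    backoff_seconds = {}
    for line in lines:
        failed = FAILED.search(line)
        if failed:
            reasons[failed.group("reason").strip()] += 1
            continue
        backoff = BACKOFF.search(line)
        if backoff:
            h, m, s = (int(backoff.group(k)) for k in ("h", "m", "s"))
            backoff_seconds[backoff.group("file")] = h * 3600 + m * 60 + s
    return reasons, backoff_seconds
```

Feeding it the lines of a saved event log gives a quick count of how often each error ("transient HTTP error", "connect() failed", and so on) occurred and how long the client is currently waiting for each file.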
ID: 33847
Michael H.W. Weber

Joined: 18 Sep 04
Posts: 30
Credit: 5,100,929
RAC: 0
Message 33877 - Posted: 16 Jan 2018, 9:29:49 UTC
Last modified: 16 Jan 2018, 9:46:40 UTC

I have an ATLAS task that has not uploaded for many, many days:

16.01.2018 08:55:48 | LHC@home | Started upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result
16.01.2018 09:00:55 | LHC@home | Temporarily failed upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result: transient HTTP error
16.01.2018 09:00:55 | LHC@home | Backing off 04:15:14 on upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result

16.01.2018 10:12:25 | LHC@home | Started upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result
16.01.2018 10:12:47 | LHC@home | Temporarily failed upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result: connect() failed
16.01.2018 10:12:47 | LHC@home | Backing off 03:54:55 on upload of WqtNDme5DvrnDDn7oo6G73TpABFKDmABFKDm8sHKDmABFKDmT56S3n_0_r205511374_ATLAS_result
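
The "Backing off" intervals in these logs grow from minutes to hours after repeated failures, which is consistent with some form of randomised exponential backoff. The sketch below shows that general technique in Python. It is not the BOINC client's actual algorithm, which is not given in this thread; the base delay, the cap, and the jitter range are all assumptions chosen for illustration.

```python
import random


def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Seconds to wait before the next retry after `failures` failed attempts.

    The ceiling doubles with every failure until it reaches `cap`; the actual
    delay is drawn at random from the upper half of that ceiling.
    All parameter values are hypothetical.
    """
    if failures < 1:
        raise ValueError("failures must be >= 1")
    exponent = min(failures - 1, 32)  # keeps the power small for huge counts
    ceiling = min(cap, base * 2 ** exponent)
    # Randomise within [ceiling / 2, ceiling] so clients do not retry in lockstep.
    return rng.uniform(ceiling / 2, ceiling)
```

The random spread matters on an overloaded server: if thousands of clients all retried at exactly the same moments, each retry wave would hit the upload handler at once.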

Strangely, the ATLAS task data listed in my account is not consistent with the data displayed in my client: while the task date is identical, the download and due dates differ. The server did not deliver the task on 15 January (it arrived many days earlier), and it is not due on 23 January (but on the 22nd).
Do you have a database problem?

Michael.
ID: 33877
Harri Liljeroos

Joined: 28 Sep 04
Posts: 728
Credit: 49,050,975
RAC: 27,146
Message 33879 - Posted: 16 Jan 2018, 10:29:43 UTC - in response to Message 33877.  

The due dates (deadlines) differ by one day for LHC tasks. BOINC Manager shows a due date one day earlier than the server does. I have never seen an explanation why, but it has been like this for years.
ID: 33879
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 34015 - Posted: 21 Jan 2018, 15:01:35 UTC

For quite a while now, ATLAS uploads fail with "server out of disk space".
ID: 34015
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 34020 - Posted: 22 Jan 2018, 6:27:26 UTC - in response to Message 34015.  
Last modified: 22 Jan 2018, 6:30:20 UTC

The error notices seem to change from time to time: since last night, it always says "locked by upload handler" and "transient upload error", the same as we had most of the time from mid-December on.

Meanwhile, the number of "unsent" ATLAS tasks on the Project Status Page is "0", which is the best they can do anyway.
I think it does not make any sense to send out ATLAS tasks for crunching as long as all these severe file transfer (and other) problems persist.
ID: 34020


©2024 CERN