Message boards : ATLAS application : New specific file upload error
greg_be

Joined: 28 Dec 08
Posts: 76
Credit: 701,116
RAC: 1,177
Message 33618 - Posted: 2 Jan 2018, 0:53:17 UTC

1/2/2018 1:23:14 AM | LHC@home | [error] Error reported by file upload server: [rf0MDm29VprnSu7Ccp2YYBZmABFKDmABFKDm6aFKDmABFKDm1AEbGn_0_r1764236154_ATLAS_result] locked by file_upload_handler PID=-1


What's this all about?
ritterm
Joined: 30 May 08
Posts: 88
Credit: 3,840,700
RAC: 6
Message 33619 - Posted: 2 Jan 2018, 1:00:43 UTC - in response to Message 33618.  
Last modified: 2 Jan 2018, 1:00:58 UTC

What's this all about?

There are several mentions of this kind of error in the "Uploads of finished tasks not possible since last night" thread. Of particular interest might be Message 33420.
Erich56

Joined: 18 Dec 15
Posts: 870
Credit: 6,479,875
RAC: 10,574
Message 33620 - Posted: 2 Jan 2018, 5:56:57 UTC - in response to Message 33619.  

Of particular interest might be Message 33420.
Well, to repeat part of the above-mentioned Message 33420 from David Cameron:

- the upload server is overloaded, so many uploads fail leaving a half-complete file
- the retries fail because the half-complete file is still there (the "locked by file_upload_handler PID=-1" error)
- our cleaning of incomplete files runs only once per day so there is no possibility of retries succeeding until one day has passed
- the server getting full last night was yet another problem but this is now fixed

I have changed the cleaning to run once every 6 hours and delete files older than 6 hours to make it more aggressive. But if you have a failed upload you'll still have to wait some time before it will work, so clicking retry every few minutes won't help.
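
To make the quoted cleanup rule concrete, here is a minimal sketch of "delete files older than 6 hours". It is not the project's actual script, which is not shown in this thread. The function names and the directory layout are hypothetical, and it assumes that every regular file in the given directory with an old modification time counts as a stale upload; the real criteria for "incomplete" may differ.

```python
import time
from pathlib import Path

# "delete files older than 6 hours" from the quoted message
MAX_AGE_SECONDS = 6 * 60 * 60


def find_stale_files(upload_dir, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Return regular files in upload_dir last modified more than max_age_seconds ago.

    Assumption: an old modification time is enough to call a file stale.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(upload_dir).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            stale.append(path)
    return sorted(stale)


def clean_stale_files(upload_dir, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Delete the stale files and return the list of paths that were removed."""
    removed = find_stale_files(upload_dir, max_age_seconds, now)
    for path in removed:
        path.unlink()
    return removed
```

Run every 6 hours with a 6-hour age limit, a rule like this can leave a failed upload in place for up to roughly 12 hours in the worst case, which fits the advice that retrying every few minutes won't help.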

As much as I understood all these problems when there were almost 25.000 tasks in the mills at that time, I am wondering why these upload problems still exist (I also have got several new ones last night) now, when there are not more than roughly 10.000 tasks being crunched.

So I guess that some other problem must be involved.
AuxRx

Joined: 16 Sep 17
Posts: 86
Credit: 936,144
RAC: 7,297
Message 33622 - Posted: 2 Jan 2018, 11:13:21 UTC - in response to Message 33620.  

As much as I understood all these problems when there were almost 25.000 tasks in the mills at that time, I am wondering why these upload problems still exist (I also have got several new ones last night) now, when there are not more than roughly 10.000 tasks being crunched.


Because the bottleneck has not been removed. Sixtrack has lots of work queued atm and the projects share the same file server afaik.

The official statement was that the issue will be fixed in mid January.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 300
Credit: 10,740,955
RAC: 15,315
Message 33623 - Posted: 2 Jan 2018, 11:20:37 UTC - in response to Message 33622.  

As much as I understood all these problems when there were almost 25.000 tasks in the mills at that time, I am wondering why these upload problems still exist (I also have got several new ones last night) now, when there are not more than roughly 10.000 tasks being crunched.


Because the bottleneck has not been removed. Sixtrack has lots of work queued atm and the projects share the same file server afaik.

The official statement was that the issue will be fixed in mid January.

And nothing has been done to ease the fileserver load from SixTrack tasks. The Ready To Send queue is once again increasing and has now reached about 1.3 million tasks. A newly created task added to the queue (such as a resend because a task was not returned by its deadline) takes about 11 days to crawl through the queue before reaching a new host. See one here: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82154077
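
As a rough back-of-the-envelope check on those two figures: if the queue is served strictly first-in, first-out at a constant rate (both are assumptions, not something confirmed in this thread), a queue of about 1.3 million tasks that takes about 11 days to traverse implies the dispatch rate computed below. The function name is made up for illustration.

```python
def implied_dispatch_rate(queue_length, wait_days):
    """Tasks sent per day implied by a FIFO queue of queue_length tasks
    that a new entry takes wait_days to traverse (constant rate assumed)."""
    return queue_length / wait_days


rate_per_day = implied_dispatch_rate(1_300_000, 11)
print(f"{rate_per_day:,.0f} tasks/day")           # 118,182 tasks/day
print(f"{rate_per_day / 24:,.0f} tasks/hour")     # 4,924 tasks/hour
print(f"{rate_per_day / 86_400:.2f} tasks/second")  # 1.37 tasks/second
```

So under those assumptions the server is handing out a bit more than one task per second, which gives a feel for the load being discussed here.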
Erich56

Joined: 18 Dec 15
Posts: 870
Credit: 6,479,875
RAC: 10,574
Message 33625 - Posted: 2 Jan 2018, 11:39:16 UTC - in response to Message 33623.  
Last modified: 2 Jan 2018, 11:39:27 UTC

And nothing has been done to ease the fileserver load from SixTrack tasks. The Ready To Send queue is once again increasing and has now reached about 1.3 million tasks. A newly created task added to the queue (such as a resend because a task was not returned by its deadline) takes about 11 days to crawl through the queue before reaching a new host. See one here: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82154077
this really doesn't seem to make a whole lot of sense :-(
greg_be

Joined: 28 Dec 08
Posts: 76
Credit: 701,116
RAC: 1,177
Message 33637 - Posted: 2 Jan 2018, 23:11:18 UTC - in response to Message 33623.  

New work generation has nothing to do with file upload, though that is interesting. It also raises the question of why SixTrack was left unthrottled and allowed to interfere with uploads of other tasks.

That I am getting PID errors well after the problem was discovered tells me the strategy employed is being overwhelmed yet again.

Time to get some serious hardware, a larger drive, or whatever it takes to handle the increased demand.

I guess the error does not matter as far as results go, since it appears the task is not lost when I disconnect the local client for the night and shut down my system.
[AF>Amis des Lapins] Phil1966
Joined: 23 Apr 10
Posts: 5
Credit: 1,319,284
RAC: 0
Message 33665 - Posted: 5 Jan 2018, 4:56:11 UTC

Same problem here. Will stop ATLAS again. I wanted to run it as my main 2018 project, but will wait until the issues are "really" fixed.
Erich56

Joined: 18 Dec 15
Posts: 870
Credit: 6,479,875
RAC: 10,574
Message 35188 - Posted: 8 May 2018, 6:07:56 UTC

the "locked by file_upload_handler PID=55833" problem seems to be back :-(
I've had it several times during the past days.