New specific file upload error

Author	Message
greg_be Joined: 28 Dec 08 Posts: 346 Credit: 5,377,297 RAC: 6,435	Message 33618 - Posted: 2 Jan 2018, 0:53:17 UTC 1/2/2018 1:23:14 AM | LHC@home | [error] Error reported by file upload server: [rf0MDm29VprnSu7Ccp2YYBZmABFKDmABFKDm6aFKDmABFKDm1AEbGn_0_r1764236154_ATLAS_result] locked by file_upload_handler PID=-1 What's this all about?

ritterm Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0	Message 33619 - Posted: 2 Jan 2018, 1:00:43 UTC - in response to Message 33618. Last modified: 2 Jan 2018, 1:00:58 UTC "What's this all about?" There are several mentions of this kind of error in the "Uploads of finished tasks not possible since last night" thread. Of particular interest might be Message 33420.

Erich56 Joined: 18 Dec 15 Posts: 1907 Credit: 144,328,434 RAC: 73,938	Message 33620 - Posted: 2 Jan 2018, 5:56:57 UTC - in response to Message 33619. "Of particular interest might be Message 33420." Well, repeating part of the above-mentioned Message 33420 from David Cameron:
- the upload server is overloaded, so many uploads fail leaving a half-complete file
- the retries fail because the half-complete file is still there (the "locked by file_upload_handler PID=-1" error)
- our cleaning of incomplete files runs only once per day so there is no possibility of retries succeeding until one day has passed
- the server getting full last night was yet another problem but this is now fixed
I have changed the cleaning to run once every 6 hours and delete files older than 6 hours to make it more aggressive. But if you have a failed upload you'll still have to wait some time before it will work, so clicking retry every few minutes won't help.
As much as I understood all these problems when there were almost 25.000 tasks in the mills at that time, I am wondering why these upload problems still exist (I also have got several new ones last night) now, when there are not more than roughly 10.000 tasks being crunched. So I guess that some other problem must be involved.
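Editor's sketch: the cleanup policy described in the quoted Message 33420 (run every 6 hours, delete files older than 6 hours) might look roughly like the following. This is not the project's actual script. The function names, the `dry_run` flag, and the assumption that the directory passed in holds only half-complete uploads are all hypothetical and chosen for illustration.

```python
import time
from pathlib import Path

# 6 hours, the age threshold mentioned in the quoted message.
MAX_AGE_SECONDS = 6 * 60 * 60


def find_stale_files(upload_dir, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Return files under upload_dir last modified more than max_age_seconds ago.

    ASSUMPTION: upload_dir contains only half-complete uploads. A real server
    would need its own way of telling partial files from finished results.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(upload_dir).rglob("*"):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            stale.append(path)
    return sorted(stale)


def clean_stale_files(upload_dir, max_age_seconds=MAX_AGE_SECONDS, now=None, dry_run=True):
    """Delete stale files (or only list them when dry_run is True)."""
    removed = []
    for path in find_stale_files(upload_dir, max_age_seconds, now):
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed
```

Under this policy a failed upload can stay blocked for up to roughly 12 hours in the worst case (just under 6 hours old at one run, removed at the next), which is consistent with the advice that retrying every few minutes will not help.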

AuxRx Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 33622 - Posted: 2 Jan 2018, 11:13:21 UTC - in response to Message 33620. "As much as I understood all these problems when there were almost 25.000 tasks in the mills at that time, I am wondering why these upload problems still exist (I also have got several new ones last night) now, when there are not more than roughly 10.000 tasks being crunched." Because the bottleneck has not been removed. Sixtrack has lots of work queued atm and the projects share the same file server afaik. The official statement was that the issue will be fixed in mid January.

Harri Liljeroos Joined: 28 Sep 04 Posts: 780 Credit: 59,659,148 RAC: 44,940	Message 33623 - Posted: 2 Jan 2018, 11:20:37 UTC - in response to Message 33622. "As much as I understood all these problems when there were almost 25.000 tasks in the mills at that time, I am wondering why these upload problems still exist (I also have got several new ones last night) now, when there are not more than roughly 10.000 tasks being crunched." "Because the bottleneck has not been removed. Sixtrack has lots of work queued atm and the projects share the same file server afaik. The official statement was that the issue will be fixed in mid January." And nothing has been done to ease up the fileserver load from sixtrack tasks. The Ready To Send queue is once again increasing, reaching now about 1.3 million tasks. When a newly created task is added to the queue (like a resend because a task was not returned by deadline), it takes about 11 days to crawl thru the queue before reaching a new host. See one here: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82154077
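Editor's note: the two figures in the post above (about 1.3 million queued tasks, about 11 days to reach a host) imply a rough dispatch rate. The sketch below does that arithmetic, assuming a first-in-first-out queue draining at a constant rate. Neither assumption is confirmed in the thread, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def implied_dispatch_rate(queue_length, days_to_drain):
    """Return (tasks per day, tasks per second) for a FIFO queue.

    ASSUMPTION: a new task waits until everything ahead of it has been sent,
    and the server sends tasks at a constant rate.
    """
    per_day = queue_length / days_to_drain
    return per_day, per_day / SECONDS_PER_DAY


per_day, per_second = implied_dispatch_rate(1_300_000, 11)
print(f"{per_day:,.0f} tasks/day, {per_second:.2f} tasks/s")
# prints "118,182 tasks/day, 1.37 tasks/s"
```

So, under those assumptions, the scheduler was handing out a little over one Sixtrack task per second while the queue kept growing.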

Erich56 Joined: 18 Dec 15 Posts: 1907 Credit: 144,328,434 RAC: 73,938	Message 33625 - Posted: 2 Jan 2018, 11:39:16 UTC - in response to Message 33623. Last modified: 2 Jan 2018, 11:39:27 UTC "And nothing has been done to ease up the fileserver load from sixtrack tasks. The Ready To Send queue is once again increasing, reaching now about 1.3 million tasks. When a newly created task is added to the queue (like a resend because a task was not returned by deadline), it takes about 11 days to crawl thru the queue before reaching a new host. See one here: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82154077" This really doesn't seem to make a whole lot of sense :-(

greg_be Joined: 28 Dec 08 Posts: 346 Credit: 5,377,297 RAC: 6,435	Message 33637 - Posted: 2 Jan 2018, 23:11:18 UTC - in response to Message 33623. New work generation has nothing to do with file upload, though that is interesting. But it also leads to the question of why Sixtrack was unthrottled and allowed to mess with uploads of other tasks. That I am getting PID errors well after the problem was discovered says to me the strategy employed is getting overpowered yet again. Time to get some serious hardware or a larger drive or whatever to handle the increased demand. I guess the error does not matter as far as results go, since it appears the task is not lost when I disconnect the local client for the night and shut down my system.

[AF>Amis des Lapins] Phil1966 Joined: 23 Apr 10 Posts: 5 Credit: 1,425,831 RAC: 6,586	Message 33665 - Posted: 5 Jan 2018, 4:56:11 UTC Same problem here. Will stop ATLAS again. Wanted to run it as main 2018 project, but will wait until issues are "really" fixed.

Erich56 Joined: 18 Dec 15 Posts: 1907 Credit: 144,328,434 RAC: 73,938	Message 35188 - Posted: 8 May 2018, 6:07:56 UTC The "locked by file_upload_handler PID=55833" problem seems to be back :-( I've had it several times during the past days.

LHC@home