Message boards :
ATLAS application :
Uploads of finished tasks not possible since last night
Author | Message |
---|---|
Send message Joined: 6 Jul 17 Posts: 22 Credit: 29,430,354 RAC: 0 |
Uploads don't work correctly; they fail to finish. It looks like the server doesn't accept the old WUs that were downloaded before the server crash. In a few hours all 24 WUs on both Ryzen machines will be ready for upload; it won't be nice if I have to delete them to get new WUs. And most WUs are long-running ones, with 4.2 GB of upload size on every machine. |
Send message Joined: 2 May 07 Posts: 2193 Credit: 173,360,031 RAC: 50,408 |
"In some hours all 24 WU's on both Ryzen machines are ready for upload, won't be nice if i have to delete them to get new WU's." In the LHC preferences you can set the number of tasks to unlimited or to 8 as the maximum. That will help you download some more. The good thing is that there are 200 events, so the tasks run for a long time before they finish. CERN IT will find a solution for the uploads; we all hope so. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,690,480 RAC: 86,737 |
"It looks like the Server don't accept the old WU's which are downloaded before the Server Crash" Yes, seems so. The problem has now existed for 30 hours and is still unsolved. I have some 10 tasks waiting for upload; it would be a shame if they were lost - dozens of hours of computation time for nothing :-( "Cern-IT do find a solution to upload, we all hope so." It's too bad that their information policy vis-a-vis us crunchers is rather restrictive. The only information we have had so far was the posting from Nils yesterday morning. Nothing more since :-( |
Send message Joined: 6 Jul 17 Posts: 22 Credit: 29,430,354 RAC: 0 |
I think 24 tasks per BOINC instance is the maximum. Unlimited will result in a low number of tasks (I forget the value). The current tasks need about 7 hours to finish, so throughput will be ~17 tasks per 24 hours (5 active instances with 3 cores). I started a new BOINC instance this morning and got new tasks (1 download error). Let's see if the upload of the new tasks will finish (first results in about 7 hours). |
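The throughput figure in the post above can be checked with a little arithmetic. This is only a minimal sketch: it assumes each of the 5 instances runs one task at a time and that every task takes about 7 hours, and the function name `tasks_per_day` is made up for illustration.

```python
def tasks_per_day(instances: int, hours_per_task: float, period_hours: float = 24.0) -> float:
    """Tasks completed in one period if each instance runs one task at a time (assumption)."""
    return instances * period_hours / hours_per_task


# Figures taken from the post: 5 active instances, about 7 hours per task.
estimate = tasks_per_day(instances=5, hours_per_task=7.0)
print(f"{estimate:.1f} tasks per 24 hours")
```

Under those assumptions the result lands close to the ~17 tasks per 24 hours quoted in the post.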
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,690,480 RAC: 86,737 |
If one takes a look at this diagram https://lhcathome.cern.ch/ATLAS/atlas_job.php on the number of ATLAS jobs running, it's no surprise that the servers broke down. The number of ATLAS jobs jumped from about 8,000 on Dec. 12 to about 17,000 on Dec. 13. |
Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0 |
Those will mostly be old WU waiting to be uploaded. Not active compute jobs. As I mentioned earlier, once this number falls to around 10,000 you can assume the issue has been fixed. |
Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0 |
I was able to successfully upload 1 task now. 4 more are still stuck. |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
It seems like new tasks are able to upload but those that finished on Tuesday night when we had the broken server are still stuck. The admins are looking into it. |
Send message Joined: 6 Jul 17 Posts: 22 Credit: 29,430,354 RAC: 0 |
The new BOINC instance with new tasks can upload (9 tasks uploaded so far), but the older ones still have the problem. So it seems to be a database problem and not a performance problem of the server. Instead of the result being put in the database, the uploaded file goes to /dev/null. |
Send message Joined: 14 Jan 10 Posts: 1378 Credit: 9,162,540 RAC: 5,071 |
"It seems like new tasks are able to upload but those that finished on Tuesday night when we had the broken server are still stuck. The admins are looking into it." It looks like an attempt to upload a result occupies a slot on the server that is not freed when the upload fails. Maybe a retry therefore fails over and over until that slot is freed manually. |
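The hypothesis in the post above (a slot that stays occupied after a failed upload) can be illustrated with a toy lock-file model. This is a sketch of the general idea only, not the BOINC server's actual implementation; the helper names, the lock-file name, and the PIDs are all hypothetical.

```python
import os
import tempfile


def try_lock(lock_path: str, pid: int) -> bool:
    """Atomically create the lock file; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(pid))  # record which handler owns the slot
    return True


def release(lock_path: str) -> None:
    """Free the slot by deleting the lock file."""
    os.remove(lock_path)


with tempfile.TemporaryDirectory() as d:
    lock = os.path.join(d, "result_file.lock")  # hypothetical lock name
    first = try_lock(lock, pid=1234)   # upload starts; handler then dies without releasing
    retry = try_lock(lock, pid=5678)   # every retry is refused while the stale lock exists
    release(lock)                      # the "freed manually" step
    after_cleanup = try_lock(lock, pid=5678)
    print(first, retry, after_cleanup)
```

In this model the retry keeps failing for as long as the stale lock file exists, which matches the behaviour described in the post; whether the real server behaves this way is an assumption.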
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,690,480 RAC: 86,737 |
"... Maybe a retry therefore fails over and over until that slot is freed manually." OMG, how can this be accomplished (if at all)? |
Send message Joined: 2 May 07 Posts: 2193 Credit: 173,360,031 RAC: 50,408 |
This is a message from the upload server: 14.12.2017 13:07:41 | LHC@home | [error] Error reported by file upload server: [qeyLDmVScirnDDn7oo6G73TpABFKDmABFKDmGPHKDmABFKDmmplz6m_0_r1896860968_ATLAS_result] locked by file_upload_handler PID=-1 |
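For anyone who wants to count how many of their stuck results show this error, a minimal sketch of extracting the file name and PID from such a log line follows. The regular expression is written against the single line quoted above and is an assumption about the general message format; `parse_lock_error` is a hypothetical helper, not part of BOINC.

```python
import re

# Client log line quoted from the post above.
LINE = (
    "14.12.2017 13:07:41 | LHC@home | [error] Error reported by file upload server: "
    "[qeyLDmVScirnDDn7oo6G73TpABFKDmABFKDmGPHKDmABFKDmmplz6m_0_r1896860968_ATLAS_result] "
    "locked by file_upload_handler PID=-1"
)

# Capture the bracketed file name and the (possibly negative) PID.
LOCK_ERROR = re.compile(
    r"\[(?P<file>[^\]]+)\] locked by file_upload_handler PID=(?P<pid>-?\d+)"
)


def parse_lock_error(line: str):
    """Return (file_name, pid) for a 'locked by file_upload_handler' line, else None."""
    match = LOCK_ERROR.search(line)
    if match is None:
        return None
    return match.group("file"), int(match.group("pid"))


parsed = parse_lock_error(LINE)
print(parsed)
```

Lines that do not contain the lock message simply return `None`, so the helper can be run over a whole client log.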
Send message Joined: 2 May 07 Posts: 2193 Credit: 173,360,031 RAC: 50,408 |
"OMG, how can this be accomplished (if at all) ?" The first rule when you work in IT: stay CALM and then do the right thing. CERN IT is doing its best to help us with the uploads. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,690,480 RAC: 86,737 |
"The first rule when you work in IT --- CALM and than do the right." The main problem, though, is that many of these "older" tasks are approaching their deadline. So, if they cannot be uploaded soon, they're gone :-( |
Send message Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
I'm sure you are working on it ... or maybe not ??? Just to add my ten pennies: all WUs that I started a couple of hours ago on all my rigs, and that have finished successfully and want to upload, are now either in "project backoff" status or waiting for "retry". Hope someone is STILL bravely working on the problem ... |
Send message Joined: 9 Feb 16 Posts: 48 Credit: 537,111 RAC: 0 |
I have the same problem. However, what alerted me to there being a problem is that on BoincStats my LHC rank has dropped by a whopping 330 in one day. Sometimes I drop back a few places in any one day, but on average I creep forward each day. Also, I notice everyone around me in the ranking tables has dropped back about 300 places, as have large numbers of users well above my position in the table. I notice too that some users near the top of the table are listed as "new" with a very large number of points, but a tiny RAC and no activity over the past month. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,690,480 RAC: 86,737 |
"Hope someone is STILL bravely working on the problem ..." What the people at CERN should be aware of is that the longer it takes them to get the problem solved, the more finished tasks will reach their deadline, making them invalid. So let's hope that, with this in mind, the IT specialists will do their best to get the system working as soon as possible. |
Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0 |
"I have the same problem. However, what alerted me to there being a problem is that on BoincStats my LHC rank has dropped by a whopping 330 in one day. Sometimes I drop back a few places in any one day, but on average I creep forward each day. Also, I notice everyone around me in the ranking tables has dropped back about 300 places, as have large numbers of users well above my position in the table. I notice too that some users near the top of the table are listed as "new" with a very large number of points, but a tiny RAC and no activity over the past month." There was an issue with overzealous spam removal. Falsely deleted accounts had to be restored. That might have pushed you back. |
Send message Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
... nothing is pushing ME back - the finished WUs are just NOT BEING UPLOADED ... So the retry time is climbing into many hours ... I guess that has nothing to do with falsely deleted accounts ... While I am griping around here, I would like to add/point out that I have plenty of my own work to accomplish without having to check if LHC is having troubles or not and to read, read, read, try, try, try and what not to get WUs uploaded ... Here comes the nice part: Have a nice day ! |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"While I am griping around here, I would like add/point out, that I have plenty of own work to accomplish without having to check" I concluded a long time ago that LHC is not a "set and forget" project. Sometimes they can go a long time without problems, and then the roof falls in. That is what happens when you are dealing with the most advanced physics experiment in the world. It is not a cookie-cutter operation. |
©2024 CERN