Message boards : ATLAS application : Uploads of finished tasks not possible since last night

csbyseti

Joined: 6 Jul 17
Posts: 22
Credit: 29,083,578
RAC: 22
Message 33320 - Posted: 13 Dec 2017, 22:06:31 UTC - in response to Message 33319.  
Last modified: 13 Dec 2017, 22:07:17 UTC

Uploads still don't work correctly; they fail to finish.
It looks like the server doesn't accept the old WUs that were downloaded before the server crash.
In a few hours all 24 WUs on both Ryzen machines will be ready for upload; it won't be nice if I have to delete them to get new WUs.
Most of them are long-running WUs, with 4.2 GB to upload on every machine.
ID: 33320
maeax

Joined: 2 May 07
Posts: 1312
Credit: 39,807,265
RAC: 18,371
Message 33326 - Posted: 14 Dec 2017, 5:41:14 UTC - in response to Message 33320.  

In a few hours all 24 WUs on both Ryzen machines will be ready for upload; it won't be nice if I have to delete them to get new WUs.

In the LHC preferences you can set the number of tasks to unlimited, or to 8 as the maximum.
That will help you download some more. The good thing is that the tasks have 200 events and run a long time before they finish.
We all hope CERN IT will find a solution for the uploads.
ID: 33326
Erich56

Joined: 18 Dec 15
Posts: 1460
Credit: 35,636,146
RAC: 43,726
Message 33327 - Posted: 14 Dec 2017, 6:10:28 UTC - in response to Message 33326.  
Last modified: 14 Dec 2017, 6:31:39 UTC

It looks like the server doesn't accept the old WUs that were downloaded before the server crash
Yes, it seems so. The problem has now existed for 30 hours and is still unsolved.
I have some 10 tasks waiting for upload; it would be a shame if they were lost: dozens of hours of computation time for nothing :-(

We all hope CERN IT will find a solution for the uploads.
It's too bad that their information policy vis-a-vis us crunchers is rather restrictive. The only information we have had so far was the posting from Nils yesterday morning. Nothing more since :-(
ID: 33327
csbyseti

Joined: 6 Jul 17
Posts: 22
Credit: 29,083,578
RAC: 22
Message 33329 - Posted: 14 Dec 2017, 7:21:18 UTC - in response to Message 33326.  

I think 24 tasks per BOINC instance is the maximum.
Unlimited results in a lower number of tasks (I forgot the value).
The current tasks need about 7 hours to finish, so throughput will be ~17 tasks per 24 hours (5 active instances with 3 cores).
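
A minimal sketch of the throughput estimate above. It assumes that "5 active instances with 3 cores" means five tasks running at the same time, each taking about 7 hours; the function name and parameters are made up for illustration.

```python
def tasks_per_day(hours_per_task: float, concurrent_tasks: int) -> float:
    """Estimate how many tasks finish in 24 hours.

    Assumes every slot starts a new task as soon as the previous one ends.
    """
    return concurrent_tasks * 24 / hours_per_task


# Numbers taken from the post: ~7 hours per task, 5 tasks at a time.
estimate = tasks_per_day(hours_per_task=7, concurrent_tasks=5)
print(f"{estimate:.1f} tasks per 24 hours")  # 17.1 tasks per 24 hours
```

This matches the "~17 tasks per 24 hours" figure, as long as the 7-hour runtime holds.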

I started a new BOINC instance this morning and got new tasks (1 download error).
Let's see if the upload of the new tasks will finish (first results in about 7 hours).
ID: 33329
Erich56

Joined: 18 Dec 15
Posts: 1460
Credit: 35,636,146
RAC: 43,726
Message 33332 - Posted: 14 Dec 2017, 8:39:11 UTC - in response to Message 33329.  

If one takes a look at the diagram of the number of running ATLAS jobs at
https://lhcathome.cern.ch/ATLAS/atlas_job.php, it's no surprise that the servers broke down.

The number of ATLAS jobs jumped from about 8,000 on Dec. 12 to about 17,000 on Dec. 13.
ID: 33332
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33334 - Posted: 14 Dec 2017, 9:12:02 UTC - in response to Message 33332.  

Those will mostly be old WUs waiting to be uploaded, not active compute jobs. As I mentioned earlier, once this number falls to around 10,000 you can assume the issue has been fixed.
ID: 33334
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33336 - Posted: 14 Dec 2017, 9:23:56 UTC

I was able to successfully upload 1 task now. 4 more are still stuck.
ID: 33336
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 355
Credit: 12,056,116
RAC: 2,131
Message 33340 - Posted: 14 Dec 2017, 10:07:07 UTC

It seems like new tasks are able to upload but those that finished on Tuesday night when we had the broken server are still stuck. The admins are looking into it.
ID: 33340
csbyseti

Joined: 6 Jul 17
Posts: 22
Credit: 29,083,578
RAC: 22
Message 33341 - Posted: 14 Dec 2017, 10:16:50 UTC - in response to Message 33336.  

The new BOINC instance with new tasks can upload (9 tasks uploaded so far), but the older ones still have the problem.
So it seems to be a database problem and not a performance problem of the server.
Instead of the result being put in the database, the uploaded file goes to /dev/null.
ID: 33341
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1046
Credit: 6,603,873
RAC: 275
Message 33344 - Posted: 14 Dec 2017, 11:37:41 UTC - in response to Message 33340.  
Last modified: 14 Dec 2017, 11:38:30 UTC

It seems like new tasks are able to upload but those that finished on Tuesday night when we had the broken server are still stuck. The admins are looking into it.

It looks like an attempted result upload occupies a slot on the server that's not freed when the upload fails.
Maybe a retry therefore fails over and over until that slot is freed manually.
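
The hypothesis above can be illustrated with a toy model. This is only a sketch of the general failure mode (a lock that is not released on failure), not the actual BOINC server code; the class, method and parameter names are invented for the example.

```python
class UploadServer:
    """Toy upload server keeping one lock per result file (hypothetical)."""

    def __init__(self):
        self.locks = set()

    def upload(self, result_name, fail=False, release_on_failure=True):
        # A retry is rejected while an earlier attempt still holds the lock.
        if result_name in self.locks:
            return "locked"
        self.locks.add(result_name)
        if fail:
            # The suspected bug: the lock stays behind after a failed upload.
            if release_on_failure:
                self.locks.discard(result_name)
            return "failed"
        self.locks.discard(result_name)
        return "ok"

    def free_manually(self, result_name):
        """What an admin would have to do for every stuck result."""
        self.locks.discard(result_name)


server = UploadServer()
first = server.upload("wu_1", fail=True, release_on_failure=False)
second = server.upload("wu_1")   # retry hits the leftover lock
server.free_manually("wu_1")
third = server.upload("wu_1")    # succeeds once the lock is gone
print(first, second, third)      # failed locked ok
```

In this model every retry keeps returning "locked" until someone removes the stale lock, which is the behaviour described in the post.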
ID: 33344
Erich56

Joined: 18 Dec 15
Posts: 1460
Credit: 35,636,146
RAC: 43,726
Message 33346 - Posted: 14 Dec 2017, 12:06:21 UTC - in response to Message 33344.  
Last modified: 14 Dec 2017, 12:07:09 UTC

... Maybe a retry therefore fails over and over until that slot is freed manually.
OMG, how can this be accomplished (if at all)?
ID: 33346
maeax

Joined: 2 May 07
Posts: 1312
Credit: 39,807,265
RAC: 18,371
Message 33347 - Posted: 14 Dec 2017, 12:16:41 UTC

This is a message from the upload server:

14.12.2017 13:07:41 | LHC@home | [error] Error reported by file upload server: [qeyLDmVScirnDDn7oo6G73TpABFKDmABFKDmGPHKDmABFKDmmplz6m_0_r1896860968_ATLAS_result] locked by file_upload_handler PID=-1
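
For anyone who wants to count how many of their results are affected, here is a minimal sketch that picks such lines out of the client's event log. The regular expression is written against the single line quoted above; other client versions or locales may format the line differently, so treat the pattern as an assumption.

```python
import re

# Pattern based only on the one log line quoted in this thread.
LOCK_ERROR = re.compile(
    r"^(?P<timestamp>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<project>[^|]+?) \| \[error\] Error reported by file upload server: "
    r"\[(?P<file>[^\]]+)\] locked by file_upload_handler PID=(?P<pid>-?\d+)"
)


def parse_lock_error(line):
    """Return the fields of a 'locked by file_upload_handler' line, or None."""
    match = LOCK_ERROR.search(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    return info


sample = (
    "14.12.2017 13:07:41 | LHC@home | [error] Error reported by file upload server: "
    "[qeyLDmVScirnDDn7oo6G73TpABFKDmABFKDmGPHKDmABFKDmmplz6m_0_r1896860968_ATLAS_result] "
    "locked by file_upload_handler PID=-1"
)
info = parse_lock_error(sample)
print(info["project"], info["pid"])  # LHC@home -1
```

The reported PID of -1 would fit a lock left behind without a live handler process, but that is only an interpretation of this one message.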
ID: 33347
maeax

Joined: 2 May 07
Posts: 1312
Credit: 39,807,265
RAC: 18,371
Message 33348 - Posted: 14 Dec 2017, 12:22:55 UTC - in response to Message 33346.  

OMG, how can this be accomplished (if at all)?

The first rule when you work in IT: stay CALM, and then do the right thing.
CERN IT is doing its best to help us with the uploads.
ID: 33348
Erich56

Joined: 18 Dec 15
Posts: 1460
Credit: 35,636,146
RAC: 43,726
Message 33349 - Posted: 14 Dec 2017, 13:19:09 UTC - in response to Message 33348.  

The first rule when you work in IT: stay CALM, and then do the right thing.
CERN IT is doing its best to help us with the uploads.
The main problem, though, is that many of these "older" tasks are approaching their deadline.
So, if they cannot be uploaded soon, they're gone :-(
ID: 33349
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,245,747
RAC: 0
Message 33352 - Posted: 14 Dec 2017, 17:04:07 UTC

I'm sure you are working on it ... or maybe not ???

Just to add my ten pennies:
All WUs that I started a couple of hours ago on all my rigs, and that have finished successfully and are ready to upload,
are now either in "project backoff" status or waiting for "retry".

Hope someone is STILL bravely working on the problem ...
ID: 33352
Brummig

Joined: 9 Feb 16
Posts: 40
Credit: 533,053
RAC: 3
Message 33354 - Posted: 14 Dec 2017, 17:41:07 UTC - in response to Message 33352.  
Last modified: 14 Dec 2017, 17:46:07 UTC

I have the same problem. However, what alerted me to there being a problem is that on BoincStats my LHC rank has dropped by a whopping 330 in one day. Sometimes I drop back a few places in any one day, but on average I creep forward each day. Also, I notice everyone around me in the ranking tables has dropped back about 300 places, as have large numbers of users well above my position in the table. I notice too that some users near the top of the table are listed as "new" with a very large number of points, but a tiny RAC and no activity over the past month.
ID: 33354
Erich56

Joined: 18 Dec 15
Posts: 1460
Credit: 35,636,146
RAC: 43,726
Message 33356 - Posted: 14 Dec 2017, 17:58:12 UTC - in response to Message 33352.  

Hope someone is STILL bravely working on the problem ...
What the people at CERN should be aware of is that the longer it takes them to get the problem solved, the more finished tasks will reach their deadline, making them invalid.
So let's hope that, with this in mind, the IT specialists will do their best to get the system working as soon as possible.
ID: 33356
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33358 - Posted: 14 Dec 2017, 18:17:10 UTC - in response to Message 33354.  
Last modified: 14 Dec 2017, 18:18:06 UTC

I have the same problem. However, what alerted me to there being a problem is that on BoincStats my LHC rank has dropped by a whopping 330 in one day. Sometimes I drop back a few places in any one day, but on average I creep forward each day. Also, I notice everyone around me in the ranking tables has dropped back about 300 places, as have large numbers of users well above my position in the table. I notice too that some users near the top of the table are listed as "new" with a very large number of points, but a tiny RAC and no activity over the past month.


There was an issue with overzealous spam removal. Falsely deleted accounts had to be restored. That might have pushed you back.
ID: 33358
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,245,747
RAC: 0
Message 33359 - Posted: 14 Dec 2017, 18:29:58 UTC - in response to Message 33358.  


There was an issue with overzealous spam removal. Falsely deleted accounts had to be restored. That might have pushed you back.


... nothing is pushing ME back - the finished WUs are just NOT BEING UPLOADED ...
So the retry time is climbing into many hours ...
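
Why the retry time climbs so quickly can be shown with a generic exponential backoff schedule. This is a sketch of the general technique only; the base delay, growth factor and cap below are made-up values, not the ones the BOINC client actually uses.

```python
def backoff_delays(base_seconds=60, factor=2, cap_seconds=4 * 3600, attempts=10):
    """Return the wait before each retry, doubling up to a fixed cap.

    All default values are illustrative assumptions.
    """
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, cap_seconds))
        delay *= factor
    return delays


delays = backoff_delays()
print([round(d / 60) for d in delays])  # minutes: [1, 2, 4, 8, 16, 32, 64, 128, 240, 240]
```

With these example values the wait passes two hours after eight failed attempts, which is how a handful of failures can turn into "many hours" between retries.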

I guess that has nothing to do with falsely deleted accounts ...

While I am griping here, I would like to add/point out that I have plenty of my own work to accomplish without having to check
if LHC is having trouble or not, and to read, read, read, try, try, try and whatnot to get WUs uploaded ...

Here comes the nice part:
Have a nice day !
ID: 33359
Jim1348

Joined: 15 Nov 14
Posts: 576
Credit: 18,029,565
RAC: 23,105
Message 33360 - Posted: 14 Dec 2017, 18:40:29 UTC - in response to Message 33359.  

While I am griping here, I would like to add/point out that I have plenty of my own work to accomplish without having to check
if LHC is having trouble or not, and to read, read, read, try, try, try and whatnot to get WUs uploaded ...

I concluded a long time ago that LHC is not a "set and forget" project. Sometimes they can go a long time without problems, and then the roof falls in.
That is what happens when you are dealing with the most advanced physics experiment in the world. It is not a cookie-cutter operation.
ID: 33360