Atlas task running over 45 hours, 100% complete


Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete

Profile thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 32254 - Posted: 5 Sep 2017, 13:58:28 UTC

My computer has been humming away for a couple of weeks, loading 8 tasks at a time and running through them one by one, at about 2 hours per task. A couple of days ago, one task started, and it currently sits at 100.000% complete after 1d 21:46:25 elapsed time. It was going at the normal rate until it hit 97% (after about 2 hours), and then crawled to 100% over the next 43 hours. No other tasks started or ran, with 4 CPUs devoted to this one task. I tried suspending, resuming, updating the project, restarting BOINC, and rebooting the computer... nothing has kicked it over. I have suspended and resumed other tasks, and they are all running and completing appropriately.

This is task: 154132448, Work Unit: 73907132. It has a deadline about 13 hours from right now. I do not really care about the credit. I simply hate to see a completed research effort get destroyed.

Any thoughts on how to get this over the line? Or is this a case of aborting the task and moving on? Have not seen anything in the logs to indicate there was an issue, and other tasks around it did not have problems. Thoughts greatly appreciated, thank you in advance.

- Tom.

Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,057,569
RAC: 50,496
Message 32262 - Posted: 5 Sep 2017, 14:44:36 UTC

Take a short journey through my checklist Point 16e and following.
____________


Supporting BOINC, a great concept !

Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,057,569
RAC: 50,496
Message 32264 - Posted: 5 Sep 2017, 14:46:07 UTC

And look here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4422

Profile thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 32286 - Posted: 5 Sep 2017, 18:38:27 UTC - in response to Message 32264.

Thank you, Yeti, for the assistance. Greatly appreciated. I have run through the checklist previously, and looked at 16e specifically today. In the VM, I can get to the login and password screen, which loads quickly. I tried Alt/F2 to see what was processing. The screen reads, "Event Processing information will appear here" and is otherwise black. Of course, the task says it is at 100% progress, but the elapsed time continues to run. I have had two other Atlas tasks run and complete this morning, while this one was suspended.

Any other suggestions? Or is this one simply a lost cause...

Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,057,569
RAC: 50,496
Message 32293 - Posted: 5 Sep 2017, 20:32:58 UTC - in response to Message 32286.

This evening I added some more details to 16e and 17. Please check again and let me know whether this has helped you make a decision.

Profile thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 32298 - Posted: 6 Sep 2017, 2:26:36 UTC - in response to Message 32293.

Here's what I got on the Properties:
CPU Last Checkpoint: 47:20
CPU - Time: 47:20
Elapsed Time: 1d 21:56:42

Every subsequent check was similar: CPU Last Checkpoint and CPU Time increased together and stayed equal, while the elapsed time kept going up.

Other Properties:
Received 9/2/2017 10:02:24am
Report Deadline: 9/5/2017 10:02:23pm
Est. Computation size: 16,020 GFLOPS
Est. Time Remaining -----
Fraction Done: 100.000%
Virtual mem size: 112.37MB
Working set size: 5.66 GB
Progress Rate: 2.160% per hour

Alas... it appears I may have run out of time. The deadline is only 45 minutes from now. I will let it run until then and see what happens. The other Atlas tasks will kick in after it clears.

Thank you for your comments and assistance. The checklist has been beneficial as well.

Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,057,569
RAC: 50,496
Message 32313 - Posted: 6 Sep 2017, 17:57:53 UTC - in response to Message 32298.

From the properties it looks fine for a 1-Core-WU

Could you check with ALT/F1 - ALT/F3 ?

Profile thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 32331 - Posted: 7 Sep 2017, 14:10:03 UTC - in response to Message 32313.

F1: Immediately takes me to the login.
F2: Empty black screen, save for the single line at the top, "Event Processing information will appear here." But no additional lines of information.
F3: Image below.

Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,057,569
RAC: 50,496
Message 32333 - Posted: 7 Sep 2017, 14:50:52 UTC - in response to Message 32331.

Looks good for a 4-Core-WU

Profile thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 32336 - Posted: 7 Sep 2017, 17:56:03 UTC - in response to Message 32333.

I let it run for a little while longer... elapsed time of 2d 1:33:33. Still sitting at 100% with ----- remaining. I checked my tasks online and that specific one is now saying, "Timed out - no response." So it appears this one will be lost, and I will abort from my system. Thanks for looking into the situation. It has been a valuable learning experience for me, with you guiding me through.

Profile thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 32338 - Posted: 7 Sep 2017, 20:45:08 UTC - in response to Message 32336.

It finally gave up the ghost a few minutes ago. In the BOINC manager, it came up with a status: Aborted, File disk full. The task output can be found at:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=154132448

Run time and CPU time were drastically different, so something went wrong while my machine was working on this task. Maybe a power or network glitch or something. Chalk it up to the gremlins.
