Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete

Profile thomasroderick

Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 569
Message 32254 - Posted: 5 Sep 2017, 13:58:28 UTC

My computer has been humming away for a couple of weeks, loading 8 tasks at a time and running through them one by one, at about 2 hours per task. A couple of days ago, one task started, and it currently sits at 100.000% complete after 1d 21:46:25 elapsed time. It was going at the normal rate until it hit 97% (after about 2 hours), and then crawled to 100% over the next 43 hours. No other tasks started or ran, with 4 CPUs devoted to this one task. I tried suspending, resuming, updating the project, restarting BOINC, and rebooting the computer... nothing has kicked it over. I have suspended and resumed other tasks, and they are all running and completing appropriately.

This is task: 154132448, Work Unit: 73907132. It has a deadline about 13 hours from right now. I do not really care about the credit. I simply hate to see a completed research effort get destroyed.

Any thoughts on how to get this over the line? Or is this a case of aborting the task and moving on? Have not seen anything in the logs to indicate there was an issue, and other tasks around it did not have problems. Thoughts greatly appreciated, thank you in advance.

- Tom.
ID: 32254
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,203,522
RAC: 46,621
Message 32262 - Posted: 5 Sep 2017, 14:44:36 UTC

Take a short journey through my checklist, point 16e and following.


Supporting BOINC, a great concept !
ID: 32262
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,203,522
RAC: 46,621
Message 32264 - Posted: 5 Sep 2017, 14:46:07 UTC

And look here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4422


Supporting BOINC, a great concept !
ID: 32264
Profile thomasroderick

Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 569
Message 32286 - Posted: 5 Sep 2017, 18:38:27 UTC - in response to Message 32264.  

Thank you, Yeti, for the assistance. Greatly appreciated. I have run through the checklist previously and looked at 16e specifically today. In the VM, I can get to the login and password screen, which loads quickly. I tried Alt/F2 to see what was processing. The screen reads, "Event Processing information will appear here", and is otherwise black. Of course, the task says it is at 100% progress, but the elapsed time keeps running. I have had two other Atlas tasks run and complete this morning, while this one was suspended.

Any other suggestions? Or is this one simply a lost cause...
ID: 32286
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,203,522
RAC: 46,621
Message 32293 - Posted: 5 Sep 2017, 20:32:58 UTC - in response to Message 32286.  

This evening I added some more details to points 16e and 17. Please check again and let me know whether this helps you make a decision.


Supporting BOINC, a great concept !
ID: 32293
Profile thomasroderick

Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 569
Message 32298 - Posted: 6 Sep 2017, 2:26:36 UTC - in response to Message 32293.  

Here's what I got on the Properties:
CPU Last Checkpoint: 47:20
CPU - Time: 47:20
Elapsed Time: 1d 21:56:42

Every subsequent check was similar: the CPU time at the last checkpoint and the CPU time both increased and stayed equal to each other, while the elapsed time kept going up.

Other Properties:
Received 9/2/2017 10:02:24am
Report Deadline: 9/5/2017 10:02:23pm
Est. Computation size: 16,020 GFLOPS
Est. Time Remaining -----
Fraction Done: 100.000%
Virtual mem size: 112.37MB
Working set size: 5.66 GB
Progress Rate: 2.160% per hour

Alas... it appears I may have run out of time. The deadline is only 45 minutes from now. I will let it run until then and see what happens. The other Atlas tasks will kick in after it clears.

Thank you for your comments and assistance. The checklist has been beneficial as well.
ID: 32298
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,203,522
RAC: 46,621
Message 32313 - Posted: 6 Sep 2017, 17:57:53 UTC - in response to Message 32298.  

From the properties it looks fine for a 1-Core-WU

Could you check the consoles with Alt/F1 to Alt/F3?


Supporting BOINC, a great concept !
ID: 32313
Profile thomasroderick

Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 569
Message 32331 - Posted: 7 Sep 2017, 14:10:03 UTC - in response to Message 32313.  

F1: Immediately takes me to the login.
F2: Empty black screen, save for the single line at the top, "Event Processing information will appear here." But no additional lines of information.
F3: Image below.

ID: 32331
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,203,522
RAC: 46,621
Message 32333 - Posted: 7 Sep 2017, 14:50:52 UTC - in response to Message 32331.  

Looks good for a 4-Core-WU


Supporting BOINC, a great concept !
ID: 32333
Profile thomasroderick

Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 569
Message 32336 - Posted: 7 Sep 2017, 17:56:03 UTC - in response to Message 32333.  

I let it run for a little while longer... elapsed time of 2d 1:33:33. Still sitting at 100% with ----- remaining. I checked my tasks online and that specific one is now saying, "Timed out - no response." So it appears this one will be lost, and I will abort from my system. Thanks for looking into the situation. It has been a valuable learning experience for me, with you guiding me through.
ID: 32336
Profile thomasroderick

Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 569
Message 32338 - Posted: 7 Sep 2017, 20:45:08 UTC - in response to Message 32336.  

It finally gave up the ghost a few minutes ago. The BOINC Manager came up with the status: Aborted, File disk full. The task output can be found at:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=154132448

Run time and CPU time were drastically different, so something went wrong while my machine was working on this task. Maybe a power or network glitch or something. Chalk it up to the gremlins.
ID: 32338
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 32643 - Posted: 6 Oct 2017, 7:56:13 UTC

I'm seeing this too. Most tasks complete normally, but a significant number go slower and slower and slower, and (usually) eventually fail. The information revealed by the Properties button indicates they are working, and the VM console confirms this (Alt-F3 shows two athena tasks working away like crazy as expected, and Alt-F2 shows events happening).

I've aborted most of these tasks, but I have let two run to the bitter conclusion:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=157950256
https://lhcathome.cern.ch/lhcathome/result.php?resultid=158351522

The wingman completed these tasks OK, but that doesn't mean there isn't some problem that appears randomly (an improperly initialised pointer, say) that sometimes sends tasks out into the wilderness, bumbling around until they crash. This wastes a terrific amount of CPU time, and it's impossible to see for sure that it has happened. I have recently had one task that ran slower and slower to the point where it had almost stopped, but eventually it completed, and with lots of brownie points.
ID: 32643
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,203,522
RAC: 46,621
Message 32645 - Posted: 6 Oct 2017, 8:18:00 UTC - in response to Message 32643.  

> I've aborted most of these tasks, but I have let two run to the bitter conclusion:
>
> https://lhcathome.cern.ch/lhcathome/result.php?resultid=157950256

I have taken a look at this result. I see you are using CPU throttling at 50%. It is supposed to work, but I think it could be a reason for your problems.

Instead of throttling the CPU to 50%, why don't you limit the number of cores to 50%? This will give you the same effect, but the calculations inside the VM may run more smoothly.

Are you running 1-core WUs?


Supporting BOINC, a great concept !
ID: 32645
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 32647 - Posted: 6 Oct 2017, 9:35:47 UTC - in response to Message 32645.  
Last modified: 6 Oct 2017, 9:36:48 UTC

Actually I do both. On the machine that crunches for LHC, there are 8 CPUs on the processor, and I let BOINC use four of them, running at 50%. This keeps the machine responsive for me, and ensures the fans don't run with excessive noise (I use non-dedicated machines for BOINC, as per the original intention). I let Atlas use two of the processors, and non-Atlas tasks use the other two (or all four if there's no Atlas task). I did try running Atlas with one CPU, but then I had even more tasks that ended in a slow car crash. With Atlas using two processors, fewer Atlas tasks fail this way, but it's only Atlas tasks that are (routinely) failing, and this has only been happening recently.
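
As a rough sanity check on the numbers above, the CPU budget can be worked out in a few lines of Python. This is only a sketch: it assumes throttling behaves like a simple duty cycle (a simplification that says nothing about how smoothly the VM runs), and the function name is made up for illustration.

```python
def core_equivalents(total_cpus: int, cpu_fraction: float, time_fraction: float) -> float:
    """Approximate full-time cores available to BOINC.

    total_cpus    -- logical CPUs on the machine
    cpu_fraction  -- share of CPUs BOINC may use (0.5 means 50%)
    time_fraction -- share of CPU time BOINC may use (0.5 means 50% throttling)
    """
    return total_cpus * cpu_fraction * time_fraction


if __name__ == "__main__":
    # Setup described in the post: 8 CPUs, 4 of them in use, throttled to 50%.
    both_limits = core_equivalents(8, 0.5, 0.5)
    # Same budget reached with a core limit only: 2 of 8 CPUs, no throttling.
    cores_only = core_equivalents(8, 0.25, 1.0)
    print(both_limits, cores_only)
```

Under that assumption, both settings give the same overall budget; the difference is that with a core limit alone the VM is never paused partway through its work.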
ID: 32647
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,203,522
RAC: 46,621
Message 32648 - Posted: 6 Oct 2017, 9:41:56 UTC - in response to Message 32647.  

Hm, 1-core and 2-core tasks are known to be a little bit critical if you use the standard configuration out of the box.

ATLAS tasks vary in how much memory they need, and with the standard configuration it can happen that a process inside the VM does not get enough memory, in which case the WU will fail.

There are two ways to solve this:

1) Switch to 3-core WUs (or even bigger); they run better out of the box.
2) Set up an app_config.xml that gives your 1-core and 2-core tasks more memory.


Supporting BOINC, a great concept !
ID: 32648
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 32653 - Posted: 6 Oct 2017, 13:59:40 UTC - in response to Message 32648.  

Ok, thanks. I'll try 3 core first, as I got really fed up with the endless fiddling about with app_config, and dumped it when I saw I could set the number of cores in the LHC settings.
ID: 32653
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 32664 - Posted: 7 Oct 2017, 4:53:08 UTC - in response to Message 32653.  

Here is an example of an app_config.xml that should work for you if you want to go back to 2 cores:
<?xml version="1.0"?>
<app_config>
  <project_max_concurrent>1</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2.000000</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
</app_config>
We are the product of random evolution.
ID: 32664
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 32665 - Posted: 7 Oct 2017, 5:01:52 UTC - in response to Message 32643.  

> I have recently had one task that ran slower and slower to the point where it had almost stopped

I guess you are referring to the progress of the task, which does not increase steadily with time but increases less and less as time goes on. This does not mean that your task has stopped processing or is processing more slowly.
What it means is that the initial estimate of the time needed to complete was far lower than the actual time needed. Hence, so that the progress never goes above 100%, it increases more and more slowly as it gets close to 100%.
This is normal behaviour with ATLAS tasks.
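
To picture this behaviour, here is a toy model in Python. It is only an illustration: the exponential curve and the function name are assumptions of mine, not the formula that BOINC or the ATLAS wrapper actually uses.

```python
import math


def displayed_progress(elapsed_h: float, estimated_h: float) -> float:
    """Toy progress curve: rises quickly at first, then flattens out below 1.0.

    elapsed_h   -- hours the task has been running
    estimated_h -- hypothetical initial runtime estimate, in hours
    """
    return 1.0 - math.exp(-elapsed_h / estimated_h)


if __name__ == "__main__":
    estimate = 2.0  # hypothetical estimate that turns out to be far too low
    for hours in (1, 2, 4, 8, 16):
        print(f"{hours:>2} h elapsed -> {displayed_progress(hours, estimate):.3%}")
```

In this model each extra hour adds less to the displayed percentage than the hour before it, even though the task is doing the same amount of work per hour throughout.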
We are the product of random evolution.
ID: 32665
Erich56

Joined: 18 Dec 15
Posts: 1785
Credit: 117,277,315
RAC: 71,504
Message 32667 - Posted: 7 Oct 2017, 5:44:50 UTC - in response to Message 32664.  

> Here is an example of an app_config.xml that should work for you if you want to go back to 2 cores:
> <cmdline>--memory_size_mb 5000</cmdline>

Based on the experience of recent months, I would strongly recommend setting the memory size to 6000 MB or even higher.
In most cases, console 3 shows a memory usage of slightly above 5 GB, but I have had tasks where it went up to more than 6 GB.
So, to be on the safe side, my setting (with 32 GB in my box) is
<cmdline>--memory_size_mb 7000</cmdline>
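
For anyone who prefers to generate the file rather than edit it by hand, here is a minimal Python sketch. It assumes the same element names, app name and plan class as the app_config.xml example earlier in this thread; the helper name build_app_config is hypothetical, and the optional max_concurrent settings from that example are left out.

```python
import xml.etree.ElementTree as ET


def build_app_config(ncpus: int, memory_mb: int) -> str:
    """Return app_config.xml content for ATLAS with the given cores and VM memory."""
    root = ET.Element("app_config")
    app_version = ET.SubElement(root, "app_version")
    ET.SubElement(app_version, "app_name").text = "ATLAS"
    ET.SubElement(app_version, "avg_ncpus").text = f"{ncpus:.6f}"
    # Plan class copied from the example posted earlier in this thread.
    ET.SubElement(app_version, "plan_class").text = "vbox64_mt_mcore_atlas"
    ET.SubElement(app_version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Two cores with 7000 MB for the VM, as suggested in this post.
    print(build_app_config(2, 7000))
```

The output comes out as a single line without indentation. That is still well-formed XML, but you may want to tidy it up before saving it as app_config.xml.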
ID: 32667
Erich56

Joined: 18 Dec 15
Posts: 1785
Credit: 117,277,315
RAC: 71,504
Message 32685 - Posted: 8 Oct 2017, 4:55:34 UTC - in response to Message 32667.  

> Based on the experience of recent months, I would strongly recommend setting the memory size to 6000 MB or even higher.
> In most cases, console 3 shows a memory usage of slightly above 5 GB, but I have had tasks where it went up to more than 6 GB.
> So, to be on the safe side, my setting (with 32 GB in my box) is
> <cmdline>--memory_size_mb 7000</cmdline>

Last night, I had two ATLAS 2-core tasks running, each of them, according to the info in console 3, using up to 6.2 GB of RAM.
ID: 32685