Message boards :
ATLAS application :
Atlas task running over 45 hours, 100% complete
Author | Message |
---|---|
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 4 |
My computer has been humming away for a couple of weeks, loading 8 tasks at a time, and running through them one by one, at about 2 hours per task. A couple of days ago, one task started, and it currently sits at 100.000% complete after 1d 21:46:25 elapsed time. It was going at the normal rate until it hit 97% (after about 2 hours), and then crawled to 100% over the next 43 hours. No other tasks started or ran, with 4 CPUs devoted to this one task. I tried suspending, resuming, updating the project, restarting BOINC, rebooting the computer... nothing has kicked it over. I have suspended and resumed other tasks, and they are all running and completing appropriately. This is task: 154132448, Work Unit: 73907132. It has a deadline about 13 hours from right now. I do not really care about the credit. I simply hate to see a completed research effort get destroyed. Any thoughts on how to get this over the line? Or is this a case of aborting the task and moving on? I have not seen anything in the logs to indicate there was an issue, and other tasks around it did not have problems. Thoughts greatly appreciated, thank you in advance. - Tom. |
Joined: 2 Sep 04 Posts: 455 Credit: 201,270,405 RAC: 4,379 |
Take a short journey through my checklist Point 16e and following. Supporting BOINC, a great concept ! |
Joined: 2 Sep 04 Posts: 455 Credit: 201,270,405 RAC: 4,379 |
And look here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4422 Supporting BOINC, a great concept ! |
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 4 |
Thank you, Yeti, for the assistance. Greatly appreciated. I have run through the checklist previously, looked at 16e specifically today. In the VM, I can get to the login and password screen, that loads quickly. I tried the Alt/F2 to see what was processing. The screen reads, "Event Processing information will appear here" and the screen is black. Of course, the task says it is 100% progress, but the elapsed time is continuing to run. I have had two other Atlas tasks run and complete this morning, while this one was suspended. Any other suggestions? Or is this one simply a lost cause... |
Joined: 2 Sep 04 Posts: 455 Credit: 201,270,405 RAC: 4,379 |
|
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 4 |
Here's what I got from Properties: CPU time at last checkpoint: 47:20; CPU time: 47:20; Elapsed time: 1d 21:56:42. On every subsequent check the pattern was the same: the checkpoint CPU time and the CPU time had increased and still matched each other, while the elapsed time kept climbing. Other properties: Received: 9/2/2017 10:02:24am; Report deadline: 9/5/2017 10:02:23pm; Est. computation size: 16,020 GFLOPS; Est. time remaining: -----; Fraction done: 100.000%; Virtual memory size: 112.37 MB; Working set size: 5.66 GB; Progress rate: 2.160% per hour. Alas... it appears I may have run out of time. The deadline is only 45 minutes from now. I will let it run until then and see what happens. The other Atlas tasks will kick in after it clears. Thank you for your comments and assistance. The checklist has been beneficial as well. |
Joined: 2 Sep 04 Posts: 455 Credit: 201,270,405 RAC: 4,379 |
|
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 4 |
F1: Immediately takes me to the login. F2: Empty black screen, save for the single line at the top, "Event Processing information will appear here." But no additional lines of information. F3: Image below. |
Joined: 2 Sep 04 Posts: 455 Credit: 201,270,405 RAC: 4,379 |
|
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 4 |
I let it run for a little while longer... elapsed time of 2d 1:33:33. Still sitting at 100% with ----- remaining. I checked my tasks online and that specific one is now saying, "Timed out - no response." So it appears this one will be lost, and I will abort from my system. Thanks for looking into the situation. It has been a valuable learning experience for me, with you guiding me through. |
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 4 |
It finally gave up the ghost a few minutes ago. BOINC Manager came up with the status: Aborted, File disk full. The task output can be found at: https://lhcathome.cern.ch/lhcathome/result.php?resultid=154132448 Run time and CPU time were drastically different, so something went wrong while my machine was working on this task. Maybe a power or network glitch or something. Chalk it up to the gremlins. |
Joined: 9 Feb 16 Posts: 48 Credit: 537,111 RAC: 0 |
I'm seeing this too. Most tasks complete normally, but a significant number go slower and slower and slower, and (usually) eventually fail. The information revealed by the Properties button indicates they are working, and the VM console confirms this (Alt-F3 shows two athena tasks working away like crazy as expected, and Alt-F2 shows events happening). I've aborted most of these tasks, but I have let two run to the bitter conclusion: https://lhcathome.cern.ch/lhcathome/result.php?resultid=157950256 https://lhcathome.cern.ch/lhcathome/result.php?resultid=158351522 The wingman completed these tasks OK, but that doesn't mean there isn't some problem that appears randomly (an improperly initialised pointer, say) that sometimes sends tasks out into the wilderness, bumbling around until they crash. This wastes a terrific amount of CPU time, and it's impossible to see for sure that it has happened. I have recently had one task that ran slower and slower to the point where it had almost stopped, but eventually it completed, and with lots of brownie points. |
Joined: 2 Sep 04 Posts: 455 Credit: 201,270,405 RAC: 4,379 |
"I've aborted most of these tasks, but I have let two run to the bitter conclusion:" I have taken a look at this result. I see you are using CPU throttling at 50%. It is supposed to work, but I think it could be a reason for your problems. Instead of throttling the CPU to 50%, why don't you limit the number of cores to 50%? This will give you the same effect, but the calculations inside the VM may run more smoothly. Are you running 1-core WUs? Supporting BOINC, a great concept ! |
Joined: 9 Feb 16 Posts: 48 Credit: 537,111 RAC: 0 |
Actually I do both. On the machine that crunches for LHC, there are 8 CPUs on the processor, and I let BOINC use four of them, running at 50%. This keeps the machine responsive for me, and ensures the fans don't run with excessive noise (I use non-dedicated machines for BOINC, as per the original intention). I let Atlas use two of the processors, and non-Atlas tasks use the other two (or all four if there's no Atlas task). I did try running Atlas with one CPU, but then I had even more tasks that ended in a slow car crash. With Atlas using two processors, fewer Atlas tasks fail this way, but it's only Atlas tasks that are (routinely) failing, and this has only been happening recently. |
Joined: 2 Sep 04 Posts: 455 Credit: 201,270,405 RAC: 4,379 |
Hm, 1- and 2-core tasks are known to be a little bit critical if you use the "standard config out of the box". ATLAS tasks vary in how much memory they need, and with the standard config it may happen that a process inside the VM does not get enough memory, and then this WU will fail. There are 2 ways to solve this: 1) Switch to 3-core WUs (or even bigger); they run better out of the box. 2) Set up an app_config that gives your 1/2-core tasks more memory. Supporting BOINC, a great concept ! |
Joined: 9 Feb 16 Posts: 48 Credit: 537,111 RAC: 0 |
Ok, thanks. I'll try 3 core first, as I got really fed up with the endless fiddling about with app_config, and dumped it when I saw I could set the number of cores in the LHC settings. |
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
Here is an example of an app_config.xml that should work for you if you want to go back to 2 cores: <?xml version="1.0"?> <app_config> <project_max_concurrent>1</project_max_concurrent> <app> <name>ATLAS</name> <max_concurrent>1</max_concurrent> </app> <app_version> <app_name>ATLAS</app_name> <avg_ncpus>2.000000</avg_ncpus> <plan_class>vbox64_mt_mcore_atlas</plan_class> <cmdline>--memory_size_mb 5000</cmdline> </app_version> </app_config> We are the product of random evolution. |
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
"I have recently had one task that ran slower and slower to the point where it had almost stopped" I guess you are referring to the Progress of the task, which does not increase at a constant rate but increases less and less over time. This does not mean that your task has stopped processing or is processing more slowly. It means that the initial estimate of the time needed to complete was far lower than the actual time needed. Hence, so that the progress never goes above 100%, it increases more and more slowly as it gets near 100%. This is normal behaviour with ATLAS tasks. We are the product of random evolution. |
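To make the progress behaviour described in this post concrete, here is a toy model in Python. It is only a sketch: the function name `reported_progress`, the 97% "knee" (borrowed from the first post in this thread), and the exponential slow-down are all assumptions chosen for illustration, not the formula BOINC or the ATLAS wrapper actually uses.

```python
import math


def reported_progress(elapsed_h, estimated_h, knee=0.97):
    """Toy model: fraction done as a function of elapsed vs. estimated hours.

    Hypothetical formula for illustration only. Progress is linear until
    `knee`, then creeps towards 1.0 without ever exceeding it.
    """
    if estimated_h <= 0:
        raise ValueError("estimated_h must be positive")
    raw = elapsed_h / estimated_h
    if raw <= knee:
        # The estimate still looks plausible: report plain elapsed/estimated.
        return raw
    # The task has outrun its estimate: squeeze the remaining (1 - knee)
    # of the progress bar into an asymptotic approach to 100%.
    overshoot = raw - knee
    return knee + (1.0 - knee) * (1.0 - math.exp(-overshoot / (1.0 - knee)))


if __name__ == "__main__":
    # A task estimated at 2 hours that keeps running far longer.
    for hours in (1.0, 1.9, 2.0, 2.5, 4.0, 45.0):
        print(f"{hours:5.1f} h elapsed -> {reported_progress(hours, 2.0):.3%}")
```

In this model a task that overruns its estimate shows a figure that rounds to 100% long before it finishes, so the progress column alone says little about whether the task is still doing useful work.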
Joined: 18 Dec 15 Posts: 1821 Credit: 118,983,735 RAC: 18,277 |
"Here is an example of an app_config.xml that should work for you if you want to go back to 2 cores:" Based on the experience of recent months, I would strongly recommend setting the memory size to 6000 MB or even higher. In most cases, console 3 shows a memory usage of slightly above 5 GB, but I have had tasks where it went up to more than 6 GB. So, to be on the safe side, my setting (with 32 GB in my box) is <cmdline>--memory_size_mb 7000</cmdline> |
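For anyone who wants to try different memory sizes, the app_config.xml quoted earlier in the thread can be regenerated with the value as a parameter. The Python sketch below is just one way to do that: the element names, the plan class `vbox64_mt_mcore_atlas` and the `--memory_size_mb` option are copied from the posts above, while the helper name `build_app_config` and its defaults are made up here for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(ncpus=2, memory_mb=7000):
    """Return app_config.xml text for ATLAS (hypothetical helper).

    Mirrors the example posted in this thread, with the core count and
    the VM memory size exposed as parameters.
    """
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = "1"

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = "ATLAS"
    ET.SubElement(app, "max_concurrent").text = "1"

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = "ATLAS"
    ET.SubElement(version, "avg_ncpus").text = f"{ncpus:.6f}"
    ET.SubElement(version, "plan_class").text = "vbox64_mt_mcore_atlas"
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"

    # Pretty-print where available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 2 cores with 7000 MB, the "safe side" value suggested above.
    print(build_app_config(ncpus=2, memory_mb=7000))
```

The script only prints the XML. Saving it as app_config.xml in the LHC@home project folder and asking BOINC to re-read its config files is left as a manual step, so nothing gets overwritten by accident.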
Joined: 18 Dec 15 Posts: 1821 Credit: 118,983,735 RAC: 18,277 |
"Based on experience of the recent months, I would strongly recommend to set the memory size to 6000MB or even higher." Last night, I had two ATLAS 2-core tasks running, each of them, according to the info in console_3, using up to 6.2 GB RAM. |
©2024 CERN