Thread 'Atlas task running over 45 hours, 100% complete'

Author	Message
thomasroderick Send message Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 32254 - Posted: 5 Sep 2017, 13:58:28 UTC My computer has been humming away for a couple weeks, loading 8 tasks at a time, and running through them one by one, at about 2 hours per task. A couple days ago, 1 task started, and currently sits at 100.000% complete after 1d 21:46:25 elapsed time. It was going at normal rate until it hit 97% (after about 2 hours), and then has crawled to 100% over the next 43 hours. No other tasks started or ran, 4 CPUs devoted to this one task. I tried suspending, resuming, updating the project, restarting BOINC, rebooted the computer... nothing has kicked it over. I have suspended and resumed other tasks, and they are all running and completing appropriately. This is task: 154132448, Work Unit: 73907132. It has a deadline about 13 hours from right now. I do not really care about the credit. I simply hate to see a completed research effort get destroyed. Any thoughts on how to get this over the line? Or is this a case of aborting the task and moving on? Have not seen anything in the logs to indicate there was an issue, and other tasks around it did not have problems. Thoughts greatly appreciated, thank you in advance. - Tom. ID: 32254 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 32262 - Posted: 5 Sep 2017, 14:44:36 UTC Take a short journey through my checklist Point 16e and following. Supporting BOINC, a great concept ! ID: 32262 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 32264 - Posted: 5 Sep 2017, 14:46:07 UTC And look here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4422 Supporting BOINC, a great concept ! ID: 32264 · Reply Quote

thomasroderick Send message Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 32286 - Posted: 5 Sep 2017, 18:38:27 UTC - in response to Message 32264. Thank you, Yeti, for the assistance. Greatly appreciated. I have run through the checklist previously, looked at 16e specifically today. In the VM, I can get to the login and password screen, that loads quickly. I tried the Alt/F2 to see what was processing. The screen reads, "Event Processing information will appear here" and the screen is black. Of course, the task says it is 100% progress, but the elapsed time is continuing to run. I have had two other Atlas tasks run and complete this morning, while this one was suspended. Any other suggestions? Or is this one simply a lost cause... ID: 32286 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 32293 - Posted: 5 Sep 2017, 20:32:58 UTC - in response to Message 32286. This evening I added some more details to 16e and 17. Please check again and let me know whether this has helped you to make a decision. Supporting BOINC, a great concept ! ID: 32293 · Reply Quote

thomasroderick Send message Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 32298 - Posted: 6 Sep 2017, 2:26:36 UTC - in response to Message 32293. Here's what I got on the Properties: CPU Last Checkpoint: 47:20 CPU - Time: 47:20 Elapsed Time: 1d 21:56:42 Every subsequent check was similar: the CPU Last Checkpoint and CPU Time increased and stayed equal to each other, and the elapsed time went up. Other Properties: Received 9/2/2017 10:02:24am Report Deadline: 9/5/2017 10:02:23pm Est. Computation size: 16,020 GFLOPS Est. Time Remaining ----- Fraction Done: 100.000% Virtual mem size: 112.37MB Working set size: 5.66 GB Progress Rate: 2.160% per hour Alas... it appears I may have run out of time. The deadline is only 45 minutes from now. I will let it run until then and see what happens. The other Atlas tasks will kick in after it clears. Thank you for your comments and assistance. The checklist has been beneficial as well. ID: 32298 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 32313 - Posted: 6 Sep 2017, 17:57:53 UTC - in response to Message 32298. From the properties it looks fine for a 1-Core-WU. Could you check with ALT/F1 - ALT/F3 ? Supporting BOINC, a great concept ! ID: 32313 · Reply Quote

thomasroderick Send message Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 32331 - Posted: 7 Sep 2017, 14:10:03 UTC - in response to Message 32313. F1: Immediately takes me to the login. F2: Empty black screen, save for the single line at the top, "Event Processing information will appear here." But no additional lines of information. F3: Image below. ID: 32331 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 32333 - Posted: 7 Sep 2017, 14:50:52 UTC - in response to Message 32331. Looks good for a 4-Core-WU Supporting BOINC, a great concept ! ID: 32333 · Reply Quote

thomasroderick Send message Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 32336 - Posted: 7 Sep 2017, 17:56:03 UTC - in response to Message 32333. I let it run for a little while longer... elapsed time of 2d 1:33:33. Still sitting at 100% with ----- remaining. I checked my tasks online and that specific one is now saying, "Timed out - no response." So it appears this one will be lost, and I will abort from my system. Thanks for looking into the situation. It has been a valuable learning experience for me, with you guiding me through. ID: 32336 · Reply Quote

thomasroderick Send message Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 32338 - Posted: 7 Sep 2017, 20:45:08 UTC - in response to Message 32336. It finally gave up the ghost a few minutes ago. The BOINC manager came up with a status: Aborted, File disk full. The task output can be found at: https://lhcathome.cern.ch/lhcathome/result.php?resultid=154132448 Run time and CPU time were drastically different, so something went wrong while my machine was working on this task. Maybe a power or network glitch or something. Chalk it up to the gremlins. ID: 32338 · Reply Quote

Brummig Send message Joined: 9 Feb 16 Posts: 50 Credit: 546,878 RAC: 0	Message 32643 - Posted: 6 Oct 2017, 7:56:13 UTC I'm seeing this too. Most tasks complete normally, but a significant number go slower and slower and slower, and (usually) eventually fail. The information revealed by the Properties button indicates they are working, and the VM console confirms this (Alt-F3 shows two athena tasks working away like crazy as expected, and Alt-F2 shows events happening). I've aborted most of these tasks, but I have let two run to the bitter conclusion: https://lhcathome.cern.ch/lhcathome/result.php?resultid=157950256 https://lhcathome.cern.ch/lhcathome/result.php?resultid=158351522 The wingman completed these tasks OK, but that doesn't mean there isn't some problem that appears randomly (an improperly initialised pointer, say) that sometimes sends tasks out into the wilderness, bumbling around until they crash. This wastes a terrific amount of CPU time, and it's impossible to see for sure that it has happened. I have recently had one task that ran slower and slower to the point where it had almost stopped, but eventually it completed, and with lots of brownie points. ID: 32643 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 32645 - Posted: 6 Oct 2017, 8:18:00 UTC - in response to Message 32643. I've aborted most of these tasks, but I have let two run to the bitter conclusion: https://lhcathome.cern.ch/lhcathome/result.php?resultid=157950256 I have taken a look at this result. I see you are using CPU-Throttling 50%. It is supposed to work, but I think it could be a reason for your problems. Instead of CPU-Throttling 50%, why don't you limit the number of cores to 50%? This will give you the same effect, but the calculations inside the VM may run smoother. Are you running 1-Core-WUs ? Supporting BOINC, a great concept ! ID: 32645 · Reply Quote

Brummig Send message Joined: 9 Feb 16 Posts: 50 Credit: 546,878 RAC: 0	Message 32647 - Posted: 6 Oct 2017, 9:35:47 UTC - in response to Message 32645. Last modified: 6 Oct 2017, 9:36:48 UTC Actually I do both. On the machine that crunches for LHC, there are 8 CPUs on the processor, and I let BOINC use four of them, running at 50%. This keeps the machine responsive for me, and ensures the fans don't run with excessive noise (I use non-dedicated machines for BOINC, as per the original intention). I let Atlas use two of the processors, and non-Atlas tasks use the other two (or all four if there's no Atlas task). I did try running Atlas with one CPU, but then I had even more tasks that ended in a slow car crash. With Atlas using two processors, fewer Atlas tasks fail this way, but it's only Atlas tasks that are (routinely) failing, and this has only been happening recently. ID: 32647 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 32648 - Posted: 6 Oct 2017, 9:41:56 UTC - in response to Message 32647. Hm. 1- and 2-Core-Tasks are known to be a little bit critical if you use the "standard config out of the box". Atlas tasks vary in the memory they need, and with the standard config it may happen that a process inside the VM does not get enough memory, and then this WU will fail. There are 2 ways to solve this: 1) Switch to 3-Core-WUs (or even bigger); they run better out of the box. 2) Set up an app_config that gives your 1/2-Core-tasks more memory. Supporting BOINC, a great concept ! ID: 32648 · Reply Quote

Brummig Send message Joined: 9 Feb 16 Posts: 50 Credit: 546,878 RAC: 0	Message 32653 - Posted: 6 Oct 2017, 13:59:40 UTC - in response to Message 32648. Ok, thanks. I'll try 3 core first, as I got really fed up with the endless fiddling about with app_config, and dumped it when I saw I could set the number of cores in the LHC settings. ID: 32653 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 32664 - Posted: 7 Oct 2017, 4:53:08 UTC - in response to Message 32653. Here is an example of an app_config.xml that should work for you if you want to go back to 2 cores:
<?xml version="1.0"?>
<app_config>
  <project_max_concurrent>1</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2.000000</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
</app_config>
We are the product of random evolution. ID: 32664 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 32665 - Posted: 7 Oct 2017, 5:01:52 UTC - in response to Message 32643. I have recently had one task that ran slower and slower to the point where it had almost stopped I guess you are referring to the Progress of the task, which does not increase steadily with time but increases less and less over time. This does not mean that your task has stopped processing or is processing more slowly. What it means is that the initial estimate of the time needed to complete was far lower than the actual time needed. Hence, so as not to show a progress above 100%, the progress increases more and more slowly as it gets near 100%. This is normal behaviour with ATLAS tasks. We are the product of random evolution. ID: 32665 · Reply Quote
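To make the point above concrete, here is a minimal sketch of a progress display that slows down instead of passing 100%. The exponential curve and the name `displayed_progress` are assumptions chosen only for illustration; the thread does not say which formula BOINC or the ATLAS wrapper actually uses. The numbers are taken from the first post (about 2 hours per task, 45 hours elapsed).

```python
import math


def displayed_progress(elapsed_s: float, estimated_s: float) -> float:
    """Hypothetical progress curve: rises quickly at first, then flattens.

    It approaches 1.0 but never reaches it, so a task that overruns its
    estimate never shows more than 100%. This is an illustrative model,
    not the formula used by the BOINC client.
    """
    if estimated_s <= 0:
        raise ValueError("estimated_s must be positive")
    return 1.0 - math.exp(-elapsed_s / estimated_s)


if __name__ == "__main__":
    estimate = 2 * 3600  # tasks in the first post usually took about 2 hours
    for hours in (1, 2, 4, 8, 45):
        fraction = displayed_progress(hours * 3600, estimate)
        # Three decimals, like the "Fraction Done" field in the Properties dialog
        print(f"{hours:>2} h elapsed: {100 * fraction:.3f}%")
```

Under this model, a task that is far past its estimate is rounded to 100.000% in the display even though the underlying fraction is still below 1. A reading of 100% on its own therefore does not prove the work has finished; the VM consoles (Alt-F2, Alt-F3) are the better indicator.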

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 162,151,365 RAC: 87,019	Message 32667 - Posted: 7 Oct 2017, 5:44:50 UTC - in response to Message 32664. Here is an example of an app_config.xml that should work for you if you want to go back to 2 cores: <cmdline>--memory_size_mb 5000</cmdline> Based on the experience of recent months, I would strongly recommend setting the memory size to 6000MB or even higher. In most cases, console 3 shows a memory usage of slightly above 5GB, but I have had tasks where it went up to more than 6GB. So, to be on the safe side, my setting (with 32GB in my box) is <cmdline>--memory_size_mb 7000</cmdline> ID: 32667 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 162,151,365 RAC: 87,019	Message 32685 - Posted: 8 Oct 2017, 4:55:34 UTC - in response to Message 32667. Based on the experience of recent months, I would strongly recommend setting the memory size to 6000MB or even higher. In most cases, console 3 shows a memory usage of slightly above 5GB, but I have had tasks where it went up to more than 6GB. So, to be on the safe side, my setting (with 32GB in my box) is <cmdline>--memory_size_mb 7000</cmdline> Last night, I had two ATLAS 2-core tasks running, each of them, according to the info in console_3, using up to 6.2 GB RAM. ID: 32685 · Reply Quote
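For anyone who wants to avoid hand-editing the file each time, the app_config.xml from Message 32664 can be produced by a small script with the core count and memory size as parameters. This is only a sketch: the function name `render_app_config` and its defaults are made up for illustration, the element names and the plan class are copied from the example posted in this thread, and the 7000 MB default follows the setting suggested in Message 32667. Whether those values suit a particular machine is something to check locally.

```python
import xml.etree.ElementTree as ET

XML_DECLARATION = '<?xml version="1.0"?>\n'

# Body copied from the example in this thread; only the core count and
# the memory size are turned into placeholders.
TEMPLATE = """<app_config>
  <project_max_concurrent>1</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>{ncpus:.6f}</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb {memory_mb}</cmdline>
  </app_version>
</app_config>
"""


def render_app_config(ncpus: int = 2, memory_mb: int = 7000) -> str:
    """Return the text of an app_config.xml (hypothetical helper)."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    if memory_mb < 1:
        raise ValueError("memory_mb must be positive")
    body = TEMPLATE.format(ncpus=float(ncpus), memory_mb=memory_mb)
    ET.fromstring(body)  # raises ParseError if the XML is malformed
    return XML_DECLARATION + body


if __name__ == "__main__":
    # Print the file so it can be reviewed before it is saved anywhere.
    print(render_app_config(ncpus=2, memory_mb=7000))
```

The script only prints the text. The output would then be saved as app_config.xml in the LHC@home project directory and picked up by re-reading the config files or restarting the client.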