Message boards : ATLAS application : Atlas runs very slowly after 94%

Jim Wilkins

Joined: 22 Aug 06
Posts: 22
Credit: 466,060
RAC: 0
Message 36679 - Posted: 9 Sep 2018, 16:57:06 UTC

I have an ATLAS task that is running very slowly at 94% complete. I had one last week that failed validation after taking a long time to finish. Do I abort this task?

Running OS X on an iMac with VirtualBox 5.2.8r121009
ID: 36679
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36680 - Posted: 9 Sep 2018, 19:37:43 UTC - in response to Message 36679.  
Last modified: 9 Sep 2018, 19:40:17 UTC

I don't see the one that failed to validate in your task list. I see only what I assume is the slowpoke, plus https://lhcathome.cern.ch/lhcathome/result.php?resultid=206302027, which completed and validated. The stderr output from that task indicates it stopped and started several times. ATLAS tasks need to run from start to finish uninterrupted. That one luckily recovered, but you won't be that lucky every time.

Also, the stderr output indicates you have the VM throttled to 75%. That's almost certainly unnecessary for your reasonably powerful CPU and probably a bad idea for ATLAS tasks. You want an ATLAS task to start only once, stop only once, and run as fast as it can in between. If you find BOINC is slowing down non-BOINC apps, then set BOINC to use 3 cores instead of 4 in your prefs. "Suspend when non-BOINC CPU usage is above __ %" should be no less than 90%.

Abort the slowpoke you have now? The deadline is 3 days away. What I would do is this:
1) suspend all other projects so they won't preempt the ATLAS task
2) remove the 75% throttle on the VM
3) set "Suspend when non-BOINC CPU usage is above __ %" to at least 90% (a sketch of these preference settings follows this list)
4) suspend the ATLAS task
5) shut down the BOINC client
6) wait 2 minutes to make sure the VM has been saved properly
7) reboot
8) resume the ATLAS task, leaving all other tasks/projects suspended
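
If you prefer editing a file to clicking through the Manager, the same settings can go in global_prefs_override.xml in the BOINC data directory. A minimal sketch with illustrative values (adjust to taste, then use Options > Read local prefs file in the BOINC Manager):

<global_preferences>
   <cpu_usage_limit>100.0</cpu_usage_limit>       <!-- no throttling: let the VM run flat out -->
   <suspend_cpu_usage>90.0</suspend_cpu_usage>    <!-- only suspend when non-BOINC CPU load tops 90% -->
   <max_ncpus_pct>75.0</max_ncpus_pct>            <!-- 3 of 4 cores, keeps the desktop responsive -->
</global_preferences>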

ATLAS tasks are extremely fussy. If you treat them like other tasks you will have some that don't even validate, in addition to a few that validate but don't do any useful work (no HITS file). To get a near-100% success rate you need to make compromises you don't need to make with other tasks.

Others suggest setting "Switch between tasks every __ minutes" to some very high number to force tasks to run from start to finish uninterrupted. Forget about that. It doesn't work. If you want to run ATLAS with a high success rate then you need to run only ATLAS. If you want to run Theory, SixTrack and others then you need to either:
1) have 2 BOINC installations on your host: 1 for ATLAS, 1 for the rest (see the sketch after this list)
OR
2) select only ATLAS in your prefs for a month, then change your prefs to exclude ATLAS and run the other apps for a month, then back to ATLAS, flip-flop, flip-flop
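
For option 1, the second client just needs its own data directory and its own GUI RPC port so the two instances don't collide. A rough sketch, assuming a Unix-like host; the path and port are illustrative, not the only valid choices:

boinc --allow_multiple_clients --dir /path/to/boinc_atlas --gui_rpc_port 31418

Attach that instance to LHC@home for ATLAS and leave your other projects on the regular installation.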

It's not supposed to be that way, but that's the way it is. If you can live with a rather high failure rate then don't compromise, but if you want high success with ATLAS then ya gotta compromise.
ID: 36680
Jim Wilkins

Joined: 22 Aug 06
Posts: 22
Credit: 466,060
RAC: 0
Message 36685 - Posted: 11 Sep 2018, 20:59:41 UTC - in response to Message 36680.  

Bronco,

Many thanks for the info. Choices, choices! I'll play with the parameters and see what works.

Thanks,
Jim
ID: 36685
thomasroderick

Joined: 22 May 17
Posts: 15
Credit: 1,184,602
RAC: 16
Message 36689 - Posted: 12 Sep 2018, 14:17:27 UTC - in response to Message 36685.  

If you search through the threads, there are previous discussions on this. It frequently occurs on my machine as well. The task will be moving along just fine until it gets closer to completion, usually somewhere around 80% complete on my machine, and then the runtime will stretch out. For example, a task estimated to take 6 hours, when it gets to about 4 hours elapsed (and 80% complete), will continue running for several more hours instead of the estimated 2. I currently have one running that was estimated at 6 hours, and as I type it shows Progress: 80.64%, Elapsed: 11 h 34 m, Remaining: 2 h 46 m. Anytime I see this, it usually means there are at least another 5 to 6 hours before it will finish.
For me... this is common. I only run one task at a time, so the machine is committed to working on only this one task. Like I said, my opinion is that you are not experiencing anything unusual or special... it simply happens.

- Tom.
ID: 36689
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36700 - Posted: 13 Sep 2018, 16:43:43 UTC - in response to Message 36689.  

For me... this is common. I only run one task at a time, so the machine is committed to working on only this one task. Like I said, my opinion is that you are not experiencing anything unusual or special... it simply happens.


True, it simply happens. It happens when you have everything set up and configured the way ATLAS needs it to be, and it happens when you do not. If you have a near-perfect history of completing ATLAS tasks (complete with HITS file) then it's easy to relax, say "it simply happens", and be confident that the task will complete, validate and produce a HITS file. It's not so easy to relax and say "it happens" when 30% or more of one's ATLAS tasks are failing. If one is in that group then it's only natural to wonder, as Jim did, whether to abort this one or not.

The point is this... if one insists on running ATLAS then there are 2 and only 2 options; they are called DO and DO NOT:

1) DO NOT set things up properly; ignore Yeti's checklist and all the other tips and hints spread throughout these threads; wonder whether each and every ATLAS task you run is going to fail; every time you spot an irregularity, debate with yourself whether to abort the task or not; learn to love a high task mortality rate; learn to love the way your computer burns up electricity but returns a useful result only 50% of the time.

2) DO set up every detail in Yeti's checklist, enjoy a 99.9% ATLAS success rate, relax, be happy
ID: 36700
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 36757 - Posted: 18 Sep 2018, 15:17:42 UTC

One ATLAS task on the Windows 10 PC, two cores, is running at 99.999%. Another was canceled by the server.
Tullio
ID: 36757
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 36783 - Posted: 20 Sep 2018, 9:16:39 UTC - in response to Message 36771.  

If you are referring to this task https://lhcathome.cern.ch/lhcathome/result.php?resultid=206669059 you can see that the error message is "196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED". So maybe you have very little free storage left on your hard drive, or maybe your configuration is wrong.

In the logs you can also see that the VM state changes very often, which indicates that something is wrong, e.g.:
2018-09-14 06:12:15 (5480): VM state change detected. (old = 'running', new = 'paused')
2018-09-14 06:18:23 (5480): VM state change detected. (old = 'paused', new = 'running')

A good idea would be to reduce the maximum number of cores per task from 8 to 1, try to get it up and running with 1 core, and increase the allowed disk space for BOINC. Once you have crunched a couple of tasks successfully with one core, you can increase the number of cores per task (see the sketch below).
Yeti's checklist here is also very helpful (see point 6): https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161
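
The usual place to change the core count is the LHC@home project preferences on the website (the "Max # CPUs" setting), but it can also be pinned on the client with an app_config.xml in the project directory. A minimal sketch; the plan_class shown is an assumption, so copy the value your tasks actually carry from client_state.xml:

<app_config>
   <app_version>
      <app_name>ATLAS</app_name>
      <plan_class>vbox64_mt_mcore_atlas</plan_class>   <!-- assumption: verify against client_state.xml -->
      <avg_ncpus>1</avg_ncpus>                         <!-- start with a single-core VM -->
   </app_version>
</app_config>

Then use Options > Read config files in the BOINC Manager; the change takes effect when new tasks start.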
ID: 36783
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36788 - Posted: 20 Sep 2018, 14:11:35 UTC - in response to Message 36783.  

If you are referring to this task https://lhcathome.cern.ch/lhcathome/result.php?resultid=206669059 you can see that the error message is "196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED". So maybe you have very little free storage left on your hard drive, or maybe your configuration is wrong.


"196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED" means the max disk space the task is allowed to use, as defined by <rsc_disk_bound> in client_state.xml, has been exceeded. It can (and frequently does) happen even if you have a terabyte of free disk space. Neither freeing up tons of disk space nor fiddling with the allowed disk space settings in preferences will fix this problem.

"196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED" can be caused by VBox writing an extremely large snapshot file in the task's slot folder if/when the task is suspended or preempted by another task. The problem has been discussed in at least one other thread already. The best way to prevent this error is to adjust preferences and usage patterns to ensure ATLAS tasks run from start to finish with no interruptions.
ID: 36788
Ola

Joined: 7 Apr 18
Posts: 20
Credit: 137,327
RAC: 0
Message 36971 - Posted: 7 Oct 2018, 18:13:49 UTC

All my recent tasks are running about twice as slowly as expected (for every two "real" seconds, the estimated time advances by only one second), and they slow down sharply after 94-96% done. I'm not sure I did everything on Yeti's checklist, but there were never any problems until now. I'm afraid there may be some problem with the tasks. Moreover, I cannot avoid interrupting the tasks, because I need to turn off my computer at night or when I leave home.
I'm sorry for any mistakes, I'm not a native English speaker.
ID: 36971
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36972 - Posted: 7 Oct 2018, 22:33:05 UTC - in response to Message 36971.  
Last modified: 7 Oct 2018, 22:34:37 UTC

All my recent tasks are running about twice as slowly as expected (for every two "real" seconds, the estimated time advances by only one second)
This is normal. It happens because the client cannot accurately estimate how long these tasks will take.

and they slow down sharply after 94-96% done.
This seems to be normal too.

I'm not sure I did everything on Yeti's checklist, but there were never any problems until now. I'm afraid there may be some problem with the tasks. Moreover, I cannot avoid interrupting the tasks, because I need to turn off my computer at night or when I leave home.
If they run with no interruptions, then 99% of ATLAS tasks will complete and validate, so it seems to me the ATLAS tasks are OK. If you cannot avoid turning off your computer, then you should consider running other projects/tasks and forget about ATLAS, because there is no way to work around the problem.
ID: 36972
Ola

Joined: 7 Apr 18
Posts: 20
Credit: 137,327
RAC: 0
Message 36986 - Posted: 9 Oct 2018, 18:34:15 UTC - in response to Message 36972.  

But there haven't been any problems with them! In addition, I even used to turn off my computer without shutting down the client, and every task was crunched normally. So I think it is a recent problem with the tasks. Moreover, four out of five of my current tasks have just been stopped by the server, so something is in the air ;)
ID: 36986
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36990 - Posted: 9 Oct 2018, 21:54:25 UTC - in response to Message 36986.  

But there haven't been any problems with them! In addition, I even used to turn off my computer without shutting down the client, and every task was crunched normally.
The project used to validate results even if they failed to do any useful work (no HITS file). Then I complained that validating tasks that don't do any useful work gives volunteers the false impression that everything is working fine when it is not. I think maybe they recently changed their policy on validating tasks that fail, which is a good thing because it lets users know they are doing something wrong and motivates them to correct it.

So I think it is a recent problem with the tasks.
If it were a problem with the tasks, you would see several other volunteers reporting it too.

Moreover, four out of five of my current tasks have just been stopped by the server, so something is in the air ;)
Those 4 were stopped because additional iterations of the task failed to download, for example https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=101992161. A failure to download is not really an error in the task; it's a failure on the server.
ID: 36990
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 36993 - Posted: 10 Oct 2018, 5:55:17 UTC - in response to Message 36990.  

I think maybe they recently changed their policy on validating tasks that fail, which is a good thing because it lets users know they are doing something wrong and motivates them to correct it.


Definitely not. Most of my work consists of resends, because some hosts have hundreds of failed tasks (most likely no VirtualBox), and no one cares on either end. The volunteers just keep downloading new work.

If it were a problem with the tasks, you would see several other volunteers reporting it too.


There are issues with recent tasks, hence this thread, "runs very slowly after xx%". I'm seeing the same behaviour, though I'm not as worried.

When the server died on Oct 3rd, I had several tasks run 10+ hours. ATLAS depends on the underlying infrastructure, and there still seem to be issues.
ID: 36993
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2400
Credit: 225,099,551
RAC: 123,477
Message 36994 - Posted: 10 Oct 2018, 7:10:17 UTC - in response to Message 36986.  

Pausing and resuming a VBox task costs lots of resources.
The more cores you have configured, the more resources are required and the more time it takes.


The following lines from your stderr.txt show that you run an 8-core setup:
2018-10-01 22:20:32 (13240): Setting Memory Size for VM. (10200MB)
2018-10-01 22:20:33 (13240): Setting CPU Count for VM. (8)

This is most likely too much for your host and results in the following lines:
2018-10-02 00:47:44 (5040): Stopping VM.
00:47:55 (5040): BOINC client no longer exists - exiting
00:47:55 (5040): timer handler: client dead, exiting
00:48:05 (5040): BOINC client no longer exists - exiting
00:48:05 (5040): timer handler: client dead, exiting
00:48:15 (5040): BOINC client no longer exists - exiting

2018-10-02 21:57:51 (8796): VM state change detected. (old = 'running', new = 'paused')
2018-10-02 21:59:16 (8796): VM state change detected. (old = 'paused', new = 'running')

2018-10-02 23:59:28 (13212): Error in host info for VM: -108

2018-10-02 23:59:28 (13212): WARNING: Communication with VM Hypervisor failed. (Possibly Out of Memory).
2018-10-02 23:59:28 (13212): WARNING: Communication with VM Hypervisor failed.
2018-10-02 23:59:28 (13212): Could not communicate with VM Hypervisor. Rescheduling execution for a later date.

2018-10-03 16:27:14 (8460): VM state change detected. (old = 'running', new = 'paused')
2018-10-03 16:27:25 (8460): VM state change detected. (old = 'paused', new = 'running')

2018-10-06 00:51:01 (2028): VM did not stop when requested.
2018-10-06 00:51:01 (2028): VM was successfully terminated.

Your host tries hard to recover from all those errors, but in the end there are too many of them.


To resolve the situation you may try a setup with no more than 4 cores, or better only 1 or 2 cores, at least until you get a stable system.
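
For what it's worth, the memory figure in the log appears to follow the usual ATLAS VM sizing of roughly 3000 MB plus 900 MB per core (an assumption, but it matches the 10200 MB line above exactly):

8 cores: 3000 + 900 × 8 = 10200 MB
2 cores: 3000 + 900 × 2 = 4800 MB
1 core:  3000 + 900 × 1 = 3900 MB

So dropping to 1 or 2 cores also frees several gigabytes of RAM for the rest of the host.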
ID: 36994
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36995 - Posted: 10 Oct 2018, 7:57:15 UTC - in response to Message 36993.  

I think maybe they recently changed their policy on validating tasks that fail, which is a good thing because it lets users know they are doing something wrong and motivates them to correct it.


Definitely not.
I absolutely recall a post from one of the admins stating they had revised the policy and would be failing tasks they were not previously failing. I can agree that it might not be motivating all volunteers to correct configuration problems on their hosts.

If it were a problem with the tasks, you would see several other volunteers reporting it too.


There are issues with recent tasks, hence this thread, "runs very slowly after xx%". I'm seeing the same behaviour, though I'm not as worried.
I see absolutely no evidence that "runs very slowly after 94%" even indicates a problem. Maybe that's just the way they work. Or maybe it's because BOINC is unable to monitor the progress accurately.

When the server died on Oct 3rd, I had several tasks run 10+ hours. ATLAS depends on the underlying infrastructure, and there still seem to be issues.
Indeed there are issues with the infrastructure, as you mentioned. Another one is the recurring rash of failed downloads that hits all hosts every so often. When I said "If it were a problem with the tasks, you would see several other volunteers reporting it too" I was not trying to say there are no problems with ATLAS tasks. I was trying to say Ola's failed tasks are not due to a problem with the ATLAS tasks themselves but rather to the tasks being interrupted repeatedly.
ID: 36995
Jim Wilkins

Joined: 22 Aug 06
Posts: 22
Credit: 466,060
RAC: 0
Message 37008 - Posted: 11 Oct 2018, 14:30:55 UTC - in response to Message 36700.  

I will admit that I have not executed Yeti's checklist. I thought it was PC-oriented and I am running a Mac. I will check it out.

It does seem that as long as the computer is up, ATLAS completes. But if anything stops ATLAS execution, it usually fails.

I will check out the checklist.

Thanks,
Jim
ID: 37008
