Atlas tasks "Postponed: VM job unmanageable, restarting later."

Author	Message
bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 35923 - Posted: 15 Jul 2018, 21:40:14 UTC - in response to Message 35911. "3. a script that periodically (once per minute) checks for recently started VMs/vboxwrappers and renices their nice level." If that script is not too long, could you post it here, maybe in a separate thread? Or PM it to me? I would love to have a look at it and maybe incorporate it into the LHC babysitter script I've been working on for these past months. ID: 35923 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2411 Credit: 226,203,416 RAC: 129,829	Message 35928 - Posted: 16 Jul 2018, 4:15:46 UTC - in response to Message 35923. "If that script is not too long ..." It's just 1 line in /etc/crontab:
* * * * * root for VWRAP in $(pgrep -u boinc -l | egrep -i 'wrapper' | cut -d" " -f1); do renice -n 10 $VWRAP; done >/dev/null 2>&1
ID: 35928 · Reply Quote
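
For anyone who would rather keep this as a small script than as a crontab one-liner, the same idea could be sketched roughly as follows. This is only an illustration, not computezrmle's actual setup: the function names wrapper_pids and renice_wrappers are made up here, the BOINC client is assumed to run as user boinc (as in the crontab line), and the nice value 10 is simply the one used in that line.

```shell
#!/bin/sh
# Sketch only: the crontab one-liner split into two small functions.
# wrapper_pids and renice_wrappers are hypothetical names, not part of BOINC.

# Read "PID name" lines (the format printed by `pgrep -l`) on stdin and
# print the PIDs whose process name contains "wrapper", ignoring case.
wrapper_pids() {
    awk 'tolower($2) ~ /wrapper/ { print $1 }'
}

# Apply the same renice call as the crontab line to every wrapper process
# owned by the given user (assumed default: boinc).
renice_wrappers() {
    user="${1:-boinc}"
    nice_level="${2:-10}"
    for pid in $(pgrep -u "$user" -l | wrapper_pids); do
        renice -n "$nice_level" -p "$pid" >/dev/null 2>&1
    done
}
```

Saved to a file of your choosing with a final line such as `renice_wrappers boinc 10`, it could be called from /etc/crontab once per minute in place of the one-liner. Keeping the PID filter separate makes it easy to check against sample `pgrep -l` output before letting it touch real processes.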

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 35940 - Posted: 16 Jul 2018, 13:41:34 UTC - in response to Message 35928. bash is best ID: 35940 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 807 Credit: 652,067,873 RAC: 291,037	Message 35950 - Posted: 17 Jul 2018, 20:11:42 UTC - in response to Message 35911. Last modified: 17 Jul 2018, 21:36:02 UTC It would appear that <process_priority_special> doesn't do anything for this project: the wrapper still runs at low priority, which is contrary to the documentation. The normal process priority setting does boost it, as documented. Not off to a good start: with all processes running at normal priority I still got some unmanageable tasks after a few hours, but I will give it some time. ID: 35950 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 807 Credit: 652,067,873 RAC: 291,037	Message 35952 - Posted: 18 Jul 2018, 16:38:45 UTC - in response to Message 35950. It's maybe better with the change in priority, but the number of unmanageable tasks is still not 0, so the impact is the same: no more tasks are fetched because a postponed task blocks the queue. Given this, I have quit working for the ATLAS project, as the effort to babysit it is too much with the poor scheduling, the RAM configuration and now this. If there are some changes then I will try again in the future. ID: 35952 · Reply Quote

adrianxw Send message Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0	Message 36002 - Posted: 23 Jul 2018, 7:14:44 UTC - in response to Message 35952. I just got back yesterday, hence the delay. I have not seen anything that addresses the issue that I see and am unhappy with. I have detached my machines from LHC. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. ID: 36002 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 807 Credit: 652,067,873 RAC: 291,037	Message 36149 - Posted: 31 Jul 2018, 15:46:29 UTC Since I'm on windows, I made a Powershell script that runs and aborts the stuck tasks automatically. I downgraded to the 5.1.x Vbox and this improved things as well I think. ID: 36149 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36152 - Posted: 31 Jul 2018, 17:45:29 UTC - in response to Message 36149. "Since I'm on windows, I made a Powershell script that runs and aborts the stuck tasks automatically."
Then you run the risk of a scenario where the host is doing only 1 of 2 things: it's either 1) aborting a stuck task and downloading another one OR 2) in the process of turning another viable task into a stuck task. It's a vicious cycle... get a new task, hope to get lucky, blow it, get a new task, hope to get lucky, blow it, get a new task, hope to get lucky, blow it, and on and on and on.
"I downgraded to the 5.1.x Vbox and this improved things as well I think."
We all want to believe that our shiny new box with the CPU we've been drooling over for months coupled with more RAM than we've ever crammed onto a mobo before, fast drives, etc. can do anything we throw at it. Then reality sets in. And what do we do? Human nature is to deny and blame everybody but ourselves. I've been caught up in the "oh, it's gotta be a flaky version of this or that app" game before, and sometimes it is. More often than not, though, we are simply expecting too much of a rig we thought was absolutely invincible, and instead of accepting that we torture ourselves with weeks of re-installing this, rolling back to that, searching for the magic combination of X, Y and Z, and other fruitless endeavors.
Not sure how many cores you're trying to run ATLAS on, but sometimes we get more done by simply trying to do less than the max and doing that well. Sometimes we need to try doing just the minimum until we can do that each and every time consistently, and only once we reach that level should we kick it up a small notch. ID: 36152 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 807 Credit: 652,067,873 RAC: 291,037	Message 36153 - Posted: 31 Jul 2018, 18:51:31 UTC Last modified: 31 Jul 2018, 20:48:46 UTC I guess my previous actions were the best: if we don't try to work around the issues and instead stop working for the project, then they would address the requests that we have. I only run single-core tasks on my machines as that is the most efficient; I have one machine that runs dual-core tasks, as there is a limit on the number of tasks that you can run. Compared to the other projects I don't contribute much to ATLAS, as it's not possible to configure my computers to do more when they are shared with the other projects. I expected it to work well because it was working well before, which set my expectations, hence the rollback. If a task is postponed then, in my experience, it never comes back from that state on its own; I think if you restart BOINC then it will come back. The biggest problem with that state is that it stops BOINC from getting new tasks, so when left unmanaged it drains the queue. ID: 36153 · Reply Quote

Penguin Send message Joined: 2 Apr 12 Posts: 6 Credit: 334,298 RAC: 0	Message 36185 - Posted: 1 Aug 2018, 20:33:36 UTC Same error msg here after a day of crunching. It happens on Win10 and Win7 with the latest vbox. I did downgrade to 5.1.30 on the Win7 machine and am testing it now. Stopping BOINC and restarting works for about 4% of progress, then it stops again. The Win10 machine will be downgraded now too; will see what happens there. ID: 36185 · Reply Quote

nsandersen Send message Joined: 19 Sep 05 Posts: 3 Credit: 64,743 RAC: 0	Message 36697 - Posted: 13 Sep 2018, 8:12:08 UTC - in response to Message 35861. How do you filter out just the ATLAS tasks? (Indeed, these constantly seem to be "unmanageable" and "postponed" a long time, blocking other jobs.) Thank you. ID: 36697 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36699 - Posted: 13 Sep 2018, 15:41:20 UTC - in response to Message 36697. Go to your project settings at https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project&cols=1. There you will have up to 4 venues configured (default, home, work, school) or, depending on what you have configured in the past, you might have only the default venue configured. For each of the venues you have configured, click "edit", then deselect ALL of the following:
- ATLAS
- Run test applications
- If no work for selected applications is available, accept work from other applications
Don't forget to scroll to the bottom of the page and click "Update preferences". ID: 36699 · Reply Quote

BelgianEnthousiast Send message Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0	Message 37768 - Posted: 16 Jan 2019, 10:15:16 UTC - in response to Message 36153. Hi Toby, All, I've been encountering the same issue for about 2 months now (Postponed: VM job unmanageable, restarting later.).
I run BOINC version 7.14.2 (x64), wxWidgets version 3.0.1, and VB v6.0.0 r127566. I have:
- 32 GB of RAM
- 6 core CPU with hyperthreading activated (12 virtual cores)
- 2 x Asus GTX 1070Ti
- Ample disk space: a 3 TB drive dedicated to BOINC, with 2.5 TB still unused
- Win 10 Pro (10.0.17134), all latest patches installed
- All drivers kept up-to-date (mobo, drives, LAN, bluetooth, WiFi, etc.)
- Latest up-to-date Anti-Virus & Firewall software
I usually see RAM usage at 12-15 GB out of 32, with 15-20 GB available as reported by Win 10.
I've been running multiple projects simultaneously over time (not all of them at once of course, but a select number of them at any one time). On GPU level I'm mainly running GPUGrid; in the absence of GPUGrid work I have MilkyWay as backup (and Einstein, but only if MilkyWay is out of WU's). Each GPU card gets one virtual CPU attributed. On CPU level I run LHC, and as backup I have WorldComGrid, Rosetta and ClimatePrediction (only one selected as backup at a time). I allow LHC to run virtually all projects: LHCb, Atlas, Theory, CMS, etc. The WU's run in 5-core setups and recently in 7-core setups.
Total core usage: 2 for GPU's plus 7 for CPU, or 9 cores in total. Adding one core for BOINC and VirtualBox, I'm running on average 10 virtual cores out of 12. Apart from doing some mail or surfing on the internet, nothing else is happening on this system. Average CPU load is 80 %. The system has been running 24/7, 320/365 days, for 4 years now. On LHC in total I have racked up 3.537.241 credits, so I modestly think you could say I'm no longer a rookie :-).
The odd thing is that this error also kicks in in the middle of the night, when I'm not active, so there should be ample space, processing power and RAM available. Unless of course I've been hacked and someone's using my rig to mine cryptocurrencies... but I haven't found any trace so far.
So far I have only spotted Atlas WU's being affected. And it errors not only at the beginning of the WU, but at any possible time (e.g. I have one now at 81.47 %). The only way to unblock it is to suspend all CPU & GPU activity, give BOINC time to stop all crunching, exit BOINC in an orderly way and restart it; it then works again once all WU's are activated once more. Unfortunately, quite soon afterwards it errors out again, sometimes on the same WU, sometimes on another one.
LHC is the only project using VB, so most likely there is some interaction causing the issue. The odd thing is that it seems to occur with both VB v6 and v5.2.x. Moreover, having BOINC at the very latest version does not seem to offer any alleviation.
Actually, when checking VB, I get 2 instances which give an error "Inaccessible":
Runtime error opening 'F:\BOINC DATA\slots\2\boinc_a10398caff860c3b\boinc_a10398caff860c3b.vbox' for reading: -103 (Path not found.).
F:\tinderbox\win-rel\src\VBox\Main\src-server\MachineImpl.cpp[745] (long __cdecl Machine::i_registeredInit(void)).
Result Code: E_FAIL (0x80004005)
Component: MachineWrap
Interface: IMachine {5047460a-265d-4538-b23e-ddba5fb84976}
Then I have a third instance which is "Powered Off" and a fourth one which is Running (the current Atlas-7cores WU).
Should I reinstall VB? Any suggestions? ID: 37768 · Reply Quote
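
Before reinstalling VirtualBox, it may be worth checking whether the "Inaccessible" entries are just stale registrations whose slot directory no longer exists. A minimal sketch of how that could be done from a POSIX shell (for example Git Bash or WSL on Windows) is below. It assumes VBoxManage is on the PATH and is run as the same user that runs BOINC, and that `VBoxManage list vms` prints inaccessible machines as `"<inaccessible>" {uuid}`. The function names are hypothetical, and this is not an official BOINC or LHC@home procedure.

```shell
#!/bin/sh
# Sketch only: find VirtualBox registrations reported as inaccessible.
# Function names are hypothetical; run this with BOINC stopped.

# Read `VBoxManage list vms` output on stdin and print the UUID of every
# entry whose name is "<inaccessible>". Carriage returns are stripped
# first in case the tool is run on Windows.
inaccessible_vm_uuids() {
    tr -d '\r' | sed -n 's/^"<inaccessible>" {\([0-9a-fA-F-]*\)}$/\1/p'
}

# Unregister each inaccessible entry. This only removes the stale
# registration from VirtualBox; it is not called automatically.
unregister_inaccessible_vms() {
    VBoxManage list vms | inaccessible_vm_uuids | while read -r uuid; do
        echo "Unregistering inaccessible VM $uuid"
        VBoxManage unregistervm "$uuid"
    done
}
```

Running `VBoxManage list vms | inaccessible_vm_uuids` first shows which entries would be affected; `unregister_inaccessible_vms` can then be called by hand if the list looks right. Whether this has any effect on the "unmanageable" error itself is not established in this thread.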

Clandestinu Send message Joined: 17 Jun 15 Posts: 1 Credit: 401,631 RAC: 0	Message 37945 - Posted: 6 Feb 2019, 18:27:51 UTC Hi all, and thanks in advance for any suggestions. When I start jobs from LHC@home, after some working time and at around 30% done, the job stops, saying "Postponed: VM job unmanageable, restarting later." I'm working on a Gigabyte PC with an 8-core Intel processor and 16 GB of RAM. I don't know what to do, with the latest updates of BOINC and VBox installed. My apologies if I'm disturbing you, but I would like to understand… Many thanks again, have a good day. ID: 37945 · Reply Quote

Jonathan Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,261,384 RAC: 5,344	Message 37955 - Posted: 9 Feb 2019, 0:17:56 UTC - in response to Message 37945. I am not running any LHC tasks but I have recently come across this error running the VirtualBox tasks on cosmology@home. I had my system set to run 4 tasks with 2 CPUs each and the error was occurring. I switched to a single task using 8 CPUs and, so far, the tasks are running fine. I have been running this way for at least 24 hours now. The errors started after I set no new tasks, returned all work units and upgraded to VirtualBox 6.0.4. BOINC is 7.14.2. I have SMT on but everything else should show in my computer info. The computer is the one with the AMD processor. I don't know if this may help anyone with this error or not but I figured I would post it here. ID: 37955 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1280 Credit: 8,485,963 RAC: 1,693	Message 37956 - Posted: 9 Feb 2019, 8:49:40 UTC - in response to Message 37955. "I don't know if this may help anyone with this error or not but I figured I would post it here." Thanks Jonathan! I also noticed that VirtualBox 6.0.2, and even more so v6.0.4, is more sensitive to the famous error "Postponed: VM job unmanageable, restarting later", caused by a missing heartbeat from vboxwrapper. ID: 37956 · Reply Quote

Gary Send message Joined: 17 Apr 19 Posts: 2 Credit: 76,142 RAC: 0	Message 39065 - Posted: 6 Jun 2019, 7:12:10 UTC Last modified: 6 Jun 2019, 7:19:14 UTC Posting this again to this thread: There seems to be a specific problem with ATLAS's vboxwrapper executable where it loses control of the VM periodically: 5/7 such tasks eventually missed their deadline due to this on my machine (1 failed validation, 1 passed). Since this was just wasting slots/time for me, I disabled ATLAS tasks.
Now I have a Theory Simulation VM task and it has not yet aborted due to the VM unmanageable issue and seems to be progressing and checkpointing (I hibernate the machine when not in use and the progress is not resetting). Hopefully it will finish. There have not been any schedule/policy changes other than blocking ATLAS tasks: now this new VM task seems to actually work and progress normally. This is with VirtualBox VM 6.0.8 (I was using a version 5 before, with the same issue).
From the task properties and from looking at the files in C:\ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome:
Theory is using vboxwrapper_26198ab7_windows_x86_64.exe
ATLAS appears to have been using vboxwrapper_26196_windows_x86_64.exe
I assume that the major difference in the tasks is the .vdi used for each (I can see those in the same folder), so possibly this issue can be fixed (very easily?) by the developer updating the vboxwrapper for ATLAS. I think this has been noted before (maybe not in this thread though): is that too simple a fix for the issue? ID: 39065 · Reply Quote
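
To repeat this comparison on another host, the wrapper builds present in the project directory could be listed with something like the sketch below. It assumes a POSIX shell (for example Git Bash or WSL on Windows) and file names of the form vboxwrapper_<build>_<platform>, as in the two names quoted in the post. The function names are hypothetical, and the extracted token is simply whatever sits between the first two underscores; it is not confirmed here to be a version number.

```shell
#!/bin/sh
# Sketch only: pull the build token out of vboxwrapper file names.
# wrapper_build and list_wrapper_builds are hypothetical helper names.

# Print the token between "vboxwrapper_" and the next underscore.
wrapper_build() {
    printf '%s\n' "$1" | sed -n 's/^vboxwrapper_\([^_]*\)_.*$/\1/p'
}

# Print "build<TAB>file name" for every vboxwrapper file in a directory.
list_wrapper_builds() {
    dir="$1"
    for f in "$dir"/vboxwrapper_*; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
        name=$(basename "$f")
        printf '%s\t%s\n' "$(wrapper_build "$name")" "$name"
    done
}
```

Pointing `list_wrapper_builds` at the project directory named in the post (written in whatever form your shell uses for Windows paths) would show at a glance whether ATLAS and Theory use different wrapper builds on that machine.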

BelgianEnthousiast Send message Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0	Message 39093 - Posted: 10 Jun 2019, 6:00:24 UTC - in response to Message 35861. Hi Everyone, I'm experiencing exactly the same error with the Atlas project, and with no others. It happens on a quite frequent basis; pausing, exiting BOINC, rebooting and restarting again does the job. It is no longer on just one machine but now on 2. I'm running VBox 5.2.26 r128414 (Qt5.6.2) and BOINC Manager 7.14.2, Widgets 3.0.1. Both machines have 32 GB of RAM and a swap file of 48 GB (1.5 x RAM size). Memory usage peaks at 40 % of 32 GB (around 13-14 GB of actual RAM usage), so it is not even coming near 50 %, let alone the allowed maximum (BOINC is set to use at most 80 % of RAM). It is quite annoying since it also interferes with other projects such as GPUGrid or WorldComGrid. Reading through the above discussion, there's mention that wrapper v5.2 seems to be the cause of the issues. Are there any plans to start supporting it? (Oracle wants me to upgrade to 5.2.30 in the meantime...) Thanks for any updates :-) BE. ID: 39093 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 807 Credit: 652,067,873 RAC: 291,037	Message 39097 - Posted: 10 Jun 2019, 9:23:48 UTC - in response to Message 39093. Hi, 5.1.38 is reliable for me with ATLAS; none of the versions after that are reliable at 100% load. If I run my computer at 70% load then I can use 6.0.x. ID: 39097 · Reply Quote

BelgianEnthousiast Send message Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0	Message 39125 - Posted: 14 Jun 2019, 8:15:57 UTC - in response to Message 39097. Hi Toby, Thanks for the quick response! I understood from the above discussions that v5.1 apparently works with Atlas, so your statement confirms that. Which is a little bit odd, if I may say so: when downloading a new version of BOINC + VBOX you automatically get 5.2 something. So, if I install 3 new PC's, I have to install BOINC + VBOX first, then downgrade VBOX to 5.1 on all 3 PC's afterwards. That's a bit of a hassle... Are there any plans in the relatively short term to make Atlas compatible with v5.2? By the way, Oracle is already pushing to install v6.smth in the meantime... Or is the intention to leapfrog v5.2 and support v6 straight away? Wish you a nice weekend! Friendly Greetings, K. ID: 39125 · Reply Quote

LHC@home