Message boards : ATLAS application : Atlas tasks "Postponed: VM job unmanageable, restarting later."

bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35923 - Posted: 15 Jul 2018, 21:40:14 UTC - in response to Message 35911.  

3. a script that periodically (once per minute) checks for recently started VMs/vboxwrappers and renices their nice level.


If that script is not too long could you post it here, maybe in a separate thread? Or PM it to me? I would love to have a look at it and maybe incorporate it into the LHC babysitter script I've been working on for these past months.
ID: 35923
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2530
Credit: 253,722,201
RAC: 51,175
Message 35928 - Posted: 16 Jul 2018, 4:15:46 UTC - in response to Message 35923.  

If that script is not too long ...

It's just 1 line in /etc/crontab
* * * * *	root for VWRAP in $(pgrep -u boinc -l | grep -i 'wrapper' | cut -d" " -f1); do renice -n 10 -p "$VWRAP"; done >/dev/null 2>&1
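
For anyone who prefers a standalone script over a crontab one-liner, the same idea can be sketched as below. This is only a sketch under assumptions, not the setup actually used in the post: the function names `wrapper_pids` and `renice_wrappers` and the `--apply` switch are made up for illustration, and it assumes the BOINC client runs as the user `boinc` and that `pgrep -l` prints one `PID NAME` pair per line.

```shell
#!/bin/sh
# Filter a "PID NAME" listing (the format printed by `pgrep -l`) and
# print only the PIDs whose process name contains "wrapper".
wrapper_pids() {
    awk 'tolower($2) ~ /wrapper/ { print $1 }'
}

# Renice every matching process owned by the given user.
# The defaults (user "boinc", nice level 10) mirror the crontab line;
# both are assumptions about the local setup.
renice_wrappers() {
    boinc_user=${1:-boinc}
    nice_level=${2:-10}
    pgrep -u "$boinc_user" -l | wrapper_pids | while read -r pid; do
        renice -n "$nice_level" -p "$pid" >/dev/null 2>&1
    done
}

# Only act when explicitly asked, so sourcing the file has no side effects.
if [ "${1:-}" = "--apply" ]; then
    renice_wrappers "${2:-boinc}" "${3:-10}"
fi
```

Saved under a hypothetical name such as `renice-wrappers.sh`, it could be called from cron once per minute with `--apply`; the filter is kept separate so it can be checked against a sample listing without touching any real process.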
ID: 35928
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35940 - Posted: 16 Jul 2018, 13:41:34 UTC - in response to Message 35928.  

bash is best
ID: 35940
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 846
Credit: 691,128,132
RAC: 110,275
Message 35950 - Posted: 17 Jul 2018, 20:11:42 UTC - in response to Message 35911.  
Last modified: 17 Jul 2018, 21:36:02 UTC

It would appear that <process_priority_special> doesn't do anything for this project; the wrapper still runs at low priority, which is contrary to the documentation. The normal process priority setting boosts it as documented.

Not off to a good start: with all processes running at normal priority I still got some unmanageable tasks after a few hours, but I will give it some time.
ID: 35950
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 846
Credit: 691,128,132
RAC: 110,275
Message 35952 - Posted: 18 Jul 2018, 16:38:45 UTC - in response to Message 35950.  

It's maybe better with the change in priority, but the number of unmanageable tasks is still not 0, so the impact is the same: no more tasks are fetched because a postponed task blocks the queue.

Given this, I have stopped working for the ATLAS project, as the effort to babysit it is too much, what with the poor scheduling, the RAM configuration and now this.

If there are some changes, I will try again in the future.
ID: 35952
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 36002 - Posted: 23 Jul 2018, 7:14:44 UTC - in response to Message 35952.  

I just got back yesterday, hence the delay. I have not seen anything that addresses the issue that I see and am unhappy with. I have detached my machines from LHC.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 36002
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 846
Credit: 691,128,132
RAC: 110,275
Message 36149 - Posted: 31 Jul 2018, 15:46:29 UTC

Since I'm on Windows, I made a PowerShell script that runs and aborts the stuck tasks automatically.

I downgraded to VirtualBox 5.1.x, and I think this improved things as well.
ID: 36149
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36152 - Posted: 31 Jul 2018, 17:45:29 UTC - in response to Message 36149.  

Since I'm on Windows, I made a PowerShell script that runs and aborts the stuck tasks automatically.

Then you run the risk of a scenario where the host is doing only 1 of 2 things: it's either 1) aborting a stuck task and downloading another one OR 2) in the process of turning another viable task into a stuck task. It's a vicious cycle... get a new task, hope to get lucky, blow it, get a new task, hope to get lucky, blow it, get a new task, hope to get lucky, blow it, and on and on and on.

I downgraded to VirtualBox 5.1.x, and I think this improved things as well.

We all want to believe that our shiny new box, with the CPU we've been drooling over for months coupled with more RAM than we've ever crammed onto a mobo before, fast drives, etc., can do anything we throw at it. Then reality sets in. And what do we do? Human nature is to deny and blame everybody but ourselves. I've been caught up in the "oh, it's gotta be a flaky version of this or that app" game before, and sometimes it is. More often than not, though, we are simply expecting too much of a rig we thought was absolutely invincible, and instead of accepting that we torture ourselves with weeks of re-installing this, rolling back to that, searching for the magic combination of X, Y and Z, and other fruitless endeavors.

Not sure how many cores you're trying to run ATLAS on but sometimes we get more done by simply trying to do less than the max and doing that well. Sometimes we need to try doing just the minimum until we can do that each and every time consistently and only once we reach that level should we kick it up a small notch.
ID: 36152
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 846
Credit: 691,128,132
RAC: 110,275
Message 36153 - Posted: 31 Jul 2018, 18:51:31 UTC
Last modified: 31 Jul 2018, 20:48:46 UTC

I guess my previous action was the best one: if we don't try to work around the issues and instead stop working for the project, then they would address the requests that we have.

I only run single-core tasks on my machines as that is the most efficient; I have one machine that runs dual-core tasks, as there is a limit on the number of tasks that you can run. Compared to the other projects I don't contribute much to ATLAS, as it's not possible to configure my computers to do more when they are shared with the other projects.

I expected it to work well because it was working well before, and that set my expectations, hence the rollback.

If a task is postponed then, in my experience, it never comes back from that state on its own; I think it will come back if you restart BOINC. The biggest problem with that state is that it stops BOINC from getting new tasks, so when left unmanaged it drains the queue.
ID: 36153
Penguin
Joined: 2 Apr 12
Posts: 6
Credit: 334,298
RAC: 0
Message 36185 - Posted: 1 Aug 2018, 20:33:36 UTC

Same error msg here after a day of crunching. It happens on Win10 and Win7 with the latest VirtualBox. I downgraded to 5.1.30 on the Win7 machine and am testing it now. Stopping BOINC and restarting works for about 4% of progress, then the task stops again. The Win10 machine will be downgraded now too; I will see what happens there.
ID: 36185
nsandersen
Joined: 19 Sep 05
Posts: 3
Credit: 64,743
RAC: 0
Message 36697 - Posted: 13 Sep 2018, 8:12:08 UTC - in response to Message 35861.  

How do you filter out just the ATLAS tasks?
(Indeed, these constantly seem to be "unmanageable" and "postponed" a long time, blocking other jobs.)
Thank you.
ID: 36697
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36699 - Posted: 13 Sep 2018, 15:41:20 UTC - in response to Message 36697.  

Go to your project settings at https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project&cols=1. There you will have up to 4 venues configured (default, home, work, school) or, depending on what you have configured in the past, you might have only the default venue configured. For each of the venues you have configured, click "edit" and then deselect ALL of the following:

    ATLAS
    Run test applications
    If no work for selected applications is available, accept work from other applications



Don't forget to scroll to the bottom of the page and click "Update preferences".

ID: 36699
BelgianEnthousiast
Joined: 5 Apr 15
Posts: 18
Credit: 5,910,849
RAC: 0
Message 37768 - Posted: 16 Jan 2019, 10:15:16 UTC - in response to Message 36153.  

Hi Toby, All,

I've been encountering the same issue for about 2 months now (Postponed: VM job unmanageable, restarting later.).

I run BOINC version 7.14.2 (x64), wxWidgets version 3.0.1.
VB v6.0.0 r127566.

I have 32 GB of RAM
6 core CPU with hyperthreading activated (12 virtual cores)
2 x Asus GTX 1070Ti
Ample disk space, dedicated drive to BOINC of 3 TB with 2.5 TB still unused capacity.
Win 10 Pro (10.0.17134), all latest patches installed. All drivers kept up-to-date (mobo, drives, LAN, Bluetooth, WiFi, etc.).
Latest up-to-date Anti-Virus & Firewall software.

I usually see RAM usage at 12-15 GB out of 32, 15-20 GB available as reported by Win 10.

I've been running multiple projects over time simultaneously (not all of them at once of course, but a select number of them at any
one time).
On GPU level I'm mainly running GPUGrid, in absence of GPUGrid, I have MilkyWay as backup (and Einstein, but only if
MilkyWay is out of WU's)
Each GPU card gets one virtual CPU attributed.

On CPU level, I run LHC and as backup I have WorldComGrid, Rosetta, ClimatePrediction. (only one selected as backup at a time)

I allow LHC to run virtually all projects : LHCb, Atlas, Theory, CMS, etc.

The WU's run in 5-core setups and recently in 7-core setups.

Total core usage : 2 for GPU's, 7 for CPU or 9 cores in total. Adding one core for BOINC and VirtualBox, I'm running on average 10
virtual cores out of 12.
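
The core budget described above can be written out as a few lines of shell arithmetic. This is only a restatement of the numbers in the post; the variable names are made up for illustration, and the one core of overhead for BOINC and VirtualBox is the poster's own rough estimate.

```shell
# Core budget from the post: 12 logical cores on a 6-core CPU with hyperthreading.
logical_cores=12
gpu_feeder_cores=2   # one logical core per GPU card
atlas_task_cores=7   # the 7-core ATLAS setup
overhead_cores=1     # rough allowance for BOINC and VirtualBox

used=$((gpu_feeder_cores + atlas_task_cores + overhead_cores))
free=$((logical_cores - used))
load_pct=$((100 * used / logical_cores))   # integer division

echo "used=$used free=$free load=${load_pct}%"
# prints: used=10 free=2 load=83%
```

That leaves 2 logical cores of headroom, which is roughly in line with the 80 % average CPU load reported further down in the post.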

Apart from doing some mail or surfing on the internet, nothing else is happening on this system.
Average CPU load is 80 %
The system runs 24/7, 320/365 days for 4 years now.

On LHC in total, I racked up 3.537.241 credits, I modestly think you could say I'm no longer a rookie :-).

The odd thing is that this error also kicks in in the middle of the night, when I'm not active, so there should be ample space, processing
power and RAM available.
Unless of course I've been hacked and someone's using my rig to mine cryptocurrencies... but I haven't found any trace just yet.

So far I only spotted Atlas WU's being affected. And it errors not only at the beginning of the WU, but at any possible time (e.g. I have one now at 81.47 %)

The only way to unblock it, is to suspend all CPU & GPU activity, give BOINC the time to stop all crunching. Exit orderly from BOINC, restart BOINC
and then it works again when all WU's are activated once more. Unfortunately, quite soon afterwards, it errors out again. Sometimes still the same
WU, sometimes another one.

LHC is the only project using VB, so most likely there is some interaction causing an issue. The odd thing is that it seems to occur on both
v6 of VB and v5.2.x. Moreover, BOINC being at the very latest version does not seem to offer any alleviation.

Actually, when checking VB, I get 2 instances which give an error "Inaccessible" :
Runtime error opening 'F:\BOINC DATA\slots\2\boinc_a10398caff860c3b\boinc_a10398caff860c3b.vbox' for reading: -103 (Path not found.).
F:\tinderbox\win-rel\src\VBox\Main\src-server\MachineImpl.cpp[745] (long __cdecl Machine::i_registeredInit(void)).
Result Code: E_FAIL (0x80004005)
Component: MachineWrap
Interface: IMachine {5047460a-265d-4538-b23e-ddba5fb84976}

Then I have a third instance which is "Powered Off"

And a fourth one which is Running (the current Atlas-7cores WU).

Should I reinstall VB ?

Any suggestions ?
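
On the "Inaccessible" entries: these point at `.vbox` files in slot directories that no longer exist, so they can be unregistered from VirtualBox rather than reinstalling it. A minimal sketch follows, assuming that `VBoxManage list vms` prints such machines as `"<inaccessible>" {uuid}`, that a POSIX shell is available (on Windows, for example Git Bash or WSL, where line endings may need extra handling), and that BOINC is stopped first so no live task is touched. The function name `inaccessible_uuids` and the `--apply` switch are made up for illustration. Nothing in this thread confirms that the cleanup cures the "unmanageable" error itself.

```shell
#!/bin/sh
# Read `VBoxManage list vms` output on stdin and print the UUIDs of the
# machines listed as "<inaccessible>".
inaccessible_uuids() {
    sed -n 's/^"<inaccessible>" {\(.*\)}$/\1/p'
}

# Only unregister when explicitly asked; sourcing the file changes nothing.
if [ "${1:-}" = "--apply" ]; then
    VBoxManage list vms | inaccessible_uuids | while read -r uuid; do
        VBoxManage unregistervm "$uuid"
    done
fi
```

Running the pipeline without the `unregistervm` step first shows which entries would be removed.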
ID: 37768
Clandestinu
Joined: 17 Jun 15
Posts: 1
Credit: 401,633
RAC: 0
Message 37945 - Posted: 6 Feb 2019, 18:27:51 UTC

Hi all and thanks in advance for any proposals

When I run jobs from LHC@home, after some working time and with around 30% done, the job stops, saying "Postponed: VM job unmanageable, restarting later."
I'm working on a Gigabyte PC with an 8-core Intel processor and 16 GB RAM. I have the latest updates of BOINC and VBox and don't know what to do.
My apologies if I am disturbing you, but I would like to understand.
Many thanks again, have a good day.
ID: 37945
Jonathan
Joined: 25 Sep 17
Posts: 99
Credit: 3,425,566
RAC: 0
Message 37955 - Posted: 9 Feb 2019, 0:17:56 UTC - in response to Message 37945.  

I am not running any LHC tasks but I have recently come across this error running the Virtual Box tasks on cosmology@home.

I had my system set to run 4 tasks with 2 cpus each and the error was occurring. I switched to a single task using 8 cpus and, so far,
the tasks are running fine. I have been running this way for at least 24 hours now.

The errors started after I set no new tasks, returned all work units and upgraded to Virtual Box 6.0.4. Boinc is 7.14.2. I have SMT on but
everything else should show for my computer info. Computer is the AMD processor one

I don't know if this may help anyone with this error or not but I figured I would post it here.
ID: 37955
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 37956 - Posted: 9 Feb 2019, 8:49:40 UTC - in response to Message 37955.  

I don't know if this may help anyone with this error or not but I figured I would post it here.

Thanks Jonathan!

I also noticed that VirtualBox 6.0.2, and even more so v6.0.4, is more susceptible to the famous error "Postponed: VM job unmanageable, restarting later",
caused by a missed heartbeat in vboxwrapper.
ID: 37956
Gary
Joined: 17 Apr 19
Posts: 2
Credit: 76,142
RAC: 0
Message 39065 - Posted: 6 Jun 2019, 7:12:10 UTC
Last modified: 6 Jun 2019, 7:19:14 UTC

Posting this again to this thread:

There seems to be a specific problem with ATLAS's vboxwrapper executable where it loses control of the VM periodically: 5 of 7 such tasks eventually missed their deadline due to this on my machine (1 failed validation, 1 passed). Since this was just wasting slots/time for me, I disabled ATLAS tasks. Now I have a Theory Simulation VM task and it has not yet aborted due to the VM unmanageable issue and seems to be progressing and checkpointing (I hibernate the machine when not in use and the progress is not resetting). Hopefully it will finish.

There have not been any schedule/policy changes other than blocking ATLAS tasks: now this new VM task seems to actually work and progress normally. This is with VirtualBox VM 6.0.8 (was using a version 5 before with the same issue).

From the task properties and from looking at the files in C:\ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome :
Theory is using vboxwrapper_26198ab7_windows_x86_64.exe
ATLAS appears to have been using vboxwrapper_26196_windows_x86_64.exe
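
A quick way to see which wrapper builds are present is to pull the version tag out of each file name. The sketch below assumes the naming pattern seen above (`vboxwrapper_<tag>_<platform>.exe`) and a POSIX shell with forward-slash paths; the function names `wrapper_tag` and `list_wrappers` are made up for illustration.

```shell
#!/bin/sh
# Extract the version tag from a vboxwrapper file name, e.g.
# vboxwrapper_26198ab7_windows_x86_64.exe -> 26198ab7
wrapper_tag() {
    basename "$1" | sed -n 's/^vboxwrapper_\([^_]*\)_.*$/\1/p'
}

# Print "tag<TAB>file name" for every wrapper found in a project directory.
list_wrappers() {
    for f in "$1"/vboxwrapper_*; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matches
        printf '%s\t%s\n' "$(wrapper_tag "$f")" "$(basename "$f")"
    done
}
```

Calling `list_wrappers` on the project directory would list both builds side by side, which makes it easier to check whether ATLAS and Theory really ship different wrappers.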

I assume that the major difference in the tasks is the .vdi used for each (I can see those in the same folder), so possibly this issue can be fixed (very easily?) by the developer updating the vboxwrapper for ATLAS.

I think this has been noted before (maybe not in this thread though): is it too simple to fix the issue?
ID: 39065
BelgianEnthousiast
Joined: 5 Apr 15
Posts: 18
Credit: 5,910,849
RAC: 0
Message 39093 - Posted: 10 Jun 2019, 6:00:24 UTC - in response to Message 35861.  

Hi Everyone,

I'm experiencing exactly the same error with the Atlas project, and with no others.

I get them quite frequently; pausing, exiting BOINC, rebooting and restarting
does the job. This now happens not just on one machine, but on 2.

I'm running VBox 5.2.26 r 128414 (Qt5.6.2)
BOINC Manager 7.14.2, Widgets 3.0.1

Both machines have 32 GB of RAM, swap file of 48 GB (1.5 x RAM size).
Memory usage is peaking at 40 % of 32 GB (or around 13-14 GB of actual RAM usage).
So by far not coming near even 50 % of the allowed maximum (setting of 80 % of RAM max usage for BOINC).

It is quite annoying since it also interferes with other projects such as GPUGrid or WorldComGrid.

When reading through the above discussion, there's mention that wrapper v5.2 seems to be the cause of the issues.
Are there any plans to start to support it ? (Oracle wants me to upgrade to 5.2.30 in the meantime...)

Thanks for any updates :-)

BE.
ID: 39093
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 846
Credit: 691,128,132
RAC: 110,275
Message 39097 - Posted: 10 Jun 2019, 9:23:48 UTC - in response to Message 39093.  

Hi, 5.1.38 is reliable for me with ATLAS; all of the versions after that are not reliable at 100% load. If I run my computer at 70% load, then I can use 6.0.x.
ID: 39097
BelgianEnthousiast
Joined: 5 Apr 15
Posts: 18
Credit: 5,910,849
RAC: 0
Message 39125 - Posted: 14 Jun 2019, 8:15:57 UTC - in response to Message 39097.  

Hi Toby,

Thanks for the quick response !

I understood from above discussions that apparently v5.1 would work with Atlas, so your
statement confirms that.
Which is a little bit odd if I may say so : when downloading a new version of BOINC + VBOX
you automatically get 5.2 something.

So, if I install 3 new PC's I have to install BOINC + VBOX first, then downgrade VBOX to 5.1
on all 3 PC's afterwards. That's a bit of a hassle...

Are there any plans in the relatively short term to make Atlas compatible with v5.2 ?

By the way, Oracle is already pushing to install v6.something in the meantime...
Or is the purpose to leapfrog v5.2 and support v6 straight away ?

Wish you a nice weekend !

Friendly Greetings,
K.
ID: 39125


©2024 CERN