Message boards : ATLAS application : Atlas tasks "Postponed: VM job unmanageable, restarting later."
Profile adrianxw

Send message
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 35864 - Posted: 12 Jul 2018, 4:24:51 UTC

No new tasks set again. I just checked my settings again: I have clearly said I do not want ATLAS tasks, yet, looking at my work done, it IS sending me ATLAS.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 35864 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2500
Credit: 248,633,844
RAC: 126,444
Message 35865 - Posted: 12 Jul 2018, 5:22:59 UTC - in response to Message 35864.  

Which venue (none/default, home, school, work) did you assign to your computers?
Did you edit exactly that venue?

Everybody would say: "Yes, of course! I wrote it here and there!" Nonetheless, you may want to check it again very carefully.


The logs point out that your computers are very busy.
Hard to say why. It may be just 1 single component, a driver or a setting.
If you don't identify the bottleneck, you may sooner or later get further crashes.

If you are patient enough, you may run only Theory tasks for a while:
- 1-core setup
- start with not more than 4 of them on your 8 core computers
- don't start them concurrently to avoid a disk bottleneck

Let them finish and examine the logs.
If they are OK, raise the number of concurrently running VMs step by step.
At which number do the errors return?
Stay below that number and test other apps.
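
The step-by-step test above could be sketched roughly like this. It is only an illustration: `batch_ok` is a hypothetical callback standing in for "run n VMs, let them finish, examine the logs", and the default start of 4 and limit of 8 are taken from the 8-core example in this post.

```python
from typing import Callable


def find_stable_vm_count(batch_ok: Callable[[int], bool],
                         start: int = 4,
                         max_vms: int = 8) -> int:
    """Return the largest VM count that ran cleanly before errors returned.

    batch_ok(n) is a hypothetical callback: run n VMs, let them finish,
    check the logs, and return True if no errors were found.
    A result of 0 means even the starting count failed, so the test
    should be repeated with a smaller start value.
    """
    last_good = 0
    for n in range(start, max_vms + 1):
        if not batch_ok(n):
            break  # errors returned at n, so stay below that number
        last_good = n
    return last_good
```

For example, if batches stay clean up to 5 VMs and start failing at 6, the function returns 5.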
ID: 35865 · Report as offensive     Reply Quote
Profile adrianxw

Send message
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 35866 - Posted: 12 Jul 2018, 6:40:00 UTC
Last modified: 12 Jul 2018, 7:28:53 UTC

I just checked to be sure, but it is as I thought: I do not have a separate home/work/school setup anymore. When I still worked, I did, but now all my machines are here, in this room, and run default.

Yes, my machines are busy, they have 4GHz i7's and run 24/7 and are connected to about 10 projects.

Sure, I can jump through a load of hoops, and fiddle with things. The action I took was to say no to ATLAS, it sent me ATLAS. That is a project server action.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 35866 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2500
Credit: 248,633,844
RAC: 126,444
Message 35868 - Posted: 12 Jul 2018, 7:44:46 UTC - in response to Message 35866.  

I just checked to be sure, but it is as I thought: I do not have a separate home/work/school setup anymore. When I still worked, I did, but now all my machines are here, in this room, and run default.

You may navigate to the webpage showing your computer details and reassign it again to "default".
If this doesn't help, you may reattach your computer with a fresh computer ID.

To do this:
1. detach from the project
2. edit cc_config.xml in your client folder and insert the line <ncpus>x</ncpus> there.
x has to be a number that is different from the number of cores you currently use (most likely 8).
3. reload your configuration files
4. reattach to the project
5. check your local client messages. There must be a message from the server that your computer got a fresh ID.
6. remove <ncpus>x</ncpus> from cc_config.xml
7. reload your configuration files
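
Steps 2 and 6 can be done by hand in a text editor. If you prefer to script them, a minimal sketch could look like the one below. It assumes that `<ncpus>` belongs inside the `<options>` section of cc_config.xml; the function names `set_ncpus` and `remove_ncpus` are made up for this example.

```python
import xml.etree.ElementTree as ET


def set_ncpus(config_xml: str, ncpus: int) -> str:
    """Return the config text with <ncpus> set inside <options> (step 2)."""
    root = ET.fromstring(config_xml)
    options = root.find("options")
    if options is None:
        # Assumption: ncpus lives under <options>; create it if missing.
        options = ET.SubElement(root, "options")
    node = options.find("ncpus")
    if node is None:
        node = ET.SubElement(options, "ncpus")
    node.text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


def remove_ncpus(config_xml: str) -> str:
    """Return the config text with every <ncpus> entry removed (step 6)."""
    root = ET.fromstring(config_xml)
    options = root.find("options")
    if options is not None:
        for node in options.findall("ncpus"):
            options.remove(node)
    return ET.tostring(root, encoding="unicode")
```

Both functions work on the file content as a string, so you can check the result before writing it back. Note that ElementTree drops XML comments when parsing, so keep a backup of the original file.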

Yes, my machines are busy, they have 4GHz i7's and run 24/7 and are connected to about 10 projects.

How many tasks (all projects) do you run concurrently?
Depending on the overall project mix many 8 core computers are already saturated when they run 5-7 tasks concurrently.
ID: 35868 · Report as offensive     Reply Quote
Profile adrianxw

Send message
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 35870 - Posted: 12 Jul 2018, 9:04:36 UTC

I have not used the home/work/school classes since I was working, that ended in 2009, (the voluntary work I've done on and off since then would not have had appropriate facilities to make it worth setting up again). I have rebuilt these machines since then, more than once, so they will never have been assigned a work class, they would be default since BOINC was installed.

The machines run 24/7 and I do not limit BOINC in any way, so all cores/GPUs can be busy at all times. I therefore see 8 tasks running, sometimes 9, because Milkyway does not use a whole CPU, so another project can grab a few cycles here and there.

I am not totally convinced BOINC handles multithreading correctly. It should, but I've seen a few unexplained things which leave me with doubts. I dropped one project because of its enthusiasm for spinning threads; I don't remember which it was.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 35870 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2500
Credit: 248,633,844
RAC: 126,444
Message 35871 - Posted: 12 Jul 2018, 9:39:49 UTC - in response to Message 35870.  

The machines run 24/7 and I do not limit BOINC in any way, so all cores/GPUs can be busy at all times. I therefore see 8 tasks running, sometimes 9 ...

This sounds like it's your objective to keep all CPU cores plus your GPU permanently under full load.
Sorry, but in this case I'm out, as I have no idea how to set up BOINC and the vbox apps of this project to run error-free in such a scenario.
ID: 35871 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1156
Credit: 52,785,125
RAC: 62,721
Message 35872 - Posted: 12 Jul 2018, 9:51:18 UTC - in response to Message 35871.  

I have done that for years with Einstein GPU tasks and LHC multi-core tasks (Theory multi-core here and LHCb test multi-cores), with no problems on either my older quad-core or my newer 8-core.

You have just enough ram on your two 8-core pc's to run all 8 cores of Theory multi-cores but not enough to do that with Atlas

I am actually doing a test right now on a few 8-cores that are running 4 X2 Theory tasks and one LHCb task here at the same time (9 tasks for some reason)

But I may only do one batch since the only complete one so far has about 18hrs running but only just over 5mins of CPU time and was Valid
ID: 35872 · Report as offensive     Reply Quote
Profile adrianxw

Send message
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 35873 - Posted: 12 Jul 2018, 10:36:39 UTC

The real point or question or complaint I have right now is simply, why did it download and run an ATLAS when I have specifically told it not to do so. It suggests that either the mechanism for selecting work types is not working correctly, or that it is being ignored as the project desires.

I check from time to time, usually I don't see more than 66% of the available RAM in use, 2GB per thread should be adequate with a spread project base. RAMMap invariably shows a chunk of unused.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 35873 · Report as offensive     Reply Quote
Jim1348

Send message
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 35874 - Posted: 12 Jul 2018, 13:05:13 UTC - in response to Message 35873.  

The real point or question or complaint I have right now is simply, why did it download and run an ATLAS when I have specifically told it not to do so. It suggests that either the mechanism for selecting work types is not working correctly, or that it is being ignored as the project desires.

I have seen a similar issue, though I don't know that it is exactly the same. But I can't allow other tasks if I don't want to run ATLAS at all.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4712
ID: 35874 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 817
Credit: 684,058,907
RAC: 136,472
Message 35877 - Posted: 12 Jul 2018, 18:06:01 UTC

I see about 1/day on my machines. The sad thing is that it worked fine with the 5.1.x branch.
ID: 35877 · Report as offensive     Reply Quote
Harri Liljeroos
Avatar

Send message
Joined: 28 Sep 04
Posts: 710
Credit: 47,460,432
RAC: 29,226
Message 35878 - Posted: 12 Jul 2018, 18:08:36 UTC - in response to Message 35873.  

The real point or question or complaint I have right now is simply, why did it download and run an ATLAS when I have specifically told it not to do so. It suggests that either the mechanism for selecting work types is not working correctly, or that it is being ignored as the project desires.

I check from time to time, usually I don't see more than 66% of the available RAM in use, 2GB per thread should be adequate with a spread project base. RAMMap invariably shows a chunk of unused.

What is your setting for "If no work for selected applications is available, accept work from other applications?" You should set it to "no".
ID: 35878 · Report as offensive     Reply Quote
Profile adrianxw

Send message
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 35879 - Posted: 12 Jul 2018, 18:20:07 UTC
Last modified: 12 Jul 2018, 19:12:28 UTC

YES! Great thought, it was set, I have just unset it. It is so rare that I have fiddled with that, it had gone out of my view finder. I guess I expect the other jobs to be from subprojects that are enabled. There is a possible bug here. I'll send a note to BOINC.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 35879 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35880 - Posted: 12 Jul 2018, 21:11:11 UTC - in response to Message 35879.  

I guess I expect the other jobs to be from subprojects that are enabled. There is a possible bug here. I'll send a note to BOINC.


So you're saying that if there are no tasks for the applications you have selected then it should send you a task from one of the applications you have selected. But it's been determined that there are no tasks for the apps you selected so it cannot send one of those. Maybe I'm confused but if that's the way it should work then there is no logical reason to have that option.

Seems to me it's working exactly as it should (no bug). You did not have ATLAS selected and there were no tasks for apps you selected so it sent you one from those you de-selected. Am I missing something?
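
To make the logic described above concrete, here is a small sketch of how I read the option. It is not the real BOINC scheduler code, just an illustration; `pick_app` and its parameters are invented for the example.

```python
from typing import Optional, Set


def pick_app(selected: Set[str],
             available: Set[str],
             accept_other: bool) -> Optional[str]:
    """Pick an application to send work from, or None if nothing qualifies.

    selected:     apps enabled in the project preferences
    available:    apps that currently have tasks on the server
    accept_other: the "accept work from other applications" option
    """
    wanted = sorted(selected & available)
    if wanted:
        return wanted[0]  # work exists for a selected app
    if accept_other:
        others = sorted(available - selected)
        if others:
            return others[0]  # fall back to a de-selected app
    return None  # no selected work and fallback disabled (or nothing at all)
```

With only Theory selected, only ATLAS available, and the option switched on, this returns "ATLAS". With the option switched off, it returns None.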
ID: 35880 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 817
Credit: 684,058,907
RAC: 136,472
Message 35891 - Posted: 14 Jul 2018, 9:56:50 UTC

I can't stand it with 5.2, I get just over 50% failure rate
ID: 35891 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2190
Credit: 173,343,512
RAC: 62,427
Message 35892 - Posted: 14 Jul 2018, 10:32:01 UTC - in response to Message 35891.  
Last modified: 14 Jul 2018, 10:34:28 UTC

Toby,
saw this:
2018-07-14 06:03:42 (12748): ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time.
I have upgraded to 5.2.14 and have never seen it since.
Edit: I also changed vboxsvc.exe in Task Manager to a lower-than-normal priority.
ID: 35892 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 817
Credit: 684,058,907
RAC: 136,472
Message 35893 - Posted: 14 Jul 2018, 10:42:37 UTC - in response to Message 35892.  

OK, I'll try .14 first and see if things improve.
ID: 35893 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1374
Credit: 9,159,123
RAC: 5,192
Message 35894 - Posted: 14 Jul 2018, 17:59:06 UTC - in response to Message 35892.  

...
Edit: I also changed vboxsvc.exe in Task Manager to a lower-than-normal priority.

That's the most important change for preventing the "Postponed tasks..."
Increasing priority of vboxwrapper.exe to the same priority as VBoxSVC.exe or higher will also help a bit.
ID: 35894 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 817
Credit: 684,058,907
RAC: 136,472
Message 35907 - Posted: 15 Jul 2018, 9:37:13 UTC

All of my vbox processes are at normal priority, and the wrappers are a mix of low and below normal.

I assume ATLAS is troublesome as it has the 26196 wrapper and the other projects use the 29198?

Changing the wrapper priority is a management headache, as the wrappers are created on task launch, so it needs constant attention?

I don't see much change with .14, but it normally takes a few days to flush out any issues.
ID: 35907 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2500
Credit: 248,633,844
RAC: 126,444
Message 35911 - Posted: 15 Jul 2018, 10:13:58 UTC - in response to Message 35907.  

I don't know if all of the following methods work on Windows or how much effort it is to implement them.
On my Linux boxes I use a combination of:

1. starting the BOINC client explicitly with a distinct nice level

2. using <no_priority_change>, <process_priority> and <process_priority_special> in cc_config.xml.
An explanation can be found here.

3. a script that periodically (once per minute) checks for recently started VMs/vboxwrappers and adjusts their nice level.
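
A script like the one in item 3 might look roughly like the sketch below. This is not my actual script, only a minimal Linux example. The process name prefixes `vboxwrapper` and `VBoxHeadless` are assumptions (check the names on your own system), and the function names are made up.

```python
import os

# Assumed name prefixes. /proc/<pid>/comm is truncated to 15 characters,
# so we match on prefixes rather than full executable names.
TARGET_PREFIXES = ("vboxwrapper", "VBoxHeadless")


def matching_pids(processes, prefixes=TARGET_PREFIXES):
    """Return the PIDs whose process name starts with one of the prefixes.

    processes is an iterable of (pid, name) pairs.
    """
    return [pid for pid, name in processes if name.startswith(prefixes)]


def list_processes(proc_dir="/proc"):
    """Read (pid, name) pairs from the Linux /proc filesystem."""
    found = []
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_dir, entry, "comm")) as fh:
                found.append((int(entry), fh.read().strip()))
        except OSError:
            continue  # the process exited while we were scanning
    return found


def renice_all(nice_level, processes=None):
    """Set the nice level of all matching processes; return the changed PIDs."""
    if processes is None:
        processes = list_processes()
    changed = []
    for pid in matching_pids(processes):
        try:
            os.setpriority(os.PRIO_PROCESS, pid, nice_level)
            changed.append(pid)
        except (PermissionError, ProcessLookupError):
            continue  # not allowed to change it, or the process is gone
    return changed
```

Calling `renice_all(...)` once per minute, for example from cron, would cover the periodic check. Keep in mind that lowering a nice value (raising the priority) usually requires root.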


To get a stable setting the configured values can be different for each computer, so lots of testing is necessary.
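
As an illustration of item 2, the sketch below builds a cc_config.xml fragment with the three options. It assumes the options live inside `<options>` and that the priority values range from 0 (lowest) to 5, which is how I understand the BOINC client documentation; please verify that against the documentation before relying on it. The function name and the example values are placeholders, not recommendations.

```python
import xml.etree.ElementTree as ET


def priority_options(no_priority_change: bool,
                     process_priority: int,
                     process_priority_special: int) -> str:
    """Build a cc_config.xml fragment with the three priority options."""
    for value in (process_priority, process_priority_special):
        # Assumed valid range; check the BOINC documentation.
        if not 0 <= value <= 5:
            raise ValueError("priority must be between 0 and 5")
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    flag = "1" if no_priority_change else "0"
    ET.SubElement(options, "no_priority_change").text = flag
    ET.SubElement(options, "process_priority").text = str(process_priority)
    ET.SubElement(options, "process_priority_special").text = str(
        process_priority_special)
    return ET.tostring(root, encoding="unicode")
```

Calling `priority_options(False, 1, 2)` returns a one-line XML string that can be compared with, or merged into, an existing cc_config.xml.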
ID: 35911 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 817
Credit: 684,058,907
RAC: 136,472
Message 35913 - Posted: 15 Jul 2018, 11:22:35 UTC - in response to Message 35911.  

Will give a try with the prio settings, this is better as they will just launch with correct settings.
ID: 35913 · Report as offensive     Reply Quote