Atlas tasks "Postponed: VM job unmanageable, restarting later."
Author | Message |
---|---|
Send message Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0 |
I have set "No new tasks" again. I just rechecked my settings: I have clearly said I do not want ATLAS tasks, yet, looking at my work done, it IS sending me ATLAS. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,633,844 RAC: 126,444 |
Which venue (none/default, home, school, work) did you assign to your computers? Did you edit exactly that venue? Everybody would say: "Yes, of course! I wrote it here and there!", nonetheless you may want to check it again very carefully. The logs point out that your computers are very busy. Hard to say why. It may be just 1 single component, a driver or a setting. If you don't identify the bottleneck, you may sooner or later get further crashes. If you are patient enough, you may run only Theory tasks for a while: - 1-core setup - start with not more than 4 of them on your 8-core computers - don't start them concurrently, to avoid a disk bottleneck. Let them finish and examine the logs. If they are OK, raise the number of concurrently running VMs step by step. At which number do the errors return? Stay below that number and test other apps. |
Send message Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0 |
I just checked to be sure, but it is as I thought: I do not have a separate home/work/school setup anymore. When I still worked, I did, but now all my machines are here, in this room, and run default. Yes, my machines are busy, they have 4GHz i7s, run 24/7 and are connected to about 10 projects. Sure, I can jump through a load of hoops and fiddle with things. The action I took was to say no to ATLAS, and it sent me ATLAS. That is a project server action. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,633,844 RAC: 126,444 |
I just checked to be sure, but it is as I thought: I do not have a separate home/work/school setup anymore. When I still worked, I did, but now all my machines are here, in this room, and run default. You may navigate to the webpage showing your computer details and reassign it to "default" again. If this doesn't help, you may reattach your computer with a fresh computer ID. To do this: 1. detach from the project 2. edit cc_config.xml in your client folder and insert the line <ncpus>x</ncpus> there. x has to be a number that is different from the number of cores you currently use (most likely 8). 3. reload your configuration files 4. reattach to the project 5. check your local client messages. There must be a message from the server that your computer got a fresh ID. 6. remove <ncpus>x</ncpus> from cc_config.xml 7. reload your configuration files. Yes, my machines are busy, they have 4GHz i7s, run 24/7 and are connected to about 10 projects. How many tasks (all projects) do you run concurrently? Depending on the overall project mix, many 8-core computers are already saturated when they run 5-7 tasks concurrently. |
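Steps 2 and 6 of the list above (adding and later removing `<ncpus>`) could be scripted roughly like this. It is only a sketch, assuming that `<ncpus>` sits inside the `<options>` element of cc_config.xml; the helper names `set_ncpus` and `remove_ncpus` and the tiny example config are made up for illustration. It works on the XML text only, so detaching, reattaching and reloading the configuration files still have to be done in the BOINC client.

```python
import xml.etree.ElementTree as ET

# Assumption: a minimal cc_config.xml; a real file may hold more options.
EXAMPLE_CONFIG = "<cc_config><options></options></cc_config>"


def set_ncpus(config_xml, ncpus):
    """Return the config text with <ncpus> set to `ncpus` inside <options>."""
    root = ET.fromstring(config_xml)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    node = options.find("ncpus")
    if node is None:
        node = ET.SubElement(options, "ncpus")
    node.text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


def remove_ncpus(config_xml):
    """Return the config text with every <ncpus> entry under <options> removed."""
    root = ET.fromstring(config_xml)
    options = root.find("options")
    if options is not None:
        for node in options.findall("ncpus"):
            options.remove(node)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 7 is just an example value that differs from the 8 cores in use.
    temporary = set_ncpus(EXAMPLE_CONFIG, 7)
    print(temporary)
    print(remove_ncpus(temporary))
```

To use it on a real client, read cc_config.xml into a string, pass it through one of the helpers and write the result back before reloading the configuration files.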
Send message Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0 |
I have not used the home/work/school classes since I was working, which ended in 2009 (the voluntary work I've done on and off since then would not have had appropriate facilities to make it worth setting up again). I have rebuilt these machines since then, more than once, so they will never have been assigned a work class; they would have been default since BOINC was installed. The machines run 24/7 and I do not limit BOINC in any way, so all cores/GPUs can be busy at all times. I therefore see 8 tasks running, sometimes 9, because Milkyway does not use a whole CPU, so another project can grab a few cycles here and there. I am not totally convinced BOINC handles multithreading correctly; it should, but I've seen a few unexplained things which leave me with doubts. I dropped one project because of its enthusiasm for spinning threads, I don't remember which it was. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,633,844 RAC: 126,444 |
The machines run 24/7 and I do not limit BOINC in any way, so all cores/GPUs can be busy at all times. I therefore see 8 tasks running, sometimes 9 ... This sounds like it's your objective to keep all CPU cores plus your GPU permanently under full load. Sorry, but in this case I'm out, as I have no idea how to set up BOINC and the vbox apps of this project to run error-free in such a scenario. |
Send message Joined: 24 Oct 04 Posts: 1156 Credit: 52,785,125 RAC: 62,721 |
I have done that with Einstein GPU tasks and LHC multi-core tasks for years with no problems, on my older quad-core and my newer 8-core (Theory multi-core here and LHCb test multi-cores). You have just enough RAM on your two 8-core PCs to run all 8 cores of Theory multi-cores, but not enough to do that with ATLAS. I am actually doing a test right now on a few 8-cores that are running 4 X2 Theory tasks and one LHCb task here at the same time (9 tasks for some reason). But I may only do one batch, since the only completed one so far ran for about 18 hrs but used only just over 5 mins of CPU time, and was Valid. |
Send message Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0 |
The real point or question or complaint I have right now is simply, why did it download and run an ATLAS when I have specifically told it not to do so. It suggests that either the mechanism for selecting work types is not working correctly, or that it is being ignored as the project desires. I check from time to time, and usually I don't see more than 66% of the available RAM in use; 2GB per thread should be adequate with a spread project base. RAMMap invariably shows a chunk of unused RAM. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
The real point or question or complaint I have right now is simply, why did it download and run an ATLAS when I have specifically told it not to do so. It suggests that either the mechanism for selecting work types is not working correctly, or that it is being ignored as the project desires. I have seen a similar issue, though I don't know that it is exactly the same. But I can't allow other tasks if I don't want to run ATLAS at all. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4712 |
Send message Joined: 27 Sep 08 Posts: 817 Credit: 684,058,907 RAC: 136,472 |
I see about 1/day on my machines, sad thing is it worked fine with 5.1.x branch |
Send message Joined: 28 Sep 04 Posts: 710 Credit: 47,460,432 RAC: 29,226 |
The real point or question or complaint I have right now is simply, why did it download and run an ATLAS when I have specifically told it not to do so. It suggests that either the mechanism for selecting work types is not working correctly, or that it is being ignored as the project desires. What is your setting for "If no work for selected applications is available, accept work from other applications?" You should set it to "no". |
Send message Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0 |
YES! Great thought, it was set, I have just unset it. It is so rare that I have fiddled with that, it had gone out of my view finder. I guess I expect the other jobs to be from subprojects that are enabled. There is a possible bug here. I'll send a note to BOINC. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I guess I expect the other jobs to be from subprojects that are enabled. There is a possible bug here. I'll send a note to BOINC. So you're saying that if there are no tasks for the applications you have selected then it should send you a task from one of the applications you have selected. But it's been determined that there are no tasks for the apps you selected so it cannot send one of those. Maybe I'm confused but if that's the way it should work then there is no logical reason to have that option. Seems to me it's working exactly as it should (no bug). You did not have ATLAS selected and there were no tasks for apps you selected so it sent you one from those you de-selected. Am I missing something? |
Send message Joined: 27 Sep 08 Posts: 817 Credit: 684,058,907 RAC: 136,472 |
I can't stand it with 5.2, I get just over 50% failure rate |
Send message Joined: 2 May 07 Posts: 2190 Credit: 173,343,512 RAC: 62,427 |
Toby, I saw this: 2018-07-14 06:03:42 (12748): ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time. I have upgraded to 5.2.14 and have never seen it again after this upgrade! Edit: vboxsvc.exe changed in Task Manager to lower priority than normal, too. |
Send message Joined: 27 Sep 08 Posts: 817 Credit: 684,058,907 RAC: 136,472 |
OK, I'll try .14 first and see if things improve. |
Send message Joined: 14 Jan 10 Posts: 1374 Credit: 9,159,123 RAC: 5,192 |
... That's the most important change for preventing the "Postponed tasks..." Increasing priority of vboxwrapper.exe to the same priority as VBoxSVC.exe or higher will also help a bit. |
Send message Joined: 27 Sep 08 Posts: 817 Credit: 684,058,907 RAC: 136,472 |
All of my vbox processes are at normal priority, and the wrappers are a mix of low and below normal. I assume ATLAS is troublesome as it has the 26196 wrapper and the other projects use the 29198? Changing the wrapper priority is a management headache, as these processes are created on task launch, so it needs constant attention? I don't see much change with .14, but it normally takes a few days to flush out any issues. |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,633,844 RAC: 126,444 |
Don't know if all of the following methods work on Windows or how much effort it is to implement them. On my Linux boxes I use a combination of: 1. starting the BOINC client explicitly with a distinct nice level 2. using <no_priority_change>, <process_priority> and <process_priority_special> in cc_config.xml. An explanation can be found here. 3. a script that periodically (once per minute) checks for recently started VMs/vboxwrappers and adjusts their nice level. To get a stable setup, the configured values can be different for each computer, so lots of testing is necessary. |
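For method 3, a periodic check could look roughly like the sketch below. It is not the script from the post. It assumes a Linux box with a procps-style `ps`, and the process name prefixes and nice values in `TARGET_NICE` are made-up examples that would have to be tuned per computer. The function names are hypothetical as well.

```python
import os
import subprocess
import time

# Assumption: example name prefixes and nice values, to be tuned per machine.
TARGET_NICE = {"vboxwrapper": 5, "VBoxHeadless": 10}


def parse_ps(ps_output):
    """Parse header-less `ps` output (pid, nice, command) into tuples."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit():
            try:
                rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
            except ValueError:
                continue  # the nice column can be "-" for some scheduling classes
    return rows


def plan_renice(rows, targets=TARGET_NICE):
    """Return (pid, new_nice) pairs for matching processes not yet at target."""
    plan = []
    for pid, nice, comm in rows:
        for prefix, wanted in targets.items():
            if comm.startswith(prefix) and nice != wanted:
                plan.append((pid, wanted))
    return plan


def renice_once():
    """List all processes once and adjust the nice level of the matches."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ni=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    for pid, wanted in plan_renice(parse_ps(result.stdout)):
        try:
            os.setpriority(os.PRIO_PROCESS, pid, wanted)
        except (PermissionError, ProcessLookupError):
            pass  # reducing a nice value needs privileges; process may be gone


def run_forever(interval_seconds=60):
    """Repeat the check once per interval, as described in the post."""
    while True:
        renice_once()
        time.sleep(interval_seconds)
```

Calling `run_forever()` from a terminal or a service unit would repeat the check once per minute. Keeping the parsing and planning separate from the actual `os.setpriority` call makes it easy to try the matching rules on captured `ps` output first.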
Send message Joined: 27 Sep 08 Posts: 817 Credit: 684,058,907 RAC: 136,472 |
Will give the prio settings a try; this is better, as the tasks will just launch with the correct settings. |
©2024 CERN