ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days)
Author | Message |
---|---|
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
I went through your checklist. In answer to each point: 1) I'm using the BOINC x64 client (Manager) 7.8.3 for Win7 and Win10 (on all rigs); 2) I'm using VirtualBox 5.2.6 (Win7 + 10) on all but one rig, and VirtualBox 5.2.2 (Win7) on the remaining rig; I don't use Hyper-V (?) or Docker; 3) the correct Extension Pack is installed (of no relevance here); 4) VT-x is and has been on; 5) the command in client_state.xml shows the number as 0 (zero); 6) RAM = 64 GB on each, plenty of disk space (>250 GB on each rig); 7) in- and out-communications are OK; 8) the AVIRA anti-virus program poses no problem; 9) and 10) I'm not running ATLAS (too many problems in the past); the errors show up in CMS, LHCb and Theory Simulation. I'm not running any other project besides LHC (at the same time). The CPU times vary from 97 to 13,999 secs. I'm not overclocking - everything at stock ... Using 4 cores. ANY idea what I'm doing wrong? |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
"... ANY idea what I'm doing wrong?" As far as I can see, only CMS tasks failed on your computers. The error log tells you what happened: "207 (0x000000CF) EXIT_NO_SUB_TASKS". This means that the project wasn't able to deliver a work package to run inside a fresh VM; it is mostly caused by a WMAgent outage. You may check the CMS MB to see if other users report the same problem or if the project team (Ivan) suggests that you stop pulling CMS tasks for a while. |
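For anyone comparing error logs: the decimal and hexadecimal numbers in that message are the same exit code written two ways (207 = 0xCF). A minimal Python sketch of how such a line could be rebuilt from the bare number; the lookup table is a hypothetical helper and contains only the one code quoted in this thread.

```python
# Hypothetical lookup table: only the code quoted in this thread is filled in.
KNOWN_EXIT_CODES = {207: "EXIT_NO_SUB_TASKS"}


def describe_exit_code(code: int) -> str:
    """Format an exit code as 'decimal (0xHEX) NAME', as the log line does."""
    name = KNOWN_EXIT_CODES.get(code, "unknown")
    return f"{code} (0x{code:08X}) {name}"


print(describe_exit_code(207))
```

Codes that are not in the table are reported as "unknown" rather than guessed.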
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
Thanks for your answer. I appreciate it! But two WUs were aborted because of "no network connection" -- after almost 4 hours of computing! Why doesn't the WU wait? "exit init failure" is another one that doesn't really tell ME anything. Errors like that stop me from enjoying crunching time ... Have a nice day. |
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
"... ANY idea what I'm doing wrong?" OK - so I waited almost 2 months and then tried a couple of CMS WUs - and now, after over 2 hours of elapsed time, I get the following status message: "Postponed: VM job unmanageable, restarting later". At the same time LHCb, SixTrack and Theory are running nicely. Furthermore, no more WUs are being downloaded. There is enough RAM (64 GB), there is enough free disk space (>400 GB) and it is full moon outside ... So much for running the LHC project unattended. As far as I understand, using VBOX is supposed to take all the problems out of crunching, since one doesn't have to adapt the progs to the cruncher's rig ??? Great idea. Happy Easter and the likes ... I would appreciate further help/ideas ... |
Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
I sometimes get "Postponed: VM job unmanageable, restarting later". It seems most common after upgrading VirtualBox; I just abort those tasks and future tasks are good. Your PC must be configured OK, as Theory and LHCb are running. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
"Postponed: VM job unmanageable, restarting later" If you do nothing, the job will resume 86400 seconds (1 day) later. Restarting BOINC will try a resume immediately. |
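To see when a postponed job would come back by itself, the 86400-second delay can simply be added to the time the task was postponed. A small Python sketch; the timestamp is a made-up example, and the delay is taken from the post above rather than read from BOINC.

```python
from datetime import datetime, timedelta

POSTPONE_SECONDS = 86400  # delay quoted in the thread (1 day)


def resume_time(postponed_at: datetime, delay_s: int = POSTPONE_SECONDS) -> datetime:
    """Return the moment the postponed job would be retried if nothing is done."""
    return postponed_at + timedelta(seconds=delay_s)


# Hypothetical example: a task postponed on 30 March at 14:00.
postponed = datetime(2018, 3, 30, 14, 0)
print("Postponed at:", postponed)
print("Retried at:  ", resume_time(postponed))
```

On an unattended rig this is the window during which the task sits idle unless BOINC is restarted.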
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"There is enough Ram (64 GB), there is enough free disk space (>400GB) and it is full moon outside ..." Make sure that the BOINC "Computing preferences" in the "Disk and memory" tab allow for the use of sufficient memory and disk space. Simply having sufficient installed capacity is not enough. |
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
Thanks for the tip, but I am well aware of this. The following is set/checked under Computing preferences: DISK: leave at least 0.1 GB free; use no more than 90% of total. MEMORY: when computer is in use, use at most 95%; when computer is not in use, use at most 95%; leave non-GPU tasks in memory ... So this cannot be the problem. |
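To put rough numbers on those settings, here is a minimal Python sketch. It assumes that BOINC applies the most restrictive of the disk limits, and it uses hypothetical values for the total disk size (500 GB) and for the space BOINC already occupies (10 GB), since neither is given in the thread. Only the 64 GB of RAM, the >400 GB free and the percentages come from the posts.

```python
def memory_limit_gb(installed_gb: float, max_pct: float) -> float:
    """RAM BOINC may use under a 'use at most N%' preference."""
    return installed_gb * max_pct / 100


def disk_limit_gb(total_gb: float, free_gb: float, boinc_used_gb: float,
                  min_free_gb: float, max_pct_total: float) -> float:
    """Disk space BOINC may use, assuming the most restrictive limit wins."""
    # Limit from "leave at least X GB free": current free space plus what
    # BOINC already occupies, minus the reserve that must stay free.
    by_min_free = free_gb + boinc_used_gb - min_free_gb
    # Limit from "use no more than N% of total".
    by_pct = total_gb * max_pct_total / 100
    return max(0.0, min(by_min_free, by_pct))


print(memory_limit_gb(64, 95))  # 64 GB rig, 95% allowed
print(disk_limit_gb(total_gb=500, free_gb=400, boinc_used_gb=10,
                    min_free_gb=0.1, max_pct_total=90))
```

With values like these the allowance runs to tens of GB of RAM and hundreds of GB of disk, which fits the poster's conclusion that the preferences are not the bottleneck.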
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
Just restarted BOINC - and like magic new WUs were downloaded and the postponed job is running again! Thanks for your response! |
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
I am getting the "Postponed: VM job unmanageable ..." message on three of my rigs. Since BOINC does not download WUs until the postponed WU is automatically restarted, I have stopped running all VBOX projects (Theory, CMS, LHCb) until this problem gets fixed. I made this decision because I am now also getting the following message: "Postponed: Waiting to acquire slot directory lock. Another instance may be running". As I am running my rigs more or less unattended, I miss out on crunching time, because new WUs are not downloaded when other ones (SixTrack) have finished. When I have time I will update VBox from 5.2.6 to 5.2.8 - maybe the problem has been fixed. Have a nice day ... |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,923,727 RAC: 31,866 |
"Postponed: Waiting to acquire slot directory lock. Another instance may be running" I had the same problem a few weeks ago. So I opened the Oracle VirtualBox Manager and noticed that some VM jobs had hung. They showed up in the VirtualBox Manager in addition to the ones that were listed in the BOINC Manager. So what I did was: I deleted these "dead" jobs, and that was it. Everything worked fine again (I don't remember, though, whether I closed down BOINC and restarted it - maybe this must be done). |
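The comparison described above (VMs known to VirtualBox versus tasks known to BOINC) can also be done on the command line. The Python sketch below parses text in the format printed by `VBoxManage list vms`, one `"name" {uuid}` pair per line, and reports the names that are not in a list of VMs BOINC is expected to own. The VM names and UUIDs are invented for illustration, and how the list of BOINC-owned names is obtained is left open.

```python
import re

# One line of `VBoxManage list vms` output looks like: "name" {uuid}
VM_LINE = re.compile(r'^"(.+)"[ \t]+\{[0-9a-fA-F-]+\}[ \t]*$', re.MULTILINE)


def parse_vm_names(list_vms_output: str) -> set:
    """Extract the VM names from `VBoxManage list vms` output."""
    return set(VM_LINE.findall(list_vms_output))


def find_stale_vms(vbox_names, boinc_names) -> list:
    """Names registered in VirtualBox that BOINC does not list."""
    return sorted(set(vbox_names) - set(boinc_names))


# Hypothetical sample output; real names and UUIDs will differ.
sample = (
    '"boinc_aaaa" {11111111-2222-3333-4444-555555555555}\n'
    '"boinc_bbbb" {66666666-7777-8888-9999-000000000000}\n'
)

stale = find_stale_vms(parse_vm_names(sample), ["boinc_aaaa"])
print(stale)
```

The sketch only lists candidates. Whether a leftover VM is really "dead" still has to be checked by hand before deleting it.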
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
Well, in my case the number of jobs (WUs) shown in VBOX is the same as in BOINC. Some are powered off - some are saved - depending. Waiting a day or so (or restarting BOINC) seems to solve the problem/s. The bad part is that during the waiting period of one day (if unattended), BOINC will not download and start any other LHC WU (i.e. SixTrack, which is non-VBOX)! The WUs finish OK and without error. Also, I do not like the idea of having to "monitor" the LHC WUs - maybe I'm a bit picky, but it is not my job to solve these things. I'm wondering why no other crunchers are having the same "troubles" -- I'm not doing anything exotic. Thanks for your suggestions. |
Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0 |
For those who have the same issues: reducing the % of CPU cores used solves the problem. In my case, from 100% to 75%! Which is no fun, since the rigs aren't running at full power ... |
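For a feel of what that setting costs, a back-of-the-envelope Python sketch. It assumes that the client multiplies the core count by the percentage, rounds down and never goes below one core; that rounding rule is an assumption here, not something confirmed in this thread.

```python
def usable_cpus(n_cpus: int, max_pct: float) -> int:
    """Cores BOINC would use under a 'use at most N% of the CPUs' setting.

    Assumption: the result is truncated and never drops below 1.
    """
    return max(1, int(n_cpus * max_pct / 100))


for cores in (4, 8, 16):
    print(cores, "cores at 75% ->", usable_cpus(cores, 75), "in use")
```

Under that assumption, going from 100% to 75% leaves one core in four free, which may be what gives the VM processes room to respond.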
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
I believe they run at 100%, but maybe not at 100% CPU. It may be confusing, but the CPU load alone may be the wrong value to look at. |
Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
I have fewer errors with the 5.1.x branch at high utilisation. There are some errors after switching, but they go away after a new batch of WUs. |
©2024 CERN