Thread 'ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days)'

Author	Message
San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34264 - Posted: 4 Feb 2018, 9:04:39 UTC
I went through your checklist. In answer to the points:
1) I'm using the BOINC x64 client (Manager) 7.8.3 for Win7 and Win10 on all rigs.
2) I'm using VirtualBox 5.2.6 (Win7 + 10) on all but one rig, and VirtualBox 5.2.2 (Win7) on the one other rig. I don't use Hyper-V (?) or Docker.
3) The correct Extension Pack is installed (of no relevance here).
4) VT-x is and has been on.
5) The command in client_state.xml shows the number as 0 (zero).
6) RAM = 64 GB on each rig, plenty of disk space (>250 GB on each rig).
7) Inbound and outbound communications are OK.
8) The AVIRA anti-virus program poses no problem.
9) and 10) I'm not running ATLAS (too many problems in the past); the errors show up in CMS, LHCb and Theory Simulation.
I'm not running any other project besides LHC (at the same time). The CPU times vary from 97 to 13,999 secs. I'm not overclocking, everything is at stock ... Using 4 cores.
ANY idea what I'm doing wrong?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2745 Credit: 302,486,450 RAC: 75,419	Message 34265 - Posted: 4 Feb 2018, 9:35:51 UTC - in response to Message 34264.
> ... ANY idea what I'm doing wrong?
As far as I can see, only CMS tasks failed on your computers. The error log tells you what happened: "207 (0x000000CF) EXIT_NO_SUB_TASKS". This means that the project wasn't able to deliver a work package to run inside a fresh VM, which is mostly caused by a WMAgent outage. You may check the CMS message board to see if other users report the same problem or if the project team (Ivan) suggests that you stop pulling CMS tasks for a while.
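
The error line quoted in the post above pairs the decimal exit code with its hexadecimal form (207 = 0xCF). As a minimal sketch, assuming nothing beyond that one line, the snippet below rebuilds the same string. The function name and the one-entry lookup table are hypothetical; the table holds only the code named in this thread and is not the full list used by BOINC.

```python
# Hypothetical lookup table: it contains only the exit code quoted in this thread.
KNOWN_EXIT_CODES = {207: "EXIT_NO_SUB_TASKS"}


def describe_exit_code(code: int) -> str:
    """Format an exit code as 'decimal (0xHEX) NAME', like the task log line."""
    name = KNOWN_EXIT_CODES.get(code, "UNKNOWN")
    # 08X pads the hexadecimal value to eight upper-case digits.
    return f"{code} (0x{code:08X}) {name}"


print(describe_exit_code(207))  # 207 (0x000000CF) EXIT_NO_SUB_TASKS
```

A helper like this is only a convenience for matching log lines between hosts; the meaning of the code is the one given in the post.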

San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34586 - Posted: 12 Mar 2018, 8:33:58 UTC - in response to Message 34265.
Thanks for your answer. I appreciate it! But two WUs were aborted because of "no network connection" after almost 4 hours of computing! Why doesn't the WU wait? "exit init failure" is another one that doesn't really tell ME anything. Errors like that stop me from enjoying crunching time ... Have a nice day.

San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34840 - Posted: 1 Apr 2018, 18:21:25 UTC - in response to Message 34265.
> ... ANY idea what I'm doing wrong?
> As far as I can see only CMS tasks failed on your computers. The error log tells you what happened: "207 (0x000000CF) EXIT_NO_SUB_TASKS" This means that the project wasn't able to deliver a workpackage to run inside a fresh VM and is mostly caused by a WMAgent outage. You may check the CMS MB to see if other users report the same problem or if the project team (Ivan) suggests to stop pulling CMS tasks for a while.
OK, so I have now waited almost 2 months and tried a couple of CMS WUs, and after over 2 hours of elapsed time I get the following status message:
Postponed: VM job unmanageable, restarting later
At the same time LHCb, SixTrack and Theory are running nicely. Furthermore, no more WUs are being downloaded. There is enough RAM (64 GB), there is enough free disk space (>400 GB) and it is full moon outside ... So much for unattended running of the LHC project. As far as I understand, using VBOX is supposed to take all the problems out of crunching, since one doesn't have to adapt the programs to the cruncher's rig ??? Great idea. Happy Easter and the like ... I would appreciate further help/ideas ...

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 939 Credit: 781,711,560 RAC: 76,983	Message 34841 - Posted: 1 Apr 2018, 19:09:55 UTC
I sometimes get the "Postponed: VM job unmanageable, restarting later" message. It seems most common after upgrading VirtualBox. I just abort those tasks and future tasks are good. Your PC must be configured OK, as Theory and LHCb are running.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1552 Credit: 10,071,487 RAC: 603	Message 34842 - Posted: 1 Apr 2018, 19:11:02 UTC - in response to Message 34840.
> Postponed: VM job unmanageable, restarting later .. .. .. I would appreciate further help/ideas ...
If you do nothing the job will resume 86400 seconds (1 day) later. Restarting BOINC will try a resume immediately.
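
For an unattended rig it can help to know when a postponed task is due to come back. The sketch below simply adds the 86400 seconds quoted in the post above to the time the task was postponed. The function name and the example timestamp are hypothetical, and the fixed delay is taken from the post, not read from the BOINC client.

```python
from datetime import datetime, timedelta

# Delay quoted in the post above: 86400 seconds, i.e. one day.
POSTPONE_SECONDS = 86400


def expected_resume(postponed_at: datetime) -> datetime:
    """Return the time at which a postponed task would be retried."""
    return postponed_at + timedelta(seconds=POSTPONE_SECONDS)


# Hypothetical example: a task postponed at 18:00 on 1 Apr 2018.
print(expected_resume(datetime(2018, 4, 1, 18, 0)))  # 2018-04-02 18:00:00
```

Restarting BOINC, as suggested in the post, makes this wait unnecessary.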

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 34843 - Posted: 1 Apr 2018, 21:02:33 UTC - in response to Message 34840.
> There is enough RAM (64 GB), there is enough free disk space (>400 GB) and it is full moon outside ... So much for unattended running of the LHC project.
Make sure that the BOINC "Computing preferences" in the "Disk and memory" tab allow for the use of sufficient memory and disk space. Simply having sufficient installed capacity is not enough.

San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34844 - Posted: 2 Apr 2018, 6:54:59 UTC - in response to Message 34843.
> Make sure that the BOINC "Computing preferences" in the "Disk and memory" tab allow for the use of sufficient memory and disk space. Simply having sufficient installed capacity is not enough.
Thanks for the tip, but I am well aware of this. The following is set/checked under computing preferences:
DISK:
  leave at least 0.1 GB free
  use no more than 90% of total
MEMORY:
  when computer is in use, use at most 95%
  when computer is not in use, use at most 95%
  leave non-GPU tasks in memory ...
So this cannot be the problem.
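
To make the point about preferences versus installed capacity concrete, the sketch below turns a "use at most N%" memory setting into an absolute figure. The function name is hypothetical and the calculation is plain percentage arithmetic on the numbers given in the thread (64 GB, 95%); it does not model how the BOINC client measures or enforces memory use.

```python
def ram_limit_gb(installed_gb: float, max_used_pct: float) -> float:
    """Memory that tasks may use under a 'use at most N%' preference."""
    return installed_gb * max_used_pct / 100


# Figures from the thread: 64 GB installed, 95% allowed.
print(ram_limit_gb(64, 95))  # 60.8
```

With these settings, the preference leaves about 60.8 GB available to tasks, which supports the poster's conclusion that the memory limit is not the cause.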

San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34845 - Posted: 2 Apr 2018, 6:59:16 UTC - in response to Message 34842.
> If you do nothing the job will resume 86400 seconds (1 day) later. Restarting BOINC will try a resume immediately.
Just restarted BOINC, and like magic new WUs were downloaded and the postponed job is running again! Thanks for your response!

San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34862 - Posted: 3 Apr 2018, 5:27:49 UTC
I am getting the "Postponed: VM job unmanageable ..." message on three of my rigs. Since BOINC does not download WUs until the postponed WU is automatically restarted, I have stopped running all VBOX projects (Theory, CMS, LHCb) until this problem gets fixed. I made this decision because I am now also getting the following message: "Postponed: Waiting to acquire slot directory lock. Another instance may be running". As I am running my rigs more or less unattended, I miss out on crunching time, because new WUs are not downloaded when other ones (SixTrack) have finished. When I have time I will update VBox from 5.2.6 to 5.2.8; maybe the problem has been fixed. Have a nice day ...

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,774,607 RAC: 40,183	Message 34865 - Posted: 3 Apr 2018, 11:30:44 UTC - in response to Message 34862.
> "Postponed: Waiting to acquire slot directory lock. Another instance may be running"
I had the same problem a few weeks ago. I opened the Oracle VirtualBox Manager and noticed that some VM jobs had hung. They showed up in the VirtualBox Manager in addition to the ones that were listed in the BOINC Manager. So I deleted these "dead" jobs, and that was it. Everything worked fine again (I don't remember, though, whether I closed down BOINC and restarted it; maybe this must be done).
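
The check described in the post above (VMs registered in VirtualBox that BOINC no longer knows about) can also be done from the command line. The sketch below assumes that `VBoxManage list vms` prints one VM per line in the form `"name" {uuid}`. The helper names, the sample VM names and the sample UUIDs are hypothetical, and the set of VM names that BOINC still expects has to be supplied by hand.

```python
import re
import subprocess

# One line of `VBoxManage list vms` output looks like: "name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')


def parse_vm_list(output: str) -> dict:
    """Map VM name to UUID for every well-formed line of the listing."""
    vms = {}
    for line in output.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            vms[match.group("name")] = match.group("uuid")
    return vms


def list_registered_vms() -> dict:
    """Run `VBoxManage list vms` (must be on PATH) and parse its output."""
    result = subprocess.run(
        ["VBoxManage", "list", "vms"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_list(result.stdout)


def leftover_vms(registered: dict, expected_names: set) -> list:
    """Names registered in VirtualBox but not in the expected set."""
    return sorted(set(registered) - set(expected_names))


# Hypothetical sample output, used instead of a live VirtualBox call.
sample = (
    '"boinc_example_1" {11111111-2222-3333-4444-555555555555}\n'
    '"boinc_example_2" {66666666-7777-8888-9999-000000000000}\n'
)
print(leftover_vms(parse_vm_list(sample), {"boinc_example_1"}))
```

On a real host, `list_registered_vms()` would replace the sample string. The sketch only reports candidates; deleting a VM remains a manual decision, as in the post.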

San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34871 - Posted: 4 Apr 2018, 13:06:07 UTC - in response to Message 34865.
Well, in my case the number of jobs (WUs) shown in VBOX is the same as in BOINC. Some are powered off, some are saved, depending. Waiting a day or so (or restarting BOINC) seems to solve the problem/s. The bad part is that during the waiting period of one day (if unattended), BOINC will not download and start any other LHC WU (i.e. SixTrack, which is non-VBOX)! The WUs finish OK and without error. Also, I do not like the idea of having to "monitor" the LHC WUs. Maybe I'm a bit picky, but it is not my job to solve these things. I wonder why no other crunchers are having the same "troubles"; I'm not doing anything exotic. Thanks for your suggestions.

San-Fernando-Valley Joined: 26 Mar 16 Posts: 30 Credit: 1,258,609 RAC: 0	Message 34875 - Posted: 5 Apr 2018, 10:10:10 UTC
For those who have the same issues: reducing the % of CPU cores used solves the problem. In my case, from 100 to 75%! Which is no fun, since the rigs aren't running at full power ...
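
To show what the change from 100% to 75% means in whole cores, the sketch below converts the percentage into a core count. The function name is hypothetical. It assumes the result is rounded down and never drops below one core; that rounding rule is an assumption and may differ from what the BOINC client actually does.

```python
def usable_cores(total_cores: int, max_cpus_pct: float) -> int:
    """Whole cores available at a given 'use at most N% of the CPUs' setting."""
    # Assumption: fractions are rounded down, with a floor of one core.
    return max(1, int(total_cores * max_cpus_pct / 100))


# Hypothetical 4-core host at the two settings mentioned in the post.
for pct in (100, 75):
    print(pct, usable_cores(4, pct))
```

On such a host, 75% leaves one core free for the operating system and the VirtualBox processes.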

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2745 Credit: 302,486,450 RAC: 75,419	Message 34876 - Posted: 5 Apr 2018, 10:28:14 UTC - in response to Message 34875.
I believe they run at 100%, but maybe not at 100% CPU. It may be confusing, but the CPU load alone may be the wrong value to look at.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 939 Credit: 781,711,560 RAC: 76,983	Message 34954 - Posted: 11 Apr 2018, 6:41:43 UTC
I have fewer errors with the 5.1.x branch at high utilisation. There are some errors after switching, but they go away after a new batch of WUs.