Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days)

San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34264 - Posted: 4 Feb 2018, 9:04:39 UTC

I went through your checklist. In answer to point ...

1) I'm using BOINC x64 client (Manager) 7.8.3 for Win7 and Win10 (on all rigs),

2) I'm using VirtualBox 5.2.6 (Win7 + 10) on all but one rig,
I'm also using VirtualBox 5.2.2 (Win7) on one other rig,
I don't use Hyper-V (?) or Docker,

3) correct ExtensionPack is installed (of no relevance here),

4) VT-X is and has been on,

5) command in client_state.xml shows the number as 0 (zero),

6) RAM = 64GB on each,
plenty disk space (>250GB each rig),

7) In- and Out-communications are OK,

8) AVIRA anti-virus program poses no problem

9) and 10) I'm not running ATLAS (too many problems in the past),
the errors show up in CMS, LHCb and Theory Simulation.

I'm not running any other project besides LHC (at the same time).
The cpu-times vary from 97 to 13,999 secs.
I'm not overclocking - everything at stock ...
Using 4 cores.

ANY idea what I'm doing wrong?
ID: 34264
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2401
Credit: 225,425,468
RAC: 123,606
Message 34265 - Posted: 4 Feb 2018, 9:35:51 UTC - in response to Message 34264.  

... ANY idea what I'm doing wrong?

As far as I can see only CMS tasks failed on your computers.
The error log tells you what happened:
"207 (0x000000CF) EXIT_NO_SUB_TASKS"

This means that the project wasn't able to deliver a work package to run inside a fresh VM; this is usually caused by a WMAgent outage.
You may check the CMS MB to see if other users report the same problem or if the project team (Ivan) suggests that users stop pulling CMS tasks for a while.
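
For anyone who wants to pull that status out of a task's error log automatically, here is a minimal Python sketch. It assumes only that the log contains a line in the same "decimal (hex) NAME" shape as the one quoted above; the function name `parse_exit_status` is made up for this illustration and is not part of BOINC.

```python
import re

# Matches an exit-status line shaped like: 207 (0x000000CF) EXIT_NO_SUB_TASKS
EXIT_LINE = re.compile(
    r"(?P<dec>-?\d+)\s+\((?P<hex>0x[0-9A-Fa-f]+)\)\s+(?P<name>[A-Z_]+)"
)

def parse_exit_status(text):
    """Return (code, name) for the first exit-status line in text, or None."""
    match = EXIT_LINE.search(text)
    if match is None:
        return None
    code = int(match.group("dec"))
    # For non-negative codes the decimal and hex renderings should agree.
    if code >= 0 and code != int(match.group("hex"), 16):
        raise ValueError("decimal and hex exit codes disagree")
    return code, match.group("name")

sample = '"207 (0x000000CF) EXIT_NO_SUB_TASKS"'
print(parse_exit_status(sample))  # (207, 'EXIT_NO_SUB_TASKS')
```

The hex check is only a sanity test: 0xCF is 207, so both renderings in the quoted line name the same code.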
ID: 34265
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34586 - Posted: 12 Mar 2018, 8:33:58 UTC - in response to Message 34265.  

Thanks for your answer. I appreciate it!

But two WUs were aborted because of "no network connection" -- after almost 4 hours of computing!
Why doesn't the WU wait?

"exit init failure" is another one that doesn't really tell ME anything.

Errors like that stop me from enjoying crunching time ...

Have a nice day.
ID: 34586
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34840 - Posted: 1 Apr 2018, 18:21:25 UTC - in response to Message 34265.  

... ANY idea what I'm doing wrong?

As far as I can see only CMS tasks failed on your computers.
The error log tells you what happened:
"207 (0x000000CF) EXIT_NO_SUB_TASKS"

This means that the project wasn't able to deliver a work package to run inside a fresh VM; this is usually caused by a WMAgent outage.
You may check the CMS MB to see if other users report the same problem or if the project team (Ivan) suggests that users stop pulling CMS tasks for a while.


OK - so I have now waited almost 2 months and tried a couple of CMS WUs, and now, after over 2 hours of elapsed time, I get the following status message:

Postponed: VM job unmanageable, restarting later

At the same time LHCb, SixTrack and Theory are running nicely.
Furthermore, no more WUs are being downloaded.

There is enough Ram (64 GB), there is enough free disk space (>400GB) and it is full moon outside ...
So much for unattended running of the LHC project.

As far as I understand, using VBOX takes all the problems out of crunching, since one doesn't have to adapt the programs to the cruncher's rig???
Great idea.

Happy Easter and the likes ...

I would appreciate further help/ideas ...
ID: 34840
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 803
Credit: 649,966,497
RAC: 240,536
Message 34841 - Posted: 1 Apr 2018, 19:09:55 UTC

I sometimes get the "Postponed: VM job unmanageable, restarting later" message. It seems most common after upgrading VirtualBox; I just abort those tasks and future tasks are good.

Your PC must be configured OK as Theory and LHCb are running.
ID: 34841
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 34842 - Posted: 1 Apr 2018, 19:11:02 UTC - in response to Message 34840.  

Postponed: VM job unmanageable, restarting later
..
..
..
I would appreciate further help/ideas ...

If you do nothing the job will resume 86400 seconds (1 day) later.
Restarting BOINC will try a resume immediately.
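
To see when a postponed task would come back on an unattended rig, the 86400-second delay can be added to the time it was postponed. This is only a sketch: the postponement timestamp below is a made-up example, and `resume_time` is a hypothetical helper, not a BOINC function.

```python
from datetime import datetime, timedelta, timezone

POSTPONE_SECONDS = 86400  # delay quoted above (1 day)

def resume_time(postponed_at):
    """Return when a postponed task would be retried if BOINC is not restarted."""
    return postponed_at + timedelta(seconds=POSTPONE_SECONDS)

# Hypothetical postponement time, chosen only for illustration
postponed_at = datetime(2018, 4, 1, 18, 0, 0, tzinfo=timezone.utc)
print(resume_time(postponed_at).isoformat())  # 2018-04-02T18:00:00+00:00
```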
ID: 34842
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 34843 - Posted: 1 Apr 2018, 21:02:33 UTC - in response to Message 34840.  

There is enough Ram (64 GB), there is enough free disk space (>400GB) and it is full moon outside ...
So much for unattended running of the LHC project.

Make sure that the BOINC "Computing preferences" in the "Disk and memory" tab allow for the use of sufficient memory and disk space. Simply having sufficient installed capacity is not enough.
ID: 34843
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34844 - Posted: 2 Apr 2018, 6:54:59 UTC - in response to Message 34843.  


Make sure that the BOINC "Computing preferences" in the "Disk and memory" tab allow for the use of sufficient memory and disk space. Simply having sufficient installed capacity is not enough.


Thanks for the tip,
but I am well aware of this.

Following is set/checked under computer preferences:
DISK:
leave at least 0.1 GB free
use no more than 90% of total

MEMORY:
when computer is in use, use at most 95%
when computer is not in use, use at most 95%
leave non GPU tasks in memory ...

So this cannot be the problem.
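
As a rough cross-check of those settings, the arithmetic can be sketched in Python. The 64 GB of RAM, the 95% and 90% limits and the 0.1 GB floor come from this thread; the 500 GB total disk size is a made-up figure, and the disk formula is a simplification that ignores the space BOINC already occupies, so the real client may compute a somewhat different number.

```python
def memory_limit_gb(installed_gb, max_pct):
    """RAM that the percentage preference would let BOINC use."""
    return installed_gb * max_pct / 100

def disk_allowance_gb(total_gb, free_gb, min_free_gb, max_pct):
    """Simplified disk allowance: the tighter of the two disk preferences."""
    by_percentage = total_gb * max_pct / 100
    by_free_space = free_gb - min_free_gb
    return max(0.0, min(by_percentage, by_free_space))

# 64 GB RAM and a 95% limit, as stated in this thread
print(memory_limit_gb(64, 95))  # 60.8

# Total disk size (500 GB) is hypothetical; 400 GB free is the lower bound given earlier
print(disk_allowance_gb(total_gb=500, free_gb=400, min_free_gb=0.1, max_pct=90))
```

Under these assumptions both limits are far above what a handful of VM tasks would need, which is consistent with the conclusion that the preferences are not the bottleneck.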
ID: 34844
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34845 - Posted: 2 Apr 2018, 6:59:16 UTC - in response to Message 34842.  


If you do nothing the job will resume 86400 seconds (1 day) later.
Restarting BOINC will try a resume immediately.


Just restarted BOINC - and like magic new WUs were downloaded and the postponed job is running again!

Thanks for your response!
ID: 34845
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34862 - Posted: 3 Apr 2018, 5:27:49 UTC

I am getting the "Postponed: VM job unmanageable ..." message on three of my rigs.
Since BOINC does not download WUs until the postponed WU is automatically restarted, I have stopped running all VBOX projects (Theory, CMS, LHCb) until this problem gets fixed.

I made this decision because I am also now getting the following message:

"Postponed: Waiting to acquire slot directory lock. Another instance may be running"

As I am running my rigs more or less unattended, I miss out on crunching time, because new WUs are not downloaded when other ones (SixTrack) have finished.

When I have time I will update VBox from 5.2.6 to 5.2.8 - maybe the problem has been fixed.

Have a nice day ...
ID: 34862
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 103,016,226
RAC: 126,246
Message 34865 - Posted: 3 Apr 2018, 11:30:44 UTC - in response to Message 34862.  

"Postponed: Waiting to acquire slot directory lock. Another instance may be running"
I had the same problem a few weeks ago. So I opened the Oracle VirtualBox Manager and noticed that some VM jobs had hung. They showed up in the VirtualBox Manager in addition to the ones that were listed in the BOINC Manager.
So what I did was delete these "dead" jobs, and that was it. Everything worked fine again (I don't remember, though, whether I closed down BOINC and restarted it; maybe this must be done).
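
The same comparison can be done from a script instead of by eye. The sketch below parses the output of `VBoxManage list vms` (one `"name" {uuid}` pair per line) and reports VMs that are registered in VirtualBox but not in a set of names you know BOINC is working on. The VM names in the sample are invented, and how you obtain the set of names known to BOINC is left open; the helper names are hypothetical.

```python
import re
import subprocess

# One line of `VBoxManage list vms` looks like: "name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)"\s+\{(?P<uuid>[0-9a-fA-F-]+)\}\s*$')

def parse_vm_list(output):
    """Parse `VBoxManage list vms` output into a {name: uuid} dict."""
    vms = {}
    for line in output.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            vms[match.group("name")] = match.group("uuid")
    return vms

def registered_vms():
    """Ask VirtualBox for all registered VMs (needs VBoxManage on PATH)."""
    result = subprocess.run(
        ["VBoxManage", "list", "vms"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_list(result.stdout)

def leftover_vms(registered, known_to_boinc):
    """Names registered in VirtualBox that BOINC does not list."""
    return sorted(name for name in registered if name not in known_to_boinc)

# Invented sample output; real names and UUIDs will differ
sample = '''"boinc_aaaa" {11111111-2222-3333-4444-555555555555}
"boinc_bbbb" {66666666-7777-8888-9999-000000000000}'''
print(leftover_vms(parse_vm_list(sample), {"boinc_aaaa"}))  # ['boinc_bbbb']
```

On a real machine, `registered_vms()` would replace the sample string. The script only reports candidates; whether to delete a leftover VM is still a manual decision.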
ID: 34865
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34871 - Posted: 4 Apr 2018, 13:06:07 UTC - in response to Message 34865.  

Well, in my case here the number of jobs (WUs) shown in VBOX is the same as in BOINC.
Some are powered off - some are saved - depending

Waiting a day or so (or restarting BOINC) seems to solve the problem/s.
The bad part is that during the waiting period of one day (if unattended), BOINC will not download and start any other LHC WU (i.e. SixTrack, which is non-VBOX)!

The WUs finish OK and without error.

Also, I do not like the idea of having to "monitor" the LHC WUs - maybe I'm a bit picky, but it is not my job to solve these things.

I'm wondering why no other crunchers are having the same "troubles" -- I'm not doing anything exotic.

Thanks for your suggestions.
ID: 34871
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 34875 - Posted: 5 Apr 2018, 10:10:10 UTC

For those that have the same issues:

Reducing the % of CPU cores used solves the problem. In my case, from 100% to 75%!

Which is no fun, since the rigs aren't running at full power ...
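
To put a number on that setting: with the 4 cores mentioned earlier in this thread, a 75% limit works out to 3 usable cores. The sketch below assumes the client rounds the product down and never goes below one core; that rounding rule is an assumption here, and `usable_cpus` is a hypothetical helper, not a BOINC function.

```python
def usable_cpus(total_cpus, max_pct):
    """Cores available to BOINC under a 'use at most N% of the CPUs' limit.

    Assumes rounding down with a minimum of one core.
    """
    return max(int(total_cpus * max_pct / 100), 1)

for pct in (100, 75):
    print(pct, usable_cpus(4, pct))
# 100 4
# 75 3
```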
ID: 34875
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2401
Credit: 225,425,468
RAC: 123,606
Message 34876 - Posted: 5 Apr 2018, 10:28:14 UTC - in response to Message 34875.  

I believe they run at 100%, but maybe not at 100% CPU.
It may be confusing, but CPU load alone may be the wrong value to look at.
ID: 34876
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 803
Credit: 649,966,497
RAC: 240,536
Message 34954 - Posted: 11 Apr 2018, 6:41:43 UTC

I get fewer errors with the 5.1.x branch at high utilisation. There are some errors after switching, but they go away after a new batch of WUs.
ID: 34954



©2024 CERN