1) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35733)
Posted 2 Jul 2018 by San-Fernando-Valley
Post:
Joker
2) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35728)
Posted 1 Jul 2018 by San-Fernando-Valley
Post:

The best we can do as crunchers is accept that reality and work with it patiently. If you can't find the patience then you need to decide whether or not crunching is for you.

... OS as unstable and poorly designed as Windoze you're just asking for tons of trouble. I would consider formatting the drive and reinstalling everything from scratch and this time going with a real OS instead of Windoze. Then learn how to walk before you run. Turning on ALL the applications at this project is likely a mistake. Setting "unlimited cores" would be another mistake.



This is the type of answer we all love !

Seems like you have been personally insulted ...

But, just try to have a nice day !
Over and out.
3) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35727)
Posted 1 Jul 2018 by San-Fernando-Valley
Post:
On the Top500 list IBM has reached the top with its Summit computer, which makes large use of nVidia Tesla GPU boards at 200 petaflops. The second is a Chinese supercomputer which was first last time, the third is another American computer, Sierra. The Piz Daint Swiss computer which was third is now sixth, the best Italian is thirteenth. Things change rapidly in the supercomputer world.
Tullio

Thanks - but this has nothing to do with this thread ...
Over and out.
4) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35726)
Posted 1 Jul 2018 by San-Fernando-Valley
Post:

But what used to work is not working so well now, as everything is falling apart in various degrees. I think the fact of life is that LHC is the most advanced physics project in the world, and it has the most complicated computer and network structure as a necessary part of it. Furthermore, it was not developed for the home users running BOINC, as are most BOINC projects, but was developed for the advanced computing capabilities of similar large institutions (Fermilab, etc.) around the world. They probably don't use VBox at all. So we are sort of an afterthought. It is not that they don't appreciate our efforts, but we are the tail of the dog, not the head of it.


NOW that is an acceptable answer !
Thank you for explaining the real thing !

Why does LHC bother with us at all ?

One should not confuse the complexity of physics (-projects) with the simplicity of VirtualBox.

Have a nice Sunday.
5) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35717)
Posted 30 Jun 2018 by San-Fernando-Valley
Post:
ATLAS now has an elapsed time of almost ten hours - taking so far the last three hours for 10 seconds of remaining run time !
At the same time theory WU is predicting 1 to 2 days (!) of remaining time.
LHCb has upped the estimated time to around 20 hours !


Don't rely on the times.
They are more or less fake as they are based on fixed input parameters.
ATLAS: has longer or shorter batches
Theory, LHCb (,CMS): designed to run 12h+ but time calculation is based on the watchdog limit of 18h

Theory special: Today a new app version has been introduced. It needs a couple of days until your BOINC client corrects the times.



Some stderr.txt files are now available.
Most of them show that your hosts are much too busy (for whatever reason).
Thus your BOINC client, VirtualBox, vboxwrapper and VMs run into timing/priority problems.


There are lots of blank lines in your logs plus messages like this:
2018-06-28 20:54:21 (2308): Powering off VM.
2018-06-28 20:59:22 (2308): VM did not power off when requested.
2018-06-28 20:59:22 (2308): VM was successfully terminated.


2018-06-28 21:13:37 (5104): Powering off VM.
2018-06-28 21:18:38 (5104): VM did not power off when requested.
2018-06-28 21:18:38 (5104): VM was NOT successfully terminated.


2018-06-28 11:19:33 (3948): ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time.
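For anyone who wants to check their own stderr.txt for the same messages, here is a minimal sketch in Python. It only counts the three message texts quoted above; the category labels and the function name `scan_stderr` are made up for illustration and are not part of vboxwrapper or BOINC.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpts quoted above.
# The category labels are illustrative names, not vboxwrapper terminology.
PATTERNS = {
    "poweroff_timeout": "VM did not power off when requested",
    "terminate_failed": "VM was NOT successfully terminated",
    "lost_communication": "Vboxwrapper lost communication with VirtualBox",
}

# Matches lines of the form "2018-06-28 20:54:21 (2308): message".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")


def scan_stderr(text):
    """Return a Counter of problem categories found in stderr.txt content."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # blank lines and unrelated output are skipped
        message = match.group(3)
        for label, fragment in PATTERNS.items():
            if fragment in message:
                counts[label] += 1
    return counts


SAMPLE = """\
2018-06-28 20:54:21 (2308): Powering off VM.
2018-06-28 20:59:22 (2308): VM did not power off when requested.
2018-06-28 20:59:22 (2308): VM was successfully terminated.
2018-06-28 21:13:37 (5104): Powering off VM.
2018-06-28 21:18:38 (5104): VM did not power off when requested.
2018-06-28 21:18:38 (5104): VM was NOT successfully terminated.
2018-06-28 11:19:33 (3948): ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time.
"""

if __name__ == "__main__":
    for label, count in scan_stderr(SAMPLE).items():
        print(label, count)
```

To use it on a real log, read the file content into a string and pass it to `scan_stderr`. A high count of power-off timeouts would be consistent with the "host too busy" explanation, but the script itself does not prove the cause.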




ATLAS 1-core may run with only 3500 MB but to be on the safe side you may configure 4800 MB via an app_config.xml:
2018-06-28 08:35:04 (3904): Setting Memory Size for VM. (3500MB)
2018-06-28 08:35:05 (3904): Setting CPU Count for VM. (1)
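A rough sketch of what such an app_config.xml could look like, generated with a few lines of Python (3.9 or later, for `ET.indent`). The app name "ATLAS" and the plan class "vbox64_mt_mcore_atlas" are assumptions and should be checked against the entries in your own client_state.xml. The `--memory_size_mb` option is, as far as I know, how vboxwrapper is told the VM memory size, but treat that as an assumption too.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, memory_mb, ncpus=1):
    """Build the text of an app_config.xml with one <app_version> entry."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Passed to vboxwrapper on its command line (assumed option name).
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    ET.indent(root)  # one element per line, available from Python 3.9
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # App name and plan class are assumptions; verify them in client_state.xml.
    print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas", 4800))
```

The resulting file would go into the LHC project folder inside the BOINC data directory, followed by "Read config files" in the BOINC Manager or a client restart.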



Your BOINC client was terminated before it could save all VMs and other relevant files.
How many VMs (all together) do you run concurrently? Seems to be too many.
Same problem occurs when you restart too many VMs concurrently.
This is what Erich56 already mentioned.
09:57:41 (3984): BOINC client no longer exists - exiting
09:57:41 (3984): timer handler: client dead, exiting
09:57:52 (3984): BOINC client no longer exists - exiting
09:57:52 (3984): timer handler: client dead, exiting
09:58:03 (3984): BOINC client no longer exists - exiting
09:58:03 (3984): timer handler: client dead, exiting
09:58:04 (4896): Can't acquire lockfile (32) - waiting 35s
09:58:13 (3984): BOINC client no longer exists - exiting
09:58:13 (3984): timer handler: client dead, exiting
09:58:23 (3984): BOINC client no longer exists - exiting
09:58:23 (3984): timer handler: client dead, exiting
09:58:33 (3984): BOINC client no longer exists - exiting
09:58:33 (3984): timer handler: client dead, exiting
09:58:39 (4896): Can't acquire lockfile (32) - exiting
09:58:39 (4896): Error: The process cannot access the file because it is being used by another process.



Thanks for the very comprehensive info.
This information exceeds my horizon - partially !

I'll try to respond:

TIMES -- Ok, the times don't really bother me - except that I have troubles planning my "structured day",

TERMINATION -- YES, I got very impatient at the end of re-re-re-trying that situation and just "cut the line" forcibly,

VM - There is only ONE VB (VirtualBox) running per rig. In this one VB I see the four WUs and their status.

SIDE NOTE: Shouldn't these WUs run WITHOUT me/us having to fiddle around with apps etc.? - My 2 cents of griping.
6) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35716)
Posted 30 Jun 2018 by San-Fernando-Valley
Post:
The problem with the lockfile acquisition is indeed indicative of a crash or of not giving the VM enough time to shut down properly before shutting down the OS. BOINC will not fix that problem for you. You must find out which slot dir that lockfile is in, then delete it manually from an admin account, though I doubt that action alone will cure all your problems. Sounds like you need to remove the LHC project and then re-add it.
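A minimal sketch of how one could locate the slot directories that still hold a lockfile. The lockfile name "boinc_lockfile" and the Windows data directory path are assumptions; check them on your own installation. The script only lists directories and deletes nothing, and it should be run only while the BOINC client is stopped, since a running task legitimately holds its lockfile.

```python
from pathlib import Path


def find_slot_lockfiles(slots_dir, lockfile_name="boinc_lockfile"):
    """Return the slot directories under slots_dir that contain a lockfile."""
    slots = Path(slots_dir)
    if not slots.is_dir():
        return []
    return sorted(
        entry for entry in slots.iterdir()
        if entry.is_dir() and (entry / lockfile_name).is_file()
    )


if __name__ == "__main__":
    # Assumed default data directory on Windows; adjust to your installation.
    for slot in find_slot_lockfiles(r"C:\ProgramData\BOINC\slots"):
        print(slot)
```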


Thanks for your time answering me.
But I would like to point out, that I am not experiencing/responsible for any crashes - matter of fact: what is a crash - what do you mean by that?

Let me explain my situation again from the beginning:

1. Started my rigs (three Win7 - one Win10) - all is fine !
2. Started BOINC 7.10.2 - OK !
3. Checked VirtualBox 5.2.12 if anything was left behind - nothing there (had not run VB for some time) !
4. Requested Tasks for LHC (all subprojects checked in LHC prefs) in BOINC - no other projects running !
5. RAM 64GB !
6. SSDs fast and large enough (NVMe Samsung 500GB - more or less empty) !
7. BOINC Options fully "open" - no restrictions !
8. Hyperthreading off - so I have 4 cores !
9. No overclocking !
10. The request for tasks for LHC downloaded per core one LHC (mixed subprojects - but always one ATLAS included) !
11. After a longer while (10 minutes ? - don't remember) ATLAS gets the "postponed" message - for no reason whatsoever !
12. The other (mixed) three keep on running fine !
13. Checking VB shows ATLAS powered off !
14. As time goes by, the "remaining exc. time" keeps on going up - way up !
15. So now I have the situation, that one fourth of each rig is doing nothing - ATLAS blocking one core - nice !
16. Suspended LHC !
17. Waited for VB to shutoff its machines correctly - takes quite long !
18. THEN I stopped BOINC !
19. Waited a while - then LOGOFF for rig user !
20. RESTART for the rig !
21. Went on with point 1. above !
22. In BOINC ticked RESUME for LHC !
23. ALL four WUs start - even the postponed ATLAS WU !
24. After a while (see above point 11.) ATLAS again goes into "postponed ..." !
25. Retried the whole procedure again - same results - EXCEPT this time other non-ATLAS WUs show the other message (slots or something ...), while ATLAS runs ok !
26. Furthermore, the "remaining estimated run time" for all WUs goes up and up (on all rigs) - extremely fast - to 1 day or 2 days and more ... !

Hope my above list is somewhat complete and informative.

Any further hints/tips/comments?
Would appreciate it !!
7) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35686)
Posted 28 Jun 2018 by San-Fernando-Valley
Post:
there are 2 thoughts on this:
...
So, my guess would be that either one of the above points, or both, are the reason for your problems.


You might have a point there - BUT I had to shutdown the computer BECAUSE of the "postponed ..." message (trying to revive the ATLAS WU).
So this "action" of mine probably is the reason for the problems that followed afterwards.

So what is the remedy for the initial problem?
Just don't crunch ATLAS.
Same remedy for CMS.

Thanks for your time - I appreciate it.

I will ABORT the WUs and keep on crunching other projects.

Have a nice day.
8) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35683)
Posted 28 Jun 2018 by San-Fernando-Valley
Post:
The idea is not to have to think/analyse too much ...
Sometimes looking/searching for problems is challenging - sometimes even fun -

I now have a new interesting situation:

ATLAS now has an elapsed time of almost ten hours - taking so far the last three hours for 10 seconds of remaining run time !

At the same time theory WU is predicting 1 to 2 days (!) of remaining time.

LHCb has upped the estimated time to around 20 hours !

This is the state on all four concerned rigs. They are using between 5 and 9 GB RAM.

CPU load varies nicely - so something is being crunched.

I will wait another hour and then flip a coin (the one with both sides the same) and ABORT all.

Maybe someone has some last tip or hint?
9) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35681)
Posted 28 Jun 2018 by San-Fernando-Valley
Post:
There is, unfortunately, not yet a recent stderr.txt from your WUs that can be analysed.

The older logs as well as the error messages
1. "postponed: VM job unmanageable ..."
2. "Postponed: Waiting to acquire slot directory lock. Another instance may be running"
point out a local problem rather than a general problem.

(1.) could be caused by the RAM setting you configured in your BOINC client.
(2.) could be caused by remains of older crashes.

Did you recently
- restart your hosts
- reinstall BOINC
- clean your VBox environment
- reset the project?

How many cores do you use for your ATLAS WUs?
How much RAM is configured?


Thanks for your reply:

RAM settings in BOINC are: no more than 90% of total (the rig has 64GB)
No crashes ... as I said no "dead" entries in VirtualBox.
Of course I restarted the host - they don't run 24/7/365 ...
No, I haven't reinstalled BOINC - why should I ?
No, I did not reset the project.

I use one core per WU (ATLAS, Theory, LHCb) - in other words I'm playing it safe!

I would like to point out that I have/had other projects running nicely in the past years - some of them with up to 16GB RAM usage AND using hyperthreading (2x4 cores) ... (turned off now).
10) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35680)
Posted 28 Jun 2018 by San-Fernando-Valley
Post:
hello, San Fernando

while it's not clear to me either what is causing your problems, one of your lines caught my eye:

-- fast SSDs (NVMe Samsung 500GB)

Are you really sure you want to crunch LHC VM tasks with an SSD? ATLAS in particular writes tons of data to the disk.
When I started crunching ATLAS with my new PC, which was equipped with an SSD, I quickly figured out that 4 (or was it even only 3?) concurrently running ATLAS tasks were writing up to 200GB of data per day (!). So, it was clear to me that the TBW value of the SSD would be reached within a year, if not earlier (although meanwhile one could read in various forums that some people's SSDs have reached many times the indicated TBW).

So, once your ATLAS tasks run well again, you might consider running VM crunching on a separate HDD.
Just my advice - whatever it's worth :-)
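The arithmetic behind the TBW concern can be sketched in a few lines of Python. The 200 GB per day figure is taken from the post above; the function names are made up for illustration, and no real drive's TBW rating is assumed (look up the value for your own model).

```python
def yearly_writes_tb(written_gb_per_day):
    """Terabytes written per year at a constant daily write volume."""
    return written_gb_per_day * 365 / 1000


def days_until_tbw(tbw_terabytes, written_gb_per_day):
    """Days of writing at the given daily rate until the rated TBW is reached."""
    if written_gb_per_day <= 0:
        raise ValueError("written_gb_per_day must be positive")
    return tbw_terabytes * 1000 / written_gb_per_day


if __name__ == "__main__":
    print(yearly_writes_tb(200))      # TB written per year at 200 GB/day
    print(days_until_tbw(73, 200))    # days until a 73 TB rating is reached
```

At 200 GB per day this comes to 73 TB per year, so a drive rated below roughly 73 TBW would reach its rating within a year, while a higher rating lasts proportionally longer.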


Thanks for the advice - but I am absolutely not concerned about the SSDs ...

I'm concerned about the behavior of the various WUs from LHC ...
11) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35679)
Posted 28 Jun 2018 by San-Fernando-Valley
Post:
I have this too,
and think it is a RAM problem.
ATLAS tasks are dynamically growing to use more RAM.
When there is no more RAM available in the PC, then they get postponed...
Any better answer is welcome for us volunteers.


please read my text carefully - ALL MY CONCERNED RIGS HAVE 64GB RAM !!

Taskmanager shows, depending, between 5 and 9 GB used !

So it must be something else ..
12) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 35659)
Posted 28 Jun 2018 by San-Fernando-Valley
Post:
I wonder why the LHC-team doesn't respond to these "problems".
I know they are busy, but so are we spending time on these issues (for months).

I am also receiving the message "Postponed: VM job unmanageable, restarting later ..." !

Suspending the project, then stopping BOINC and restarting the rig solves the problem - BUT only for a short time.

While trying this method to recover, I am receiving message "waiting for slot ..." (I don't remember the complete text).

I noticed that of the four WUs running per rig, one of the four gets postponed while the three others run happily.
After restarting the rig (shutdown and restart) it is the other way around: three of the four wait for slots (?) and the one previously postponed runs nicely.
For a while - then the whole process repeats itself ...

ATLAS seems to be the troublemaker - it is the one that always becomes postponed first after all four WUs were running for a couple of minutes.
AND, please, don't suggest to just not run ATLAS !!

Just for the records:
-- hyperthreading is turned off,
-- no overclocking,
-- plenty ram (64GB),
-- fast SSDs (NVMe Samsung 500GB),
-- no overheating,
-- fast GPU (here of no concern),
-- CPU load varies - from around 50 to 100%,
-- no other projects are running,
-- rigs are just crunching - doing nothing else,
-- no dead/hungup "machines" in VirtualBox 5.2.12
and lots of sunshine outside with blue skies.

I wonder if the LHC-team (L.ets H.appily C.runch) really realises that we volunteers are "donating" time and money trying/wanting to help THEM ?

Thanks for reading up to here - so have a nice day ...
13) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34875)
Posted 5 Apr 2018 by San-Fernando-Valley
Post:
For those that have the same issues:

Reducing the % of CPU cores used solves the problem. In my case from 100 to 75% !

Which is no fun, since the rigs aren't running at full power ...
14) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34871)
Posted 4 Apr 2018 by San-Fernando-Valley
Post:
Well, in my case here the number of jobs (WUs) shown in VBOX is the same as in BOINC.
Some are powered off - some are saved - depending

Waiting a day or so (or restarting BOINC) seems to solve the problem/s.
The bad part is that during the waiting period of one day (if unattended), BOINC will not download and start any other LHC WU (i.e. SixTrack, which is non-VBOX)!

The WUs finish OK and without error.

Also, I do not like the idea of having to "monitor" the LHC WUs - maybe I'm a bit picky, but it is not my job to solve these things.

I'm wondering why no other crunchers are having the same "troubles" -- I'm not doing anything exotic.

Thanks for your suggestions.
15) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34862)
Posted 3 Apr 2018 by San-Fernando-Valley
Post:
I am getting the "postponed: VM job unmanageable ..." message on three of my rigs.
Since BOINC does not download WUs till the postponed WU is automatically restarted, I have stopped executing all VBOX projects (theory, cms, lhcbs) till this problem gets fixed.

I made this decision because I am also now getting the following message:

"Postponed: Waiting to acquire slot directory lock. Another instance may be running"

As I am running my rigs more or less unattended, I miss out on crunching time, because new WUs are not downloaded when other ones (SixTrack) have finished.

When I have time I will update VBox from 5.2.6 to 5.2.8 - maybe the problem has been fixed.

Have a nice day ...
16) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34845)
Posted 2 Apr 2018 by San-Fernando-Valley
Post:

If you do nothing the job will resume 86400 seconds (1 day) later.
Restarting BOINC will try a resume immediately.


Just restarted BOINC - and like magic new WUs were downloaded and the postponed job is running again!

Thanks for your response!
17) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34844)
Posted 2 Apr 2018 by San-Fernando-Valley
Post:

Make sure that the BOINC "Computing preferences" in the "Disk and memory" tab allow for the use of sufficient memory and disk space. Simply having sufficient installed capacity is not enough.


Thanks for the tip,
but I am well aware of this.

Following is set/checked under computer preferences:
DISK:
leave at least 0.1 GB free
use no more than 90% of total

MEMORY:
when computer is in use, use at most 95%
when computer is not in use, use at most 95%
leave non GPU tasks in memory ...

So this can not be the problem.
18) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34840)
Posted 1 Apr 2018 by San-Fernando-Valley
Post:
... ANY idea what I'm doing wrong?

As far as I can see only CMS tasks failed on your computers.
The error log tells you what happened:
"207 (0x000000CF) EXIT_NO_SUB_TASKS"

This means that the project wasn't able to deliver a workpackage to run inside a fresh VM and is mostly caused by a WMAgent outage.
You may check the CMS MB to see if other users report the same problem or if the project team (Ivan) suggests to stop pulling CMS tasks for a while.


OK - so I have waited almost 2 months now and tried a couple of CMS WUs - and now I get, after over 2 hours of elapsed time, the following status message:

Postponed: VM job unmanageable, restarting later

At the same time LHCB, sixtrack and theory are running nicely.
Furthermore, there are no more WU being downloaded -

There is enough Ram (64 GB), there is enough free disk space (>400GB) and it is full moon outside ...
So much for unattended running of the LHC project.

As far as I understand, using VBOX takes all the problems out of crunching since one doesn't have to adapt the progs to the cruncher's rig ???
Great idea.

Happy Easter and the likes ...

I would appreciate further help/ideas ...
19) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34586)
Posted 12 Mar 2018 by San-Fernando-Valley
Post:
Thanks for your answer. I appreciate it!

But there are two WU aborted because of "no network connection" -- after almost 4 hours of computing!
Why doesn't the WU wait?

"exit init failure" is another one that doesn't really tell ME anything.

Errors like that stop me from enjoying crunching time ...

Have a nice day.
20) Message boards : Number crunching : ERROR WHILE COMPUTING (11 WUs out of 50 in the last 2 days) (Message 34264)
Posted 4 Feb 2018 by San-Fernando-Valley
Post:
I went through your checklist. In answer to point ...

1) I'm using BOINC x64 client (Manager) 7.8.3 for Win7 and Win10 (on all rigs),

2) I'm using VirtualBox 5.2.6 (Win7 + 10) on all but one rig,
I'm also using VirtualBox 5.2.2 (Win7) on one other rig,
I don't use Hyper-V (?) or Docker,

3) correct ExtensionPack is installed (of no relevance here),

4) VT-X is and has been on,

5) command in client_state.xml shows the number as 0 (zero),

6) RAM = 64GB on each,
plenty disk space (>250GB each rig),

7) In- and Out-communications are OK,

8) AVIRA anti-virus program poses no problem

9) and 10) I'm not running ATLAS (too many problems in the past),
the errors show up in CMS, LHCb and Theory Simulation.

I'm not running any other project besides LHC (at the same time).
The cpu-times vary from 97 to 13,999 secs.
I'm not overclocking - everything at stock ...
Using 4 cores.

ANY idea what I'm doing wrong?



