Questions and Answers : Windows : CMS Simulation work and all of them get stuck on random % and all of them got aborted.

Sandman192

Joined: 8 Oct 07
Posts: 6
Credit: 572,360
RAC: 0
Message 45784 - Posted: 5 Dec 2021, 5:19:55 UTC
Last modified: 5 Dec 2021, 5:20:10 UTC

I just got a bunch of CMS Simulation tasks, and all of them get stuck at a random percentage. Long afterwards, VirtualBox Manager shows them as aborted, yet they are still stuck in BOINC Manager and VirtualBox. There are 7 of them, stuck anywhere from 18.407% down to 0%.

Windows 10
BOINC v7.16.20
VirtualBox v6.1.30
maeax

Joined: 2 May 07
Posts: 1626
Credit: 77,853,582
RAC: 264,301
Message 45785 - Posted: 5 Dec 2021, 7:08:14 UTC - in response to Message 45784.  

You can reduce the number of CMS tasks in your LHC@home preferences.
It seems you have 12 threads and 12 tasks running.
That is likely overloading your system.
Jonathan

Joined: 25 Sep 17
Posts: 76
Credit: 2,098,567
RAC: 12
Message 45786 - Posted: 5 Dec 2021, 9:59:28 UTC - in response to Message 45784.  

Use your LHC@home preferences to limit the number of work units.
Set Max # jobs to the number of physical cores your processor has, or fewer.
VirtualBox jobs run at normal priority, not at the lower priority that legacy BOINC jobs use.
Sandman192

Joined: 8 Oct 07
Posts: 6
Credit: 572,360
RAC: 0
Message 45865 - Posted: 15 Dec 2021, 0:41:05 UTC

BOINC handles RAM automatically: it will not start more tasks from RAM-heavy projects than the available memory allows. Note: as I said, only 7 of my 12 cores are running CMS, because of the limited RAM I have. So no, the problem is not that the machine is running out of RAM. BOINC simply does not start any more CMS tasks because of the RAM I have.
I've been running BOINC for years and have seen other projects that are set to use all cores and lots of RAM never actually use all cores because of the RAM limit. BOINC then fills the remaining cores with tasks from projects that use less RAM. Note: that includes VirtualBox projects.

I had no issues for years on this until now. CMS ran fine before.

Updating Windows, BOINC, drivers, or VirtualBox may have something to do with this new problem.
Sandman192

Joined: 8 Oct 07
Posts: 6
Credit: 572,360
RAC: 0
Message 45866 - Posted: 15 Dec 2021, 0:48:36 UTC - in response to Message 45785.  
Last modified: 15 Dec 2021, 0:52:08 UTC

12 tasks running: yes, but only 7 of them are CMS. The rest come from other projects, which run because they use less RAM. I've seen this happen on many occasions. And if my system were overloaded, those other projects would have frozen or slowed down too. They have not; they finished on time.
Plus, I only had 7 CMS tasks downloaded at that time anyway.
Jonathan

Joined: 25 Sep 17
Posts: 76
Credit: 2,098,567
RAC: 12
Message 45867 - Posted: 15 Dec 2021, 1:02:38 UTC - in response to Message 45866.  

Are you still having trouble? We can only go off the information you provide.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2051
Credit: 155,226,811
RAC: 133,363
Message 45868 - Posted: 15 Dec 2021, 9:00:36 UTC - in response to Message 45866.  

... I only had 7 CMS tasks downloaded at that time anyway ...

According to your computer's tasklist there are currently 41 CMS tasks in progress:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10638796&offset=0&show_names=0&state=0&appid=
Did you reset the project (more than once?) without reporting the tasks (even failed or cancelled ones) prior to the reset?

7 CMS tasks running concurrently would allocate a bit more than 14 GB RAM.
Your computer has 32 GB - more than enough as long as other processes don't require all of that.
Nonetheless, the BOINC settings should be checked to ensure the upper limits there are high enough.
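
The RAM arithmetic above can be sketched in Python. This is only a rough illustration: the 2 GB per VM is the VM RAM size quoted further down in this post, per-VM overhead is ignored, and the 50% and 40% memory limits are hypothetical values chosen for the example, not the settings of the computer discussed here. The function names are made up for this sketch.

```python
# Rough RAM budget for concurrent CMS VMs (illustrative numbers only).
VM_RAM_GB = 2.0       # RAM per CMS VM, as quoted for the VM snapshot size
TOTAL_RAM_GB = 32.0   # RAM of the host discussed in this thread


def cms_ram_gb(tasks: int, vm_ram_gb: float = VM_RAM_GB) -> float:
    """Lower bound on the RAM used by `tasks` VMs (overhead not included)."""
    return tasks * vm_ram_gb


def max_tasks(total_ram_gb: float, limit_percent: float,
              vm_ram_gb: float = VM_RAM_GB) -> int:
    """Number of VMs that fit under a 'use at most X% of memory' style limit."""
    usable_gb = total_ram_gb * limit_percent / 100
    return int(usable_gb // vm_ram_gb)


print(cms_ram_gb(7))                  # 14.0
print(max_tasks(TOTAL_RAM_GB, 50.0))  # 8 (hypothetical 50% limit)
print(max_tasks(TOTAL_RAM_GB, 40.0))  # 6 (hypothetical 40% limit)
```

With a hypothetical 40% limit, 7 VMs would no longer fit, which is the kind of situation in which BOINC could suspend a task while the VM is still allocating memory.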


You currently have 1 successful CMS task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=334918754
Even that one shows some trouble:
2021-12-14 19:37:26 (4012): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-14 19:46:28 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 19:46:51 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 20:53:13 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:05:26 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 21:06:02 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:15:38 (4012): VM state change detected. (old = 'paused', new = 'running')

The first suspend happens just a few minutes after "CMS application starting".
That's a critical phase in which CMS connects to WMAgent/HTCondor to get a subtask.
A paused VM can't finish its communication with the project servers, so they mark the subtask as lost.
When your VM resumes, it tries to continue a communication the server has already dropped.

In this case the suspend may luckily have happened without disturbing a critical connection.

In addition, that's the phase in which the VM allocates lots of additional RAM.
That allocation might be the reason why BOINC suspends the task.
The resume following just a few seconds later points in the same direction.
Hence the suggestion to check the RAM limits.
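
The pause lengths can be read straight from the log excerpt above. A minimal Python sketch, assuming the log lines keep exactly the format shown there (the name `pause_durations` is made up for this illustration and is not part of BOINC or vboxwrapper):

```python
import re
from datetime import datetime

# Log excerpt copied from the successful task quoted above.
LOG = """\
2021-12-14 19:37:26 (4012): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-14 19:46:28 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 19:46:51 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 20:53:13 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:05:26 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 21:06:02 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:15:38 (4012): VM state change detected. (old = 'paused', new = 'running')
"""

STATE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)


def pause_durations(log_text):
    """Return the length in seconds of every paused -> running interval."""
    durations = []
    paused_at = None
    for line in log_text.splitlines():
        match = STATE_RE.match(line.strip())
        if not match:
            continue  # skip lines that are not VM state changes
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        new_state = match.group(3)
        if new_state == "paused":
            paused_at = stamp
        elif new_state == "running" and paused_at is not None:
            durations.append(int((stamp - paused_at).total_seconds()))
            paused_at = None
    return durations


print(pause_durations(LOG))  # [23, 733, 576]
```

The first pause lasts 23 seconds, which matches the "few seconds" noted above; the two later pauses last roughly 12 and 10 minutes.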



https://lhcathome.cern.ch/lhcathome/result.php?resultid=334918401
2021-12-03 00:49:35 (11232): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-03 01:48:12 (11232): VM state change detected. (old = 'running', new = 'paused')
2021-12-04 22:33:53 (11232): Stopping VM.
2021-12-04 22:33:53 (11232): Error in stop VM for VM: -108

It appears that this task was running fine until 2021-12-03 01:48:12.
The next log entry was made nearly 45 hours later, whereas the maximum time limit is 18h.
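
The gap between those two log entries can be checked with a few lines of Python. This is just a sketch of the arithmetic: the two timestamps come from the log above, the 18h limit is the value stated in this post, and the helper name `gap_between` is invented for the example.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
MAX_RUNTIME = timedelta(hours=18)  # time limit as stated in this post


def gap_between(earlier: str, later: str) -> timedelta:
    """Time elapsed between two log timestamps in the format used above."""
    return datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)


gap = gap_between("2021-12-03 01:48:12", "2021-12-04 22:33:53")
print(gap)                # 1 day, 20:45:41
print(gap > MAX_RUNTIME)  # True
```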

At least 4 other tasks show similar lines; all were suspended within the same minute.
Depending on the BOINC settings, suspending a task causes a snapshot of the complete VM RAM (2 GB) to be written to the disk.
Writing 5x2 GB (or more) to the disk puts a huge load on the I/O system.
If possible, these snapshots should not be started concurrently.



... I had no issues for years on this until now. CMS ran fine before. ...

At least 1 of your Cosmology@home tasks (which also run on VirtualBox) shows a similar error:
http://www.cosmologyathome.org/result.php?resultid=51292341


©2022 CERN