CMS Simulation work and all of them get stuck on random % and all of them got aborted.
Author | Message |
---|---|
Joined: 8 Oct 07 Posts: 13 Credit: 583,590 RAC: 0 |
I just got a bunch of CMS Simulation tasks, and all of them get stuck at random percentages. All of them show as aborted in VirtualBox Manager long afterwards, yet they are still stuck in BOINC Manager and VirtualBox. There are 7 of them, stuck anywhere from 18.407% down to 0%. Windows 10, BOINC v7.16.20, VirtualBox v6.1.30 |
Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,757 |
You can reduce your number of CMS tasks in the LHC preferences. It seems you have 12 threads and 12 tasks running. This overloads your system to some degree. |
Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
Use your LHC@home preferences to limit the number of work units. Set Max # jobs to the number of true cores your processor has, or fewer. VirtualBox jobs run at normal priority, not at the lower priority that legacy BOINC jobs use. |
Joined: 8 Oct 07 Posts: 13 Credit: 583,590 RAC: 0 |
BOINC manages RAM automatically. It will not exceed the RAM limit for projects that use lots of RAM. Note: I'm only using 7 of my 12 cores for CMS because of the limited RAM I have. So no, the problem is not that it's running out of RAM. BOINC simply isn't starting any more CMS tasks because of the RAM I have. I've been running BOINC for years and have seen other projects that are set to use all cores and lots of RAM but never used all cores because of the RAM limit, with projects that use less RAM running on the rest of my cores. Note: that includes VBOX. I had no issues with this for years until now. CMS ran fine before. Updating Windows, BOINC, drivers, or VBOX may have something to do with this new problem. |
Joined: 8 Oct 07 Posts: 13 Credit: 583,590 RAC: 0 |
12 tasks running: yes. Only 7 of them are CMS. Other projects are running because they use less RAM. I've seen this happen on many occasions. And if it were overloading my system, the other projects would have frozen or slowed down too. They have not, and they finished on time. Plus, I only had 7 CMS tasks downloaded at that time anyway. |
Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
Are you still having trouble? We can only go off the information you provide. |
Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,187 RAC: 73,124 |
... all I have at the time is only 7 CMS download at that time anyway ...

According to your computer's task list there are currently 41 CMS tasks in progress:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10638796&offset=0&show_names=0&state=0&appid=
Did you reset the project (more than once?) without reporting the tasks (even failed or cancelled ones) prior to the reset?

7 CMS tasks running concurrently would allocate a bit more than 14 GB RAM. Your computer has 32 GB, which is more than enough as long as other processes don't require all of that. Nonetheless, the BOINC settings should be checked to ensure the upper limits there are high enough.

You currently have 1 successful CMS task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=334918754
Even that one shows some trouble:

2021-12-14 19:37:26 (4012): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-14 19:46:28 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 19:46:51 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 20:53:13 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:05:26 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 21:06:02 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:15:38 (4012): VM state change detected. (old = 'paused', new = 'running')

The 1st suspend happens just a few minutes after "CMS application starting". That's a critical phase where CMS connects to WMAgent/HTCondor to get a subtask. A paused VM won't finish the communication with the project servers, and they would mark the subtask as lost. When your VM resumes, it will try to continue a communication that the server has already dropped. In this case the suspend may luckily have happened without disturbing a critical connection.

In addition, that's the phase where the VM allocates lots of additional RAM. The latter might be the reason why BOINC suspends the task. The resume following just a few seconds later points in the same direction. Hence the suggestion to check the RAM limits.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=334918401

2021-12-03 00:49:35 (11232): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-03 01:48:12 (11232): VM state change detected. (old = 'running', new = 'paused')
2021-12-04 22:33:53 (11232): Stopping VM.
2021-12-04 22:33:53 (11232): Error in stop VM for VM: -108

It appears that this task was running fine until 2021-12-03 01:48:12. The next log entry was made far more than 24h later, whereas the maximum time limit is 18h. At least 4 other tasks show similar lines, and all were suspended within the same minute. Depending on the BOINC settings, suspending a task causes a snapshot of the complete VM RAM (2 GB) to be written to the disk. Writing 5x2 GB (or more) to the disk puts a huge load on the I/O system. If possible, these snapshots should not be started concurrently.

... I had no issues for years on this until now. CMS ran fine before. ...

At least 1 of your Cosmology tasks (also on vbox) shows a similar error:
http://www.cosmologyathome.org/result.php?resultid=51292341 |
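For anyone who wants to check their own task output for the same pattern, here is a minimal Python sketch that pairs each 'paused' entry with the following 'running' entry and reports how long the VM was suspended. It assumes the log lines have exactly the format quoted above. The names `LOG_LINE`, `pause_intervals` and `SAMPLE` are made up for this illustration and are not part of BOINC or vboxwrapper.

```python
import re
from datetime import datetime

# Matches the "VM state change detected" lines as quoted in the post above.
# Assumption: timestamp, process id in parentheses, then the old/new states.
LOG_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)


def pause_intervals(log_text):
    """Return a list of (paused_at, resumed_at, duration) for each completed pause."""
    intervals = []
    paused_at = None
    for match in LOG_LINE.finditer(log_text):
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        new_state = match.group(3)
        if new_state == "paused":
            paused_at = stamp
        elif new_state == "running" and paused_at is not None:
            intervals.append((paused_at, stamp, stamp - paused_at))
            paused_at = None
    return intervals


# State-change lines copied from the successful task quoted in the post.
SAMPLE = """
2021-12-14 19:46:28 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 19:46:51 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 20:53:13 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:05:26 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 21:06:02 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:15:38 (4012): VM state change detected. (old = 'paused', new = 'running')
"""

if __name__ == "__main__":
    for paused_at, resumed_at, duration in pause_intervals(SAMPLE):
        print(f"paused {paused_at} -> resumed {resumed_at} ({int(duration.total_seconds())} s)")
```

On the lines quoted above, this reports three pauses: the first lasts 23 seconds and the other two last roughly 12 and 10 minutes. A pause that never resumes, as in the second task, leaves no completed interval and so would not appear in the output.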
©2024 CERN