Thread 'CMS Simulation work and all of them get stuck on random % and all of them got aborted.'

Author	Message
Sandman192 Joined: 8 Oct 07 Posts: 13 Credit: 583,590 RAC: 0	Message 45784 - Posted: 5 Dec 2021, 5:19:55 UTC Last modified: 5 Dec 2021, 5:20:10 UTC I just got a bunch of CMS Simulation tasks, and all of them get stuck at a random percentage. VirtualBox Manager shows them as aborted long afterwards, but they are still stuck in both BOINC Manager and VirtualBox. There are 7 of them, ranging from 18.407% down to 0%. Windows 10, BOINC v7.16.20, VirtualBox v6.1.30 ID: 45784

maeax Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 8,570	Message 45785 - Posted: 5 Dec 2021, 7:08:14 UTC - in response to Message 45784. You can reduce your number of CMS tasks in the LHC@home preferences. It seems you have 12 threads and 12 tasks running. That overloads your system. ID: 45785

Jonathan Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0	Message 45786 - Posted: 5 Dec 2021, 9:59:28 UTC - in response to Message 45784. Use your LHC@home preferences to limit the number of work units. Set Max # jobs to the number of true (physical) cores your processor has, or less. VirtualBox jobs run at normal priority, not at the lower priority that legacy BOINC jobs use. ID: 45786
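
The advice above (cap the job count at the number of physical cores) can be combined with a RAM check. The following is only a rough sketch, not an official LHC@home or BOINC formula: the function name, the figure of about 2 GB RAM per VM task, and the 4 GB held back for the operating system and other processes are all assumptions chosen for illustration.

```python
def max_concurrent_vm_tasks(physical_cores: int,
                            total_ram_gb: float,
                            ram_per_task_gb: float = 2.0,   # assumed RAM per VM task
                            reserved_ram_gb: float = 4.0    # assumed headroom for OS and other work
                            ) -> int:
    """Suggest an upper bound for concurrently running VM tasks.

    The result is limited both by the number of physical cores and by
    how many tasks fit into the RAM left after the reserved headroom.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    if ram_per_task_gb <= 0:
        raise ValueError("ram_per_task_gb must be positive")

    usable_ram_gb = max(total_ram_gb - reserved_ram_gb, 0.0)
    limit_by_ram = int(usable_ram_gb // ram_per_task_gb)
    return max(0, min(physical_cores, limit_by_ram))


# Hypothetical host: 12 threads assumed to be 6 physical cores, 32 GB RAM.
suggested = max_concurrent_vm_tasks(physical_cores=6, total_ram_gb=32)
print(f"Suggested Max # jobs: {suggested}")
```

Whichever limit is lower wins, so a host with plenty of RAM is still capped by its core count, and a host with many cores but little RAM is capped by memory.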

Sandman192 Joined: 8 Oct 07 Posts: 13 Credit: 583,590 RAC: 0	Message 45865 - Posted: 15 Dec 2021, 0:41:05 UTC BOINC handles RAM automatically. It will not start more tasks from RAM-hungry projects than the available RAM allows. Note: I'm only using 7 of my 12 cores for CMS because of the limited RAM I have. So no, it's not because it's running out of RAM. BOINC simply doesn't run any more CMS tasks because of the RAM I have. I've been running BOINC for years and have seen other projects that are set to use all cores and lots of RAM never use all cores because of the RAM limit, with projects that need less RAM using the rest of my cores. Note: that's including VBOX. I had no issues for years on this until now. CMS ran fine before. Updating Windows, BOINC, drivers, or VBOX may have something to do with this new problem. ID: 45865

Sandman192 Joined: 8 Oct 07 Posts: 13 Credit: 583,590 RAC: 0	Message 45866 - Posted: 15 Dec 2021, 0:48:36 UTC - in response to Message 45785. Last modified: 15 Dec 2021, 0:52:08 UTC 12 tasks running: yes. Only 7 of them are CMS. The other projects are running because they use less RAM. I've seen this happen on many occasions. And if it were overloading my system, the other projects would have frozen or slowed down too. They have not, and they finished on time. Plus, all I have at the time is only 7 CMS download at that time anyway. ID: 45866

Jonathan Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0	Message 45867 - Posted: 15 Dec 2021, 1:02:38 UTC - in response to Message 45866. Are you still having trouble? We can only go off the information you provide. ID: 45867

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2760 Credit: 304,835,907 RAC: 107,838	Message 45868 - Posted: 15 Dec 2021, 9:00:36 UTC - in response to Message 45866.
[quote]... all I have at the time is only 7 CMS download at that time anyway ...[/quote]
According to your computer's task list there are currently 41 CMS tasks in progress: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10638796&offset=0&show_names=0&state=0&appid=
Did you reset the project (more than once?) without reporting the tasks (even failed or cancelled ones) prior to the reset?
7 CMS tasks running concurrently would allocate a bit more than 14 GB RAM. Your computer has 32 GB - more than enough as long as other processes don't require all of that. Nonetheless, the BOINC settings should be checked to ensure the upper limits there are high enough.
You currently have 1 successful CMS task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=334918754
Even that one shows some trouble:
[pre]
2021-12-14 19:37:26 (4012): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-14 19:46:28 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 19:46:51 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 20:53:13 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:05:26 (4012): VM state change detected. (old = 'paused', new = 'running')
2021-12-14 21:06:02 (4012): VM state change detected. (old = 'running', new = 'paused')
2021-12-14 21:15:38 (4012): VM state change detected. (old = 'paused', new = 'running')
[/pre]
The 1st suspend happens just a few minutes after "CMS application starting". That's a critical phase where CMS connects to WMAgent/HTCondor to get a subtask. A paused VM won't finish the communication with the project servers, and they would mark the subtask as lost. When your VM resumes, it will try to continue a communication the server has already dropped. In this case the suspend may luckily have happened without disturbing a critical connection.
In addition, that's the phase where the VM allocates lots of additional RAM. That might be the reason why BOINC suspends the task. The resume following just a few seconds later points in the same direction. Hence the suggestion to check the RAM limits.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=334918401
[pre]
2021-12-03 00:49:35 (11232): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-03 01:48:12 (11232): VM state change detected. (old = 'running', new = 'paused')
2021-12-04 22:33:53 (11232): Stopping VM.
2021-12-04 22:33:53 (11232): Error in stop VM for VM: -108
[/pre]
It appears that this task was running fine until 2021-12-03 01:48:12. The next log entry was made far more than 24h later, whereas the maximum time limit is 18h. At least 4 other tasks show similar lines; all were suspended within the same minute.
Depending on the BOINC settings, suspending a task causes a snapshot of the complete VM RAM (2 GB) to be written to the disk. Writing 5x2 GB (or more) to the disk puts a huge load on the I/O system. If possible, these snapshots should not be started concurrently.
[quote]... I had no issues for years on this until now. CMS ran fine before. ...[/quote]
At least 1 of your Cosmology tasks (also on vbox) shows a similar error: http://www.cosmologyathome.org/result.php?resultid=51292341
ID: 45868
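
The log reading described in this post can be repeated on any task's stderr output. Below is a minimal sketch, assuming the log lines follow the exact format quoted in the thread; the helper names `parse_events` and `paused_intervals` are made up for this example and are not part of BOINC or vboxwrapper. A pause is treated as ending at the next log entry, whatever that entry is, and the 18h limit is the figure stated in the post.

```python
import re
from datetime import datetime, timedelta

# Line format assumed from the logs quoted in the thread:
# "YYYY-MM-DD HH:MM:SS (pid): message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")
STATE_RE = re.compile(
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)

MAX_RUNTIME = timedelta(hours=18)  # time limit mentioned in the post


def parse_events(log_text):
    """Return a list of (timestamp, message) for every well-formed line."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def paused_intervals(events):
    """Return (pause_start, duration) pairs.

    A pause starts at a state change to 'paused' and is taken to end
    at the next log entry.
    """
    intervals = []
    for (stamp, message), (next_stamp, _) in zip(events, events[1:]):
        match = STATE_RE.search(message)
        if match and match.group(2) == "paused":
            intervals.append((stamp, next_stamp - stamp))
    return intervals


LOG = """\
2021-12-03 00:49:35 (11232): Guest Log: [INFO] CMS application starting. Check log files.
2021-12-03 01:48:12 (11232): VM state change detected. (old = 'running', new = 'paused')
2021-12-04 22:33:53 (11232): Stopping VM.
2021-12-04 22:33:53 (11232): Error in stop VM for VM: -108
"""

for start, duration in paused_intervals(parse_events(LOG)):
    verdict = "exceeds 18h limit" if duration > MAX_RUNTIME else "ok"
    print(f"{start} paused for {duration} ({verdict})")
```

For the second log above, the single pause lasts from 2021-12-03 01:48:12 until the "Stopping VM" entry on 2021-12-04 22:33:53, which is well beyond the 18h limit.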