Theory and CMS jobs fail after resume (Win + VBox)

broz69 Joined: 28 Nov 08 Posts: 30 Credit: 14,964,861 RAC: 0
Message 48969 - Posted: 29 Nov 2023, 19:17:08 UTC Last modified: 29 Nov 2023, 19:17:29 UTC

Hi,

I thought I would share this with you. I am seeing undesired behaviour on two of my LHC crunching machines, both running Windows 11 with VBox (hostid=10834815 and hostid=10616627). The situation is as follows:

- some Theory and CMS jobs run
- BOINC fetches some new work, and these tasks happen to be Atlas
- once the tasks have downloaded, BOINC decides (probably based on some deadline-related algorithm) that the Atlas jobs have priority
- the Theory and CMS jobs get either paused or saved in VBox (which is already strange: why are some tasks saved and others paused?)
- the Atlas jobs run and finish OK
- after the Atlas jobs finish, some of the old jobs continue OK and some do not and throw this error (on hostid=10834815 only two tasks finished OK, the other 6 threw the error):

CMS_3858007_1701010523.959683 217385371
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint A breakpoint has been reached. (0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF. Click on OK to terminate the program

Theory_2390-1127772-964 217376900
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint A breakpoint has been reached. (0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF. Click on OK to terminate the program

Theory_2390-1104457-964 217377861
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint A breakpoint has been reached. (0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF. Click on OK to terminate the program

Theory_2390-1115150-964 217378166
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint A breakpoint has been reached. (0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF. Click on OK to terminate the program

Theory_2390-1122909-968 217391846
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint A breakpoint has been reached. (0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF. Click on OK to terminate the program

CMS_3856916_1701010223.879077 217385367
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint A breakpoint has been reached. (0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF. Click on OK to terminate the program

BOINC setting "Leave non-GPU tasks in memory while suspended" = false.

I have seen this behaviour for some time now. I have been following it more closely over the last few days, and it is always the same: if you check the failed jobs for these two hostids, they always fail because of this switching between different LHC jobs. I'm not using BOINC for any other computation. Linux native seems OK.

Best regards.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1443 Credit: 9,703,721 RAC: 1,548
Message 48970 - Posted: 30 Nov 2023, 11:09:02 UTC

You have loaded too many tasks, and because ATLAS has a shorter deadline than CMS, CMS is paused. ATLAS uses 8 threads on your system, so all Theory and CMS tasks are suspended. When they are not left in memory, they all want to write their VM state to disk at the same time. This takes too long and the VMs get corrupted.

Advice: reduce your cache buffer and request fewer tasks. Theory will survive when left in memory, but CMS wants an uninterrupted internet connection.
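The advice above (a smaller work buffer, and keeping suspended tasks in memory) can also be set through a local override file instead of the BOINC Manager dialogs. What follows is only a sketch: the helper name `build_prefs_override` and the example values are made up for illustration, and the tag names are the standard BOINC preference tags as far as I know. The result would be saved as global_prefs_override.xml in the BOINC data directory.

```python
def build_prefs_override(min_days: float, extra_days: float, keep_in_memory: bool) -> str:
    """Return the text of a global_prefs_override.xml file.

    min_days / extra_days: how much work BOINC should keep buffered, in days.
    keep_in_memory: leave suspended tasks in RAM, so a suspended VM
    does not have to write its state to disk.
    """
    if min_days < 0 or extra_days < 0:
        raise ValueError("buffer sizes must not be negative")
    lines = [
        "<global_preferences>",
        f"  <work_buf_min_days>{min_days:.2f}</work_buf_min_days>",
        f"  <work_buf_additional_days>{extra_days:.2f}</work_buf_additional_days>",
        f"  <leave_apps_in_memory>{1 if keep_in_memory else 0}</leave_apps_in_memory>",
        "</global_preferences>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only: a buffer of about 0.1 days, no additional work.
    print(build_prefs_override(0.1, 0.0, True))
```

The numbers are placeholders. The point is only that a small buffer makes it less likely that a batch of ATLAS tasks arrives while Theory and CMS VMs are still running.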

broz69 Joined: 28 Nov 08 Posts: 30 Credit: 14,964,861 RAC: 0
Message 48971 - Posted: 30 Nov 2023, 18:39:47 UTC - in response to Message 48970. Last modified: 30 Nov 2023, 18:45:35 UTC

Hi,

Thank you for this answer. I can't say that I like it. I don't think 8 is a lot of tasks compared to others who have even more cores... I also checked other users with similar configurations, and all of them had the same issues at some time in the past. So I can't say that my problem is unique.

In the meantime I did the following:

- I created a RAM disk with ImDisk (RAM speed is 3200MHz)
- in Oracle VM VirtualBox Manager I created and installed 8 Linux machines (2GB RAM, 1 core) and saved these 8 virtual disks (dynamic size) to the RAM disk
- I created a group with all of them
- then I ran all 8 of them at the same time (as a group) - no problem
- I paused all 8 of them - no problem
- I resumed all 8 of them - no problem
- I saved all 8 of them - no problem
- I resumed all 8 of them - one machine was corrupted and did not start correctly

The problem after all these steps was that Oracle VM VirtualBox Manager was consuming 100% of the processor (all 8C/16T at 100%). I had to suspend all VMs and restart the manager.

Now back to my problem. If I suspend and then resume a CMS job, it seems to continue running OK. I tried it and it's OK. As I see it, the problem lies either in VBox, which cannot handle suspend/resume of "large" numbers of VMs, or in BOINC, which does something that VBox can't handle.

- Can't we ask BOINC not to suspend/resume all VM jobs at the same time, but one at a time with some delay between them?
- Or why does BOINC even do this when we know that VBox can't handle suspend/resume of "large" numbers of VMs at the same time?
- Or why does BOINC even start Atlas jobs when we know that there will be problems with the CMS and Theory jobs that are running?
- Or can we ask Oracle to fix VBox Manager?

What's your view on this?

What I'd like to achieve is that Theory and CMS jobs resume without error. When they fail, the time and energy spent on the failed jobs is wasted, and on top of that we don't get any credits ;)

Best regards.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1443 Credit: 9,703,721 RAC: 1,548
Message 48972 - Posted: 1 Dec 2023, 8:10:43 UTC - in response to Message 48971.

"What's your view on this?"

The problem is in vboxwrapper. It is hard-coded there that the save to disk must complete within 30 seconds. Normally, with 1 VM, this is not a problem, but saving 7 VMs at once leads to corrupted VMs.

Solutions:
1. Use a profile for only ATLAS tasks or only Theory/CMS combined. (Possibly 2 profiles for different machines.)
2. If you have enough RAM: keep CPU tasks in memory when suspended. (The VMs will then not need to be saved to disk.)
3. Manually save the Theory and CMS VMs to disk, one after another with an interval between them, before starting an ATLAS task.
4. Reduce the number of threads for an ATLAS task using app_config.xml.

Mind you: when an ATLAS task finishes and Theory/CMS has to start 4 or 8 tasks at the same time, that could also lead to problems, so solution 1 seems the best to me.
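For solution 4, a minimal sketch of what such an app_config.xml could look like, generated here with a small Python helper. The helper `build_app_config` is hypothetical, and the app name and plan class in the example call are assumptions that are not confirmed anywhere in this thread. The names that apply to a given host can be looked up in its client_state.xml, and whether the VM really starts with the lower core count should be checked on that host.

```python
def build_app_config(app_name: str, plan_class: str, ncpus: int) -> str:
    """Return the text of an app_config.xml that limits one app version
    to `ncpus` threads via <avg_ncpus>."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    lines = [
        "<app_config>",
        "  <app_version>",
        f"    <app_name>{app_name}</app_name>",
        f"    <plan_class>{plan_class}</plan_class>",
        f"    <avg_ncpus>{ncpus}</avg_ncpus>",
        "  </app_version>",
        "</app_config>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # ASSUMPTION: "ATLAS" and "vbox64_mt_mcore_atlas" are illustrative values;
    # verify both against client_state.xml before using them.
    print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas", 4))
```

The file belongs in the project's folder inside the BOINC data directory. With 4 instead of 8 threads for ATLAS, fewer Theory/CMS VMs would have to be suspended at the same moment.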

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2624 Credit: 265,569,879 RAC: 132,567
Message 48973 - Posted: 1 Dec 2023, 8:41:02 UTC - in response to Message 48972.

Not really an issue related to vboxwrapper. See here:
https://www.virtualbox.org/manual/UserManual.html#ts_config-periodic-flush
According to that statement, a VM may run into problems if disk writes take longer than 15 s.

At the moment the only solution is to ensure that not too many VMs start/pause/resume concurrently, as this puts a huge load on the I/O system. It helps to "keep tasks in memory", but so far BOINC can't stagger/delay vbox operations.

@all
Either find a (volunteer!) developer at https://github.com/BOINC/boinc who implements the required functionality, or implement it yourself and submit a PR.
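Until BOINC can stagger these operations itself, the same effect can be approximated from outside by suspending the VM tasks one at a time through boinccmd, with a pause between them. This is a rough sketch, not a tested tool: the function names, the 60 second delay, the project URL and the task names in the example are all placeholders, and boinccmd may need to be run from the BOINC data directory or be given a password on some installations.

```python
import subprocess
import time
from typing import Callable, List


def run_boinccmd(args: List[str]) -> None:
    """Run boinccmd with the given arguments; raises if it reports an error."""
    subprocess.run(["boinccmd", *args], check=True)


def suspend_tasks_staggered(
    project_url: str,
    task_names: List[str],
    delay_s: float = 60.0,
    run: Callable[[List[str]], None] = run_boinccmd,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Suspend tasks one by one, waiting delay_s seconds between them,
    so that the VMs do not all hit the disk at the same time."""
    suspended = []
    for index, name in enumerate(task_names):
        if index > 0:
            sleep(delay_s)  # give the previous VM time to finish writing
        run(["--task", project_url, name, "suspend"])
        suspended.append(name)
    return suspended


if __name__ == "__main__":
    # ASSUMPTION: placeholder URL and task names; take the real ones
    # from the task list of your own BOINC client.
    suspend_tasks_staggered(
        "https://example.invalid/project/",
        ["Theory_task_a", "Theory_task_b", "CMS_task_c"],
        delay_s=60.0,
    )
```

Resuming could be staggered the same way by passing "resume" instead of "suspend". The `run` and `sleep` parameters exist only so the loop can be tried out without touching a real client.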

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1443 Credit: 9,703,721 RAC: 1,548
Message 48977 - Posted: 3 Dec 2023, 10:50:36 UTC - in response to Message 48973.

"Not really an issue related to vboxwrapper."

I'm sure that vboxwrapper takes care of VMs that are not saved to disk fast enough. 'Care' is a nice word for 'killing'.

    2023-12-03 09:50:45 (1452): Stopping VM.
    2023-12-03 09:51:31 (1452): Error in stop VM for VM: -182
    Command:
    VBoxManage -q controlvm "boinc_80c6fb5e626f397f" savestate
    Output:
    0%...10%...20%...30%...40%...
    2023-12-03 09:51:31 (1452): VM did not stop when requested.
    2023-12-03 09:51:31 (1452): VM was successfully terminated.

LHC@home