Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
Author | Message |
---|---|
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
After running perfectly for a week, CMS is now throwing only errors. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10590338 This is an Ubuntu 16.04 machine with VirtualBox 5.1.38. It runs 24/7 and has plenty of memory (32 GB). CMS is the only CPU job running, on seven cores of an i7-3770. I will set it to no new work until this is fixed. NOTE: All of these tasks failed at exactly the same time, after running for various times ranging from 1/2 hour to 5 1/2 hours. There must be a common cause. |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 28,391 |
"After running perfectly for a week, CMS is now throwing only errors." Strange. All VMs crashed within 1 s. The logs show the usual crash info, but I can't find an entry that points to the original reason for the crash. I checked my own systems, a few from other volunteers, and the CMS web stats, but found no obvious errors. I guess it was a local problem. Before you start again, you may want to check and clean up the VBox environment. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"I guess it was a local problem." Thanks. That was my feeling too, but I don't know what it could be. There were two other CMS tasks waiting to run in the buffer. One has now completed successfully after 12 hours, and the other one will complete in a few more hours. Whatever it was has gone away. EDIT: I only see some empty slots, but they disappeared when I did an Ubuntu update. Otherwise BOINC looks normal to me. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
It happened again. All eight running CMS tasks failed at exactly the same time. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10590338 Maybe bad memory, or worse. I will check it later, but in the meantime I will just run Cosmology, which works fine on that machine. One physical process is as good as another in a pinch. EDIT: It seems to happen only when I am running eight CMS tasks at once. Even though I have 32 GB memory, I reserve 12 GB for a write-cache, leaving only 20 GB. Linux is supposed to free the cache memory for use by applications when needed, but maybe it doesn't. If I limit CMS to four work units at once, it might be more reliable. |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 28,391 |
All of that began around 08:10:30, when the BOINC client sent a stop signal to all running VMs: `2019-04-05 08:10:31 (30603): VM state change detected. (old = 'running', new = 'paused')` The crash could have resulted from the system trying to write all images to the disk concurrently. Maybe the BOINC client hit a limit, e.g. RAM/swap usage. Could you check this in your BOINC client preferences? Besides that, you may try VirtualBox 6.0.4. At least on Linux it runs very stably. "I reserve 12 GB for a write-cache, leaving only 20 GB. Linux is supposed to free the cache memory for use by applications when needed, but maybe it doesn't." I think it will at least try to free the RAM but maybe this takes too long and the BOINC client meanwhile stops the VMs. That would make the situation worse. Did you also adjust swappiness? And do you have enough swap space? |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"Maybe the BOINC client hit a limit, e.g. RAM/swap usage." I always set them to 95%. "I think it will at least try to free the RAM but maybe this takes too long and the BOINC client meanwhile stops the VMs." I am reducing the write-cache to 4 GB, and also limiting CMS to four at a time, which is what I want anyway. "Did you also adjust swappiness?" I always set swappiness to 0, but I create 2 GB swap space anyway. It is rarely used. I expect that will fix it. I just use the default repository VBox for convenience. It should work. |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374 |
Heartbeat error on a CMS task. Server state: Over. Outcome: Computation error. Client state: Compute error. Exit status: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT. Computer ID: 10409041. Run time: 11 hours 6 min 52 sec. CPU time: 10 hours 48 min 43 sec. Validate state: Invalid. The server disk (ATA) is too slow for two CMS tasks in parallel. `2021-06-15 09:22:09 (4980): Status Report: Job Duration: '64800.000000'` `2021-06-15 09:22:09 (4980): Status Report: Elapsed Time: '36000.000000'` `2021-06-15 09:22:09 (4980): Status Report: CPU Time: '35291.765625'` `2021-06-15 10:22:15 (4980): VM Heartbeat file specified, but missing heartbeat.` `2021-06-15 10:22:15 (4980): Powering off VM.` https://lhcathome.cern.ch/lhcathome/results.php?hostid=10409041 This computer (AMD A8-3850 APU with Radeon(tm) HD Graphics [Family 18 Model 1 Stepping 0]) has been running well for 16 years now with ONE CMS task and other projects! |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374 |
Correction: The next two CMS tasks also stopped with a heartbeat error. Only one in the last three days finished successfully. I have stopped this computer until there is a solution. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10409041 This PC has a direct connection to a LAN port of the router. `2021-06-15 12:02:46 (272): Detected: VirtualBox VboxManage Interface (Version: 6.1.12)` `2021-06-15 12:02:46 (272): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)` |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"The next two CMS tasks also stopped with a heartbeat error." On Windows 10, you had better give up on VirtualBox 6.x.x. Try 5.2.44. https://www.virtualbox.org/wiki/Download_Old_Builds_5_2 |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374 |
I installed VirtualBox 5.2.44 on this 16-year-old PC yesterday. The first CMS task has finished successfully so far. I will see in the next few days whether two CMS tasks are possible. The disk is only ATA and not SATA. https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10409041 Maybe this is the reason the vboxwrapper reports a heartbeat error. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"The disk is only ATA and not SATA." That is quite possible too. The last time I used an ATA drive, I had to use a ramdisk to eliminate the errors on CPDN. The mechanical disk could not keep up with the writes. |
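
For anyone who wants to check their own task logs for the pattern discussed in this thread, here is a minimal Python sketch. It reads log lines in the shape quoted above (timestamp, process ID in parentheses, message) and reports how long the log was silent before the "missing heartbeat" entry. The regular expression is inferred only from the lines quoted in this thread, not from any documented vboxwrapper format, and the names `parse_line` and `seconds_before_heartbeat_error` are hypothetical helpers made up for this illustration.

```python
import re
from datetime import datetime

# Assumed line shape, inferred from the log excerpts quoted in the thread:
# 2021-06-15 10:22:15 (4980): VM Heartbeat file specified, but missing heartbeat.
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")


def parse_line(line):
    """Return (timestamp, pid, message), or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    stamp, pid, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), int(pid), message


def seconds_before_heartbeat_error(lines):
    """Seconds between the last entry before the heartbeat error and the error.

    Returns None if no heartbeat error is found or nothing precedes it.
    """
    previous = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that is not a timestamped log line
        stamp, _pid, message = parsed
        if "missing heartbeat" in message:
            if previous is None:
                return None
            return (stamp - previous).total_seconds()
        previous = stamp
    return None


# Lines copied from the heartbeat-error post in this thread.
sample = [
    "2021-06-15 09:22:09 (4980): Status Report: CPU Time: '35291.765625'",
    "2021-06-15 10:22:15 (4980): VM Heartbeat file specified, but missing heartbeat.",
    "2021-06-15 10:22:15 (4980): Powering off VM.",
]
gap = seconds_before_heartbeat_error(sample)
print(gap)  # 3606.0
```

For the quoted excerpt the gap is 3606 seconds, about one hour, which is roughly three of the 1200-second heartbeat check intervals reported in the later log from the same host. The sketch only measures the silence in the log; it does not say whether a slow disk, memory pressure, or something else caused it.
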
©2024 CERN