Message boards : CMS Application : Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT

Jim1348

Joined: 15 Nov 14
Posts: 570
Credit: 18,005,801
RAC: 22,619
Message 38515 - Posted: 2 Apr 2019, 22:39:51 UTC
Last modified: 2 Apr 2019, 22:43:40 UTC

After running perfectly for a week, CMS is now throwing only errors.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10590338

This is an Ubuntu 16.04 machine with VirtualBox 5.1.38. It runs 24/7 and has plenty (32 GB) of memory.
CMS is the only CPU job running, on seven cores of an i7-3770. I will place it on no new work until fixed.

NOTE: All of these tasks failed at exactly the same time, after running for various times ranging from 1/2 hour to 5 1/2 hours. There must be a common cause.
ID: 38515
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1824
Credit: 123,728,145
RAC: 85,782
Message 38520 - Posted: 3 Apr 2019, 6:11:02 UTC - in response to Message 38515.  

After running perfectly for a week, CMS is now throwing only errors.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10590338

This is an Ubuntu 16.04 machine with VirtualBox 5.1.38. It runs 24/7 and has plenty (32 GB) of memory.
CMS is the only CPU job running, on seven cores of an i7-3770. I will place it on no new work until fixed.

NOTE: All of these tasks failed at exactly the same time, after running for various times ranging from 1/2 hour to 5 1/2 hours. There must be a common cause.

Strange.
All VMs crashed within 1 s.

The logs show the usual crash info but I can't find an entry that points out the original reason for the crash.
Checked my own systems, a few from other volunteers and the CMS web stats but found no obvious errors.

I guess it was a local problem.
Before you start again you may check/clean the VBox environment.
ID: 38520
Jim1348

Joined: 15 Nov 14
Posts: 570
Credit: 18,005,801
RAC: 22,619
Message 38524 - Posted: 3 Apr 2019, 10:41:33 UTC - in response to Message 38520.  
Last modified: 3 Apr 2019, 11:25:21 UTC

I guess it was a local problem.
Before you start again you may check/clean the VBox environment.

Thanks. That was my feeling too, but I don't know what it could be. There were two other CMS tasks waiting to run in the buffer. One has now completed successfully after 12 hours, and the other one will complete in a few more hours. Whatever it was has gone away.

EDIT: I only saw some empty slots, but they disappeared when I did an Ubuntu update. Otherwise BOINC looks normal to me.
ID: 38524
Jim1348

Joined: 15 Nov 14
Posts: 570
Credit: 18,005,801
RAC: 22,619
Message 38535 - Posted: 5 Apr 2019, 13:18:52 UTC
Last modified: 5 Apr 2019, 13:55:48 UTC

It happened again. All eight running CMS tasks failed at exactly the same time.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10590338

Maybe bad memory, or worse. I will check it later, but in the meantime I will just run Cosmology, which works fine on that machine.
One physical process is as good as another in a pinch.

EDIT: It seems to happen only when I am running eight CMS at once. Even though I have 32 GB memory, I reserve 12 GB for a write-cache, leaving only 20 GB. Linux is supposed to free the cache memory for use by applications when needed, but maybe it doesn't.
If I limit CMS to four work units at once, it might be more reliable.
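
The memory arithmetic above can be sketched as follows. This is only a rough illustration using the figures from this post; the function name is hypothetical, and how much RAM a CMS VM actually needs is not stated in this thread.

```python
def per_task_budget_gb(total_gb: float, reserved_gb: float, tasks: int) -> float:
    """Return the RAM (in GB) left per concurrent task after the reserved cache."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    return (total_gb - reserved_gb) / tasks


# Figures from the post: 32 GB total, 12 GB reserved for the write-cache.
print(per_task_budget_gb(32, 12, 8))  # eight CMS tasks at once
print(per_task_budget_gb(32, 12, 4))  # limited to four tasks
```

Halving the number of concurrent tasks doubles the headroom per task, which is the effect the limit is meant to achieve.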
ID: 38535
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1824
Credit: 123,728,145
RAC: 85,782
Message 38537 - Posted: 5 Apr 2019, 14:16:53 UTC - in response to Message 38535.  

All of that began around 08:10:30 when the BOINC client sent a stop signal to all running VMs.
2019-04-05 08:10:31 (30603): VM state change detected. (old = 'running', new = 'paused')
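
To find such lines in a longer task log, a small filter like the one below may help. It is a minimal sketch that assumes every state-change line has exactly the format of the line quoted above; the function name is hypothetical.

```python
import re

# Pattern modelled on the single log line quoted in this post.
STATE_CHANGE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"VM state change detected\. "
    r"\(old = '(?P<old>[^']*)', new = '(?P<new>[^']*)'\)"
)


def find_state_changes(lines):
    """Yield (time, pid, old_state, new_state) for each state-change line."""
    for line in lines:
        match = STATE_CHANGE.match(line.strip())
        if match:
            yield match["time"], int(match["pid"]), match["old"], match["new"]


sample = [
    "2019-04-05 08:10:31 (30603): VM state change detected. "
    "(old = 'running', new = 'paused')",
]
print(list(find_state_changes(sample)))
```

Running it over the logs of all failed tasks would show whether every VM was paused within the same second.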

The crash may have been caused by the system trying to write all VM images to disk concurrently.

Maybe the BOINC client hit a limit, e.g. RAM/swap usage.
Could you check this in your BOINC client preferences?

Besides that, you could try VirtualBox 6.0.4.
At least on Linux it runs very stably.


I reserve 12 GB for a write-cache, leaving only 20 GB. Linux is supposed to free the cache memory for use by applications when needed, but maybe it doesn't.

I think it will at least try to free the RAM but maybe this takes too long and the BOINC client meanwhile stops the VMs.
That would make the situation worse.

Did you also adjust swappiness?
And, do you have enough swap space?
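
Both values can be read on a Linux host from /proc. The sketch below assumes the usual "Key:   value kB" layout of /proc/meminfo; the function names are hypothetical.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_report() -> str:
    """Read the live values; this only works on a Linux host."""
    swappiness = Path("/proc/sys/vm/swappiness").read_text().strip()
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    return (
        f"swappiness={swappiness} "
        f"SwapTotal={info.get('SwapTotal', 0)} kB "
        f"SwapFree={info.get('SwapFree', 0)} kB"
    )

# On a Linux host: print(swap_report())
```

If SwapFree stays close to SwapTotal while the VMs run, swap space is unlikely to be the bottleneck.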
ID: 38537
Jim1348

Joined: 15 Nov 14
Posts: 570
Credit: 18,005,801
RAC: 22,619
Message 38539 - Posted: 5 Apr 2019, 14:53:20 UTC - in response to Message 38537.  
Last modified: 5 Apr 2019, 14:53:45 UTC

Maybe the BOINC client hit a limit, e.g. RAM/swap usage.
Could you check this in your BOINC client preferences?

I always set them to 95%.


I think it will at least try to free the RAM but maybe this takes too long and the BOINC client meanwhile stops the VMs.
That would make the situation worse.

I am reducing the write-cache to 4 GB, and also limiting CMS to four at a time, which is what I want anyway.

Did you also adjust swappiness?
And, do you have enough swap space?

I always set swappiness to 0, but I create 2 GB swap space anyway. It is rarely used.

I expect that will fix it. I just use the VBox from the default repository for convenience. It should work.
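
One way to limit the number of concurrent tasks is an app_config.xml file with a max_concurrent entry in the BOINC project directory. The sketch below only generates the file content; the application name "CMS" is an assumption and should be checked against the name the project actually uses.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml content limiting one app to max_concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# "CMS" is an assumed application name, not confirmed in this thread.
print(build_app_config("CMS", 4))
```

The resulting text would be saved as app_config.xml in the project's folder, after which the client has to re-read its config files.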
ID: 38539
maeax

Joined: 2 May 07
Posts: 1310
Credit: 39,791,492
RAC: 18,247
Message 45054 - Posted: 15 Jun 2021, 9:57:09 UTC

Heartbeat error on a CMS task.
Server state: Completed
Outcome: Computation error
Client state: Computation error
Exit status: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
Computer ID: 10409041
Run time: 11 hours 6 min 52 sec
CPU time: 10 hours 48 min 43 sec
Validation state: Invalid

The server disk (ATA) is too slow to run two CMS tasks in parallel.
2021-06-15 09:22:09 (4980): Status Report: Job Duration: '64800.000000'
2021-06-15 09:22:09 (4980): Status Report: Elapsed Time: '36000.000000'
2021-06-15 09:22:09 (4980): Status Report: CPU Time: '35291.765625'
2021-06-15 10:22:15 (4980): VM Heartbeat file specified, but missing heartbeat.
2021-06-15 10:22:15 (4980): Powering off VM.
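
The time between the last status report and the missing heartbeat can be computed from the timestamps above. This is a small sketch that assumes every log line starts with a 19-character "YYYY-MM-DD HH:MM:SS" prefix; the helper name is hypothetical.

```python
from datetime import datetime


def log_time(line: str) -> datetime:
    """Parse the 'YYYY-MM-DD HH:MM:SS' prefix of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


last_report = "2021-06-15 09:22:09 (4980): Status Report: CPU Time: '35291.765625'"
missed = "2021-06-15 10:22:15 (4980): VM Heartbeat file specified, but missing heartbeat."

gap = (log_time(missed) - log_time(last_report)).total_seconds()
print(gap)  # 3606.0
```

So the VM was powered off roughly one hour after the last status report.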

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10409041

This computer (AMD A8-3850 APU with Radeon(tm) HD Graphics [Family 18 Model 1 Stepping 0])
has been running well for 16 years now with ONE CMS task and other projects!
ID: 45054
maeax

Joined: 2 May 07
Posts: 1310
Credit: 39,791,492
RAC: 18,247
Message 45065 - Posted: 17 Jun 2021, 8:34:41 UTC - in response to Message 45054.  


has been running well for 16 years now with ONE CMS task and other projects!

Correction!
The next two CMS tasks also stopped with a heartbeat error.
Only one task in the last three days finished successfully.
I have stopped this computer for now, until there is a solution.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10409041
This PC has a direct connection to a LAN port of the router.
2021-06-15 12:02:46 (272): Detected: VirtualBox VboxManage Interface (Version: 6.1.12)
2021-06-15 12:02:46 (272): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
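
The log line above says a file named 'heartbeat' is checked every 1200 seconds. A check of that kind could look like the sketch below. It is an illustration only, not vboxwrapper's actual code: it assumes staleness is judged by the file's modification time, and both function names are hypothetical.

```python
import os
import time


def heartbeat_is_stale(last_update: float, now: float, interval: float = 1200.0) -> bool:
    """True if more than `interval` seconds passed since the last update."""
    return (now - last_update) > interval


def check_heartbeat_file(path: str, interval: float = 1200.0) -> bool:
    """True if the heartbeat file is missing or has not been touched in time."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing file counts as a missed heartbeat
    return heartbeat_is_stale(mtime, time.time(), interval)
```

Under this assumption, a guest that cannot write to a slow disk in time would look the same to the host as a guest that has hung.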
ID: 45065
Jim1348

Joined: 15 Nov 14
Posts: 570
Credit: 18,005,801
RAC: 22,619
Message 45066 - Posted: 17 Jun 2021, 17:34:31 UTC - in response to Message 45065.  

The next two CMS tasks also stopped with a heartbeat error.
Only one task in the last three days finished successfully.

On Windows 10, you had better give up on VirtualBox 6.x.x.
Try 5.2.44.
https://www.virtualbox.org/wiki/Download_Old_Builds_5_2
ID: 45066
maeax

Joined: 2 May 07
Posts: 1310
Credit: 39,791,492
RAC: 18,247
Message 45067 - Posted: 18 Jun 2021, 8:49:49 UTC - in response to Message 45066.  

I installed VirtualBox 5.2.44 on this 16-year-old PC yesterday.
The first CMS task has just finished successfully.
I will see over the next few days whether two CMS tasks are possible.
The disk is only ATA, not SATA.
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10409041
Maybe this is the reason vboxwrapper reports a heartbeat error.
ID: 45067
Jim1348

Joined: 15 Nov 14
Posts: 570
Credit: 18,005,801
RAC: 22,619
Message 45072 - Posted: 18 Jun 2021, 23:08:12 UTC - in response to Message 45067.  

The disk is only ATA, not SATA.
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10409041
Maybe this is the reason vboxwrapper reports a heartbeat error.

That is quite possible too. The last time I used an ATA drive, I had to use a ramdisk to eliminate the errors on CPDN.
The mechanical disk could not keep up with the writes.
ID: 45072

©2021 CERN