Message boards : CMS Application : Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT

Jim1348

Joined: 15 Nov 14
Posts: 416
Credit: 11,880,818
RAC: 2,982
Message 38515 - Posted: 2 Apr 2019, 22:39:51 UTC
Last modified: 2 Apr 2019, 22:43:40 UTC

After running perfectly for a week, CMS is now throwing only errors.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10590338

This is an Ubuntu 16.04 machine with VirtualBox 5.1.38. It runs 24/7 and has plenty (32 GB) of memory.
CMS is the only CPU job running, on seven cores of an i7-3770. I will place it on no new work until fixed.

NOTE: All of these tasks failed at exactly the same time, after running for various times ranging from 1/2 hour to 5 1/2 hours. There must be a common cause.
ID: 38515
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,048,964
RAC: 106,516
Message 38520 - Posted: 3 Apr 2019, 6:11:02 UTC - in response to Message 38515.  

Strange.
All VMs crashed within 1 s.

The logs show the usual crash info but I can't find an entry that points out the original reason for the crash.
Checked my own systems, a few from other volunteers and the CMS web stats but found no obvious errors.

I guess it was a local problem.
Before you start again you may check/clean the VBox environment.
ID: 38520
Jim1348

Joined: 15 Nov 14
Posts: 416
Credit: 11,880,818
RAC: 2,982
Message 38524 - Posted: 3 Apr 2019, 10:41:33 UTC - in response to Message 38520.  
Last modified: 3 Apr 2019, 11:25:21 UTC

I guess it was a local problem.
Before you start again you may check/clean the VBox environment.

Thanks. That was my feeling too, but I don't know what it could be. There were two other CMS tasks waiting to run in the buffer. One has now completed successfully after 12 hours, and the other will complete in a few more hours. Whatever it was has gone away.

EDIT: All I found were some empty slots, and they disappeared when I did an Ubuntu update. Otherwise BOINC looks normal to me.
ID: 38524
Jim1348

Joined: 15 Nov 14
Posts: 416
Credit: 11,880,818
RAC: 2,982
Message 38535 - Posted: 5 Apr 2019, 13:18:52 UTC
Last modified: 5 Apr 2019, 13:55:48 UTC

It happened again. All eight running CMS tasks failed at exactly the same time.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10590338

Maybe bad memory, or worse. I will check it later, but in the meantime I will just run Cosmology, which works fine on that machine.
One physical process is as good as another in a pinch.

EDIT: It seems to happen only when I am running eight CMS tasks at once. Even though I have 32 GB of memory, I reserve 12 GB for a write-cache, leaving only 20 GB. Linux is supposed to free the cache memory for use by applications when needed, but maybe it doesn't.
If I limit CMS to four work units at once, it might be more reliable.
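
Capping the number of simultaneous tasks for one application is normally done with an `app_config.xml` file placed in the project's directory. The following is a minimal Python sketch that builds such a file. The app name `CMS` and the helper name `build_app_config` are assumptions for illustration; the real app name should be checked against the client's own records before use.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Build the text of an app_config.xml limiting one app's concurrency.

    app_name is an assumption here; it must match the short application
    name the BOINC client uses for the project.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; saving it as app_config.xml in the project directory
    # is left to the reader.
    print(build_app_config("CMS", 4))
```

The client has to re-read its config files (or be restarted) before a new limit takes effect.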
ID: 38535
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,048,964
RAC: 106,516
Message 38537 - Posted: 5 Apr 2019, 14:16:53 UTC - in response to Message 38535.  

All of that began around 08:10:30 when the BOINC client sent a stop signal to all running VMs.
2019-04-05 08:10:31 (30603): VM state change detected. (old = 'running', new = 'paused')
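
Lines like the one above can be picked out of a task's log mechanically. The following is a minimal Python sketch, assuming the format of that single quoted line holds for other state-change messages. The function name `parse_state_change` is hypothetical, and treating the number in parentheses as a process ID is an assumption.

```python
import re
from datetime import datetime

# Pattern inferred from the one log line quoted in this thread; other
# wrapper messages may be formatted differently.
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<pid>\d+)\): VM state change detected\. "
    r"\(old = '(?P<old>[^']*)', new = '(?P<new>[^']*)'\)"
)


def parse_state_change(line):
    """Return (timestamp, pid, old_state, new_state), or None if no match."""
    match = STATE_CHANGE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    # "pid" is an assumed meaning for the number in parentheses.
    return timestamp, int(match.group("pid")), match.group("old"), match.group("new")


if __name__ == "__main__":
    sample = ("2019-04-05 08:10:31 (30603): VM state change detected. "
              "(old = 'running', new = 'paused')")
    print(parse_state_change(sample))
```

Running this over the logs of all failed tasks and comparing the timestamps would show whether every VM was paused in the same second.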

The crash may have happened because the system tried to write all VM images to disk concurrently.

Maybe the BOINC client hit a limit, e.g. RAM/swap usage.
Could you check this in your BOINC client preferences?
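
On a Linux host, locally set limits of this kind are typically stored in `global_prefs_override.xml` in the BOINC data directory. The following is a minimal Python sketch for reading the memory-related percentages from that file. The path `/var/lib/boinc-client` is an assumption (the usual location for the Ubuntu package) and the helper name `read_memory_limits` is hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# Percent-valued memory settings from BOINC's global preferences:
# RAM while the computer is in use, RAM while idle, and swap space.
MEMORY_TAGS = ("ram_max_used_busy_pct", "ram_max_used_idle_pct", "vm_max_used_pct")


def read_memory_limits(xml_text):
    """Return {tag: percent} for the memory limits present in xml_text."""
    root = ET.fromstring(xml_text)
    limits = {}
    for tag in MEMORY_TAGS:
        node = next(root.iter(tag), None)
        if node is not None and node.text:
            limits[tag] = float(node.text)
    return limits


if __name__ == "__main__":
    # Assumed path; adjust to the local BOINC data directory.
    path = "/var/lib/boinc-client/global_prefs_override.xml"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            print(read_memory_limits(handle.read()))
    else:
        print("no override file found at", path)
```

If no override file exists, the client uses the preferences set on the project website instead.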

Besides that, you may try VirtualBox 6.0.4.
At least on Linux it runs very stably.


I reserve 12 GB for a write-cache, leaving only 20 GB. Linux is supposed to free the cache memory for use by applications when needed, but maybe it doesn't.

I think it will at least try to free the RAM but maybe this takes too long and the BOINC client meanwhile stops the VMs.
That would make the situation worse.

Did you also adjust swappiness?
And, do you have enough swap space?
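
Both values can be read from `/proc` on Linux. The following is a minimal Python sketch, assuming a standard `/proc/meminfo` layout; the helper names `parse_meminfo` and `swap_summary` are hypothetical.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few count-style fields carry no unit.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_summary(meminfo_text, swappiness_text):
    """Summarise swappiness and swap/available memory in GiB."""
    info = parse_meminfo(meminfo_text)
    kib_per_gib = 1024 ** 2
    return {
        "swappiness": int(swappiness_text.strip()),
        "swap_total_gib": info.get("SwapTotal", 0) / kib_per_gib,
        "swap_free_gib": info.get("SwapFree", 0) / kib_per_gib,
        "mem_available_gib": info.get("MemAvailable", 0) / kib_per_gib,
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    swappiness = Path("/proc/sys/vm/swappiness")
    if meminfo.exists() and swappiness.exists():
        print(swap_summary(meminfo.read_text(), swappiness.read_text()))
```

Sampling these values while the VMs are running shows how close the host gets to exhausting RAM and swap.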
ID: 38537
Jim1348

Joined: 15 Nov 14
Posts: 416
Credit: 11,880,818
RAC: 2,982
Message 38539 - Posted: 5 Apr 2019, 14:53:20 UTC - in response to Message 38537.  
Last modified: 5 Apr 2019, 14:53:45 UTC

Maybe the BOINC client hit a limit, e.g. RAM/swap usage.
Could you check this in your BOINC client preferences?

I always set them to 95%.


I think it will at least try to free the RAM but maybe this takes too long and the BOINC client meanwhile stops the VMs.
That would make the situation worse.

I am reducing the write-cache to 4 GB, and also limiting CMS to four at a time, which is what I want anyway.
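
As a rough check on these numbers, the sketch below works out how much RAM is left per VM if the write-cache is never released. Only the 32 GB, 12 GB, 4 GB, eight-task and four-task figures come from this thread. The helper name `per_vm_headroom_gb` is hypothetical, and the calculation ignores the OS, VirtualBox overhead and the memory each CMS VM is configured with, none of which is stated here.

```python
def per_vm_headroom_gb(total_gb, cache_gb, vm_count):
    """RAM available per VM, assuming the cache reservation is never freed."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return (total_gb - cache_gb) / vm_count


if __name__ == "__main__":
    # Old setup: 12 GB write-cache, eight CMS tasks at once.
    print(per_vm_headroom_gb(32, 12, 8))  # 2.5
    # New setup: 4 GB write-cache, four CMS tasks at once.
    print(per_vm_headroom_gb(32, 4, 4))   # 7.0
```

Under these assumptions the change raises the worst-case headroom from 2.5 GB to 7 GB per VM.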

Did you also adjust swappiness?
And, do you have enough swap space?

I always set swappiness to 0, but I create 2 GB swap space anyway. It is rarely used.

I expect that will fix it. I just use the VirtualBox version from the default repository for convenience. It should work.
ID: 38539
