Message boards : Number crunching : CMS WU's died after pausing.

marmot
Joined: 5 Nov 15
Posts: 119
Credit: 5,250,392
RAC: 1
Message 32602 - Posted: 3 Oct 2017, 7:58:12 UTC

I was running 8 single-core CMS WUs and had some ATLAS work downloaded. Once one CMS WU completed, the other 7 paused and an 8-core ATLAS task started.

The CMS work needed to complete in order to get benchmarking data today, so I paused all the ATLAS WUs, and the CMS wrappers started coming back into RAM.

The problem was that, in VBox Manager, 6 of the 7 CMS VMs showed as disconnected (exclamation mark icons).

After a few minutes, the newest CMS WU was the only one still running and the others had died. I aborted that last one and unpaused the ATLAS WUs, which had survived being paused.

My BOINC installation is not in the default location. It is in D:\BOINC on a Windows 8.1 machine so that it can be easily ported to other computers. VBox is installed in the default directory.


I'll have to remember never to pause WUs, but this problem happened last year on vLHC work and I didn't expect the same pausing problem to still occur. At least the ATLAS WU was paused for a few minutes without incident.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 465
Credit: 3,327,099
RAC: 1,438
Message 32603 - Posted: 3 Oct 2017, 8:07:53 UTC

CMS tasks don't like being paused for a longer period, let's say an hour.
In the best case, a new job starts inside the VM and some time is lost on the incomplete job.
In your case, I suppose all CMS tasks were paused and BOINC wanted to save their states to disk.
BOINC did not manage to do that within 1 minute, so several VMs, all but one, got a stopped state and were not able to restore properly.
marmot
Joined: 5 Nov 15
Posts: 119
Credit: 5,250,392
RAC: 1
Message 32606 - Posted: 3 Oct 2017, 8:53:09 UTC - in response to Message 32603.  

CMS tasks don't like being paused for a longer period, let's say an hour.
In the best case, a new job starts inside the VM and some time is lost on the incomplete job.
In your case, I suppose all CMS tasks were paused and BOINC wanted to save their states to disk.
BOINC did not manage to do that within 1 minute, so several VMs, all but one, got a stopped state and were not able to restore properly.



Oh, I remember this better now.

There was the 1-minute limit, and my old computer had IDE drives with a 1998 manufacturing date.

This is a newer machine with a SAS RAID controller and 90x better access rates (3GB rate Hitachi 143 GB SAS).
So drive communication speed shouldn't be what prevents the states from being saved in under 60 seconds.

There were 4 of 16 cores available for VBox to perform the save, and the other 12 cores were running WUs with low-intensity hard drive I/O.

ATLAS had run for 3+ hours, so the CMS tasks were all paused for 3+ hours.

So I need to try to recreate the problem if this bothers me or... just not worry about it.
marmot
Joined: 5 Nov 15
Posts: 119
Credit: 5,250,392
RAC: 1
Message 32682 - Posted: 8 Oct 2017, 3:19:16 UTC - in response to Message 32603.  

Watched 8 WUs saving to disk after pausing them in the BOINC client.

Even with the new RAID 5 array, one of the WUs took longer than a minute to save and errored out while 24 Universe WUs were in RAM.

So even with equipment from 3 years ago, there is the potential to miss that deadline.
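
A rough way to see why the deadline can be missed: if saving a VM's state means writing roughly its RAM to disk, and all the saves start at once, the array has to sustain the combined write rate for the full 60 seconds. The sketch below is only back-of-the-envelope arithmetic. The 2048 MB per VM figure is an assumption for illustration, not a value from this thread, and real saves also compete with other disk I/O.

```python
def required_write_rate_mb_s(vm_count, vm_ram_mb, deadline_s=60.0):
    """Sustained write rate (MB/s) needed to save every VM's RAM before the deadline.

    Assumes all saves start together and each save writes about vm_ram_mb.
    """
    if deadline_s <= 0:
        raise ValueError("deadline_s must be positive")
    return vm_count * vm_ram_mb / deadline_s


def max_vms_per_batch(disk_write_mb_s, vm_ram_mb, deadline_s=60.0):
    """Largest number of VMs that could be saved together within the deadline."""
    if vm_ram_mb <= 0:
        raise ValueError("vm_ram_mb must be positive")
    return int(disk_write_mb_s * deadline_s // vm_ram_mb)


# Hypothetical numbers: 8 VMs with 2048 MB each, 60-second limit.
rate = required_write_rate_mb_s(8, 2048)
print(f"needed: {rate:.0f} MB/s sustained")

# If the loaded array only manages 100 MB/s (again an assumed figure):
print("VMs per batch:", max_vms_per_batch(100, 2048))
```

With numbers in that range, two VMs at a time fits the limit while eight at once does not.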
marmot
Joined: 5 Nov 15
Posts: 119
Credit: 5,250,392
RAC: 1
Message 32816 - Posted: 13 Oct 2017, 0:45:20 UTC - in response to Message 32682.  

The Storage Spaces software RAID driver used by MS benchmarks very nicely, but once under a full load of LHC VMs (Rosetta and POGS WUs also cause issues) the OS can't keep up with the parity writes and performance declines sharply.

Reconfigured this machine to a 3-drive RAID 0 on the LSI hardware RAID controller and it's performing immensely better.

Still, pausing 8 VMs all at once doesn't make the 60-second time limit. Manually suspending WUs 2 at a time is still required.
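
The two-at-a-time suspend could be scripted rather than clicked through by hand. This is a minimal sketch, assuming `boinccmd` is on the PATH and can reach the client (it may need to be run from the BOINC data directory, or given a host and password). The project URL, the task names and the 90-second wait are placeholders. The real URL and names would come from `boinccmd --get_tasks`.

```python
import subprocess
import time


def batches(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def suspend_in_batches(project_url, task_names, batch_size=2, wait_s=90.0,
                       run=subprocess.run, sleep=time.sleep):
    """Suspend tasks a few at a time, waiting between groups so each VM
    has the disk mostly to itself while it saves its state.

    `run` and `sleep` are injectable so the logic can be dry-run tested.
    """
    groups = batches(list(task_names), batch_size)
    for index, group in enumerate(groups):
        for name in group:
            # boinccmd --task <project URL> <task name> suspend
            run(["boinccmd", "--task", project_url, name, "suspend"], check=True)
        if index < len(groups) - 1:
            sleep(wait_s)  # no need to wait after the final group
    return groups


if __name__ == "__main__":
    # Placeholder URL and task names, for illustration only.
    url = "https://example.org/project/"
    names = ["cms_task_1", "cms_task_2", "cms_task_3", "cms_task_4"]
    # Dry run: print the commands instead of executing them.
    suspend_in_batches(url, names, wait_s=0,
                       run=lambda cmd, check: print(" ".join(cmd)))
```

The wait only spaces out the groups. It does not confirm that a save has finished, so the interval has to be chosen generously for the disk in question.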

It really would be nice if the 60-second limit were raised to 180.



©2018 CERN