Message boards :
Number crunching :
CMS WU's died after pausing.
Author | Message |
---|---|
Send message Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
Was running 8 single-core CMS WUs and had some ATLAS downloads waiting; once a CMS WU completed, the other 7 paused and an 8-core ATLAS task started. The CMS work needed to complete in order to get benchmarking data today, so I paused all the ATLAS WUs, and the CMS wrappers started coming back into RAM. The problem was that in VBox Manager, 6 of the 7 CMS VMs showed as disconnected (exclamation mark icons). After a few minutes, the newest CMS WU was the only one still running and the others died. I aborted the last one and unpaused the ATLAS WUs, which survived being paused. My BOINC installation is not in the default location; it is on D:\BOINC on a Windows 8.1 OS so that it can be easily ported to other computers. VBox is installed in the default directory. I'll have to remember never to pause WUs, but this problem also happened last year on vLHC work. Didn't expect the same problem with pausing to still occur. At least the ATLAS WU was paused for a few minutes without incident. |
Send message Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
CMS tasks don't like being paused for a longer period, say an hour. In the best case a new job starts inside the VM and some time is lost for the incomplete job. In your case, I suppose all CMS tasks paused at once and BOINC wanted to save their states to disk. BOINC did not manage to do that within 1 minute, so all the VMs except one got a stopped state and were not able to restore properly. |
Send message Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
CMS tasks don't like being paused for a longer period, say an hour. Oh, I remember this better now: the 1-minute limit. My old computer had IDE drives with a 1998 manufacture date. This is a newer machine with a SAS RAID controller and roughly 90x better access rates (3 Gb/s Hitachi 143 GB SAS drives). It shouldn't be a drive communication speed issue preventing the save states from finishing in under 60 seconds. There were 4 of 16 cores available for VBox to perform the save, and the other 12 cores were running low-intensity hard-drive-I/O WUs. ATLAS had run for 3+ hours, so the CMS tasks were all paused 3+ hours. So I need to try to recreate the problem if this bothers me, or... just not worry about it. |
Send message Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
Watched 8 WUs saving to disk after pausing them in the BOINC client. Even with the new RAID 5 array, one of the WUs took longer than a minute to save and errored out while 24 Universe WUs were in RAM. So even with equipment from 3 years ago, there is potential to miss that deadline. |
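One way to check whether the 60-second budget is realistic on a given disk setup is to time a save-state by hand. A minimal sketch, assuming `VBoxManage` is on the PATH (on Windows it lives in the VirtualBox install directory); the VM names are whatever vboxwrapper created on your host:

```shell
# Sketch: measure how long each running VM takes to save its state.
# Assumes VBoxManage is on PATH; VM names come from "list runningvms".
time_savestates() {
    VBoxManage list runningvms | awk -F'"' '{print $2}' | while read -r vm; do
        start=$(date +%s)
        VBoxManage controlvm "$vm" savestate
        echo "$vm saved in $(( $(date +%s) - start ))s"
    done
}
```

Running this while the drives are otherwise busy (e.g. with other I/O-heavy WUs loaded) should show whether any single save already approaches the one-minute mark before eight of them contend for the same array.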
Send message Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
The Storage Spaces software RAID driver used by MS benchmarks very nicely, but once under a full load of LHC VMs (Rosetta and POGS WUs also cause issues) the OS can't keep up with the parity writes and performance declines sharply. Reconfigured this machine as a 3-drive RAID 0 on the LSI hardware RAID controller and it's performing immensely better. Still, pausing 8 VMs all at once doesn't make the 60-second time limit; manually suspending WUs 2 at a time is still required. It would really be nice if the 60-second limit were raised to 180. |
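Until the wrapper's limit changes, the suspend-two-at-a-time routine can be scripted rather than done by hand in the manager. A rough sketch using `boinccmd`; the project URL and the 90-second settle time are assumptions to tune for your host:

```shell
# Sketch: suspend tasks two at a time so each pair can finish writing
# its saved state before the next pair starts.
# PROJECT_URL and the sleep interval are assumptions -- adjust as needed.
PROJECT_URL="https://lhcathome.cern.ch/lhcathome/"

# Extract task names from the "   name: ..." lines of boinccmd's report.
task_names() {
    boinccmd --get_tasks | awk -F': ' '$1 ~ /^ *name$/ {print $2}'
}

suspend_in_pairs() {
    n=0
    for t in $(task_names); do
        boinccmd --task "$PROJECT_URL" "$t" suspend
        n=$((n + 1))
        if [ $((n % 2)) -eq 0 ]; then
            sleep 90   # let this pair finish saving state before the next
        fi
    done
}
```

The same loop shape works for `resume` afterwards; staggering the resumes also avoids eight VMs all reloading their saved state from the array at once.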
©2024 CERN