Message boards : CMS Application : CMS&Atlas host disk problem

Joined: 28 Nov 08
Posts: 20
Credit: 9,778,530
RAC: 3,804
Message 41834 - Posted: 6 Mar 2020, 16:06:07 UTC


I still have problems with LHC@Home, mainly with Atlas and CMS VBox tasks. The problem lies in how the combination of Boinc and LHC tasks interacts with different disks.
Computer ID: 10570926
8 processors allowed (meaning 8 simultaneous single-processor tasks)

Here's what I've found so far. I broke the whole process down into steps:
1. Boinc contacts the server and downloads tasks (in the case of LHC it downloads many tasks, like 8 or so, at the same time)
2. Boinc starts the task or tasks (depending on whether they are multi-threaded or not)
3. the LHC task first copies the disk image to the BOINC/slots/ directory
4. after the image is copied it registers a VM in VBox Manager and sets up parameters (base memory, processors, attached disks etc.)
5. the VM starts the boot-up process
6. the VM starts and does its work
7. the VM finishes the work and shuts down
8. after the VM shuts down there are an extra 5-6 min during which I don't know exactly what is going on (there is very little CPU activity and no disk or ethernet activity... I think some kind of result preparation?)
9. then the VM is deregistered from VBox Manager and a computation error comes up in Boinc Manager (this error is not so important right now)
10. the result is reported to the LHC server

In my case:
step 1 is not critical as the internet connection is slower than disk data speed
step 2 - after the jobs were downloaded, Boinc Manager started 8 CMS tasks at the same time (see below for detailed analysis)
The Atlas disk image is around 2.54 GB, the CMS disk image is around 2.8 GB.
Starting eight Atlas or CMS jobs at the same time is not advisable in my case, as writing 8 VM disk images to the BOINC/slots/ directory completely overwhelms the disk for a long time. The disk cannot handle so many write requests.
As seen below, different disks cope differently with queued writes. SSDs and even SD cards can handle 8 concurrent write requests, but HDDs cannot.
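As a rough cross-check of the figures in this post, the time needed to write the images can be estimated from the image size and the sustained write speed. This is only a sketch under stated assumptions: decimal units (1 GB = 1000 MB), all eight copies sharing the disk's full sequential throughput with no seek penalty, and a function name (`copy_time_seconds`) made up for illustration.

```python
def copy_time_seconds(n_images, image_gb, write_mb_per_s):
    """Estimate the seconds needed to write n_images disk images.

    Assumes 1 GB = 1000 MB and that the disk sustains write_mb_per_s
    for the whole transfer, however many copies run concurrently.
    """
    total_mb = n_images * image_gb * 1000
    return total_mb / write_mb_per_s


CMS_IMAGE_GB = 2.8  # CMS image size quoted in this post

# Write speeds quoted in this post for the HDD and the SSD.
for label, speed in [("HDD at 60 MBps", 60),
                     ("HDD at 80 MBps", 80),
                     ("SSD at 500 MBps", 500)]:
    seconds = copy_time_seconds(8, CMS_IMAGE_GB, speed)
    print(f"{label}: about {seconds:.0f} s for 8 CMS images")
```

Under these assumptions eight CMS images need roughly 280 to 373 s on the HDD and about 45 s on the SSD, which is in the same range as the 5:30 to 6:15 min and 60-75 s of 100% active time measured on those disks.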
Is it possible to do one of the following:
    increase the time-out during the VM boot-up process. When the VM starts (step 5 above) it looks for a boot disk. If the disk is not there or for some reason not yet ready (host disk still busy with write operations), the VM ends up in the rescue console with the error message "Unable to mount root device /dev/disk/by-label/UROOT!" A longer time-out would avoid this situation.

    introduce a parameter and a mechanism in Boinc that would start VBox tasks with a certain delay (step 2 above). This would allow the host disk to finish its write operations, so that the boot disk is ready when the VM starts.

The situation described here occurs not only when starting new jobs but also when Boinc switches between jobs. It arose when Boinc switched from a Milkyway@home N-Body Simulation 1.76 (mt) 8 CPU task to 8 single-CPU CMS tasks.
It also occurs when Boinc switches from 8 single-CPU CMS tasks to one 8 CPU task. In this case VBox needs to pause 8 VMs and again the disk is 100% active for a long period.

The CMS tasks below are listed in the order in which Boinc started them.

CMS - 2 VMs started at the same time, 2 failed tasks - BOINC/slots/ on HDD
HDD: WD7500AADS-00M2B0 (SATA2, 3Gbps), write speed 60MBps

CMS - 8 VMs started at the same time, 5 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 5:30min 100% active time
CMS_66148_1583476897.820370_0 - run
CMS_77654_1583478099.044842_0 - failed
CMS_95128_1583479600.855668_0 - failed
CMS_136087_1583483505.320420_0 - failed
CMS_121063_1583482003.641115_0 - failed
CMS_153320_1583485307.708883_0 - failed
CMS_199719_1583489516.247794_0 - run
CMS_180480_1583487712.076585_0 - run
Failed VMs ended up in rescue console.
Failed VMs were later reset in VBox Manager and ran for around 18 min. 5 minutes later (23 min from start) a Computation error appeared in Boinc Manager.

CMS - 8 VMs started at the same time, 7 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 6:15min 100% active time
CMS_139026_1583483805.806430_0 - failed
CMS_196459_1583489215.997645_0 - failed
CMS_196461_1583489216.009555_0 - failed
CMS_124781_1583482304.049943_0 - failed
CMS_150047_1583485007.485152_0 - failed
CMS_139018_1583483805.751000_0 - failed
CMS_150051_1583485007.517990_0 - failed
CMS_171598_1583486810.747297_0 - run
Failed VMs ended up in rescue console and were left in that state.
20 min after start they were cancelled (by Boinc? LHC?) and 25 min after start there were Computation Errors in Boinc Manager.

CMS - 8 VMs started at the same time, 8 non failed tasks - BOINC/slots/ on USB SD/MMC card
SD card is SanDisk Extreme PRO microSDXC UHS-I, 128GB (up-to 90MBps write speed, up-to 170MBps read speed)
USB 3.0 (5Gbps), the card didn't show up in Windows Task Manager so I couldn't measure times or write speeds.

The start-up sequence of the above tasks was linear: tasks started one after the other, with approx. 60 sec from one start to the next.

CMS - 8 VMs started at the same time, 8 non failed - BOINC/slots/ on SATA3 SSD
SSD: Samsung SSD 850 EVO (SATA3, 6Gbps), write speed 500MBps, around 60-75s 100% active time
tasks run around 13 min and then VMs powered off
after around 18 min from start Computation Error in Boinc Manager

Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2141
Credit: 175,405,653
RAC: 104,270
Message 41856 - Posted: 9 Mar 2020, 12:42:44 UTC - in response to Message 41834.  

To avoid overloading your disk IO you should:

1. Not start lots of VMs concurrently.
Instead start only one at a time.

2. Not stop lots of VMs concurrently.
Instead stop only one at a time.

3. Avoid context switches that automatically lead to (1.) or (2.).

Since the BOINC client doesn't support it, (1.) and (2.) have to be done manually or by a (self made) script.
The client's behavior in case of (3.) might be better if multithreaded tasks, e.g. N-Body Simulation, are configured to use fewer cores.

A more complex solution would be to run separate BOINC clients for load critical projects.
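
A self made script for (1.) could look roughly like the sketch below. It is not an official tool and rests on several assumptions: the tasks were suspended beforehand, `boinccmd` is on the PATH and allowed to talk to the local client, and its `--task <project URL> <task name> resume` operation is used to release one task at a time. The helper names `build_resume_commands` and `staggered_run` are hypothetical, and the delay has to be tuned to the host disk.

```python
import subprocess
import time


def build_resume_commands(project_url, task_names, boinccmd="boinccmd"):
    """Build one 'boinccmd --task URL NAME resume' command per task."""
    return [[boinccmd, "--task", project_url, name, "resume"]
            for name in task_names]


def staggered_run(commands, delay_s, run=subprocess.run, sleep=time.sleep):
    """Run the commands one at a time, waiting delay_s between them.

    The wait gives the host disk time to finish copying one VM image
    before the next task starts. 'run' and 'sleep' can be replaced for
    a dry run without touching the BOINC client.
    """
    for index, command in enumerate(commands):
        if index > 0:
            sleep(delay_s)
        run(command, check=True)
```

For example, `staggered_run(build_resume_commands(url, names), delay_s=90)` would resume the listed tasks 90 seconds apart, where `url` is the project URL as shown by the client and `names` are the task names as listed in BOINC Manager. The same pattern with `suspend` instead of `resume` would cover (2.).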


©2023 CERN