Message boards : CMS Application : CMS&Atlas host disk problem

broz69

Joined: 28 Nov 08
Posts: 30
Credit: 14,614,135
RAC: 17,776
Message 41834 - Posted: 6 Mar 2020, 16:06:07 UTC

Hello,

I still have problems with LHC@Home, mainly with Atlas and CMS VBox tasks. The problem lies in how the combination of Boinc and LHC tasks works with different disks.
Computer ID: 10570926
8 processors allowed (meaning 8 simultaneous tasks with single processor)

Here's what I've found so far. I broke the whole process down into steps:
1. Boinc contacts the server and downloads tasks (in case of LHC it downloads many tasks - like 8 or so - at the same time)
2. Boinc starts the task or tasks (depending if they are multi-threaded or not)
3. the LHC task first copies the disk image to the BOINC/slots/ directory
4. after image is copied it registers a VM in VBox Manager and sets up parameters (base memory, processors, attaches disks etc)
5. VM starts the boot-up process
6. VM starts and does its work
7. VM finishes the work and shuts down
8. after the VM shuts down there is an extra 5-6 min during which I don't know exactly what's going on (there's very little CPU activity and no disk or network activity... I think some kind of result preparation?)
9. then follows VM deregistration from VBox Manager and a computational error comes up in Boinc Manager (this error is not so important right now)
10. reporting result to the LHC server

In my case:
step 1 is not critical as the internet connection is slower than disk data speed
step 2 - after jobs downloaded Boinc Manager started 8 CMS tasks at the same time (see below for detailed analysis)
Atlas disk image is around 2,54 GB, CMS disk image is around 2,8 GB.
Starting eight Atlas or CMS jobs at the same time is not advisable in my case, as writing 8 VM disk images to the BOINC/slots/ directory completely overwhelms the disk for a long time. The disk cannot handle so many write requests.
As seen below, different disks cope with queued writes differently. SSDs and even SD cards can handle 8 concurrent write requests, but HDDs cannot.
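As a rough sanity check, the expected copy time can be estimated from the CMS image size quoted above and the write speeds reported further down in this post. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB), purely sequential writes, and no other disk activity, so real times on an HDD serving 8 interleaved writers will tend to be worse.

```python
# Back-of-the-envelope estimate: time to write N VM disk images.
# Assumptions (not from the project): 1 GB = 1000 MB, sequential writes only.
IMAGE_GB = 2.8      # CMS disk image size quoted in this post
N_TASKS = 8         # tasks started at the same time
MB_PER_GB = 1000    # assumption: decimal units


def copy_time_seconds(n_images, image_gb, write_mb_per_s):
    """Return the ideal time in seconds to write n_images of image_gb each."""
    return n_images * image_gb * MB_PER_GB / write_mb_per_s


for label, speed in [("HDD, 60 MB/s", 60), ("HDD, 80 MB/s", 80), ("SSD, 500 MB/s", 500)]:
    secs = copy_time_seconds(N_TASKS, IMAGE_GB, speed)
    print(f"{label}: {secs:.0f} s ({secs / 60:.1f} min)")
```

Under these assumptions the HDD needs roughly 4.7 to 6.2 minutes and the SSD about 45 seconds, which is in the same range as the 5:30 to 6:15 min and 60-75 s of 100% active time measured on the actual disks.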
Is it possible to do one of the following:
    increase the time-out during the VM boot-up process. When the VM starts (step 5 above) it looks for a boot disk. If the disk is not there or for some reason not yet ready (host disk still busy with write operations), the VM ends up in the rescue console with the error message "Unable to mount root device /dev/disk/by-label/UROOT!" A longer time-out would avoid this situation.


or

    introduce a parameter and a mechanism in Boinc that would start VBox tasks with a certain delay (step 2 above). This would allow the host disk to finish its write operations, so that the boot disk would be ready by the time the VM starts.
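The first option amounts to "keep polling for the device until a deadline passes" instead of giving up early. The sketch below only illustrates that general idea in Python; it is not how the CMS or Atlas VM actually waits for its root device, and `wait_for_path` and its parameters are hypothetical names chosen for the illustration.

```python
import os
import time


def wait_for_path(path, timeout_s, poll_s=1.0):
    """Poll until `path` exists or `timeout_s` seconds have passed.

    Returns True if the path appeared in time, False on time-out.
    Hypothetical helper: it illustrates a longer time-out, nothing more.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With a short `timeout_s` a slow host disk leads to a failure (here: `False`, in the VM: the rescue console); raising the time-out lets the same check succeed once the host has finished writing.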



The situation described here occurs not only when starting new jobs but also when Boinc switches between jobs. It arose when Boinc switched from a Milkyway@home N-Body Simulation 1.76 (mt) 8 CPU task to 8 single CPU CMS tasks.
It also occurs when Boinc switches from 8 single CPU CMS tasks to one 8 CPU task. In this case VBox needs to pause 8 VMs and again the disk is active 100% of the time for a long period.

The CMS tasks below are listed in the order in which Boinc started them.

CMS - 2 VMs started at the same time, 2 failed tasks - BOINC/slots/ on HDD
HDD: WD7500AADS-00M2B0 (SATA2, 3Gbps), write speed 60MBps
CMS_3945628_1583445017.758018_0
CMS_3945649_1583445017.880033_0

CMS - 8 VMs started at the same time, 5 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 5:30min 100% active time
CMS_66148_1583476897.820370_0 - run
CMS_77654_1583478099.044842_0 - failed
CMS_95128_1583479600.855668_0 - failed
CMS_136087_1583483505.320420_0 - failed
CMS_121063_1583482003.641115_0 - failed
CMS_153320_1583485307.708883_0 - failed
CMS_199719_1583489516.247794_0 - run
CMS_180480_1583487712.076585_0 - run
Failed VMs ended up in rescue console.
Failed VMs were later reset in VBox Manager and ran for around 18 min. 5 minutes later (23 min from start) there was a Computation Error in Boinc Manager.

CMS - 8 VMs started at the same time, 7 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 6:15min 100% active time
CMS_139026_1583483805.806430_0 - failed
CMS_196459_1583489215.997645_0 - failed
CMS_196461_1583489216.009555_0 - failed
CMS_124781_1583482304.049943_0 - failed
CMS_150047_1583485007.485152_0 - failed
CMS_139018_1583483805.751000_0 - failed
CMS_150051_1583485007.517990_0 - failed
CMS_171598_1583486810.747297_0 - run
Failed VMs ended up in rescue console and were left in that state.
After 20min from start they were cancelled (by Boinc?LHC?) and after 25min from start there were Computation Errors in Boinc Manager.

CMS - 8 VMs started at the same time, 8 non failed tasks - BOINC/slots/ on USB SD/MMC card
SD card is SanDisk Extreme PRO microSDXC UHS-I, 128GB (up-to 90MBps write speed, up-to 170MBps read speed)
USB 3.0 (5Gbps), the card didn't show up in Windows Task Manager so I couldn't measure times or write speeds.

CMS_3974215_1583448025.207379_0
CMS_3974237_1583448025.327024_0
CMS_3922210_1583442611.886740_0
CMS_3922207_1583442611.871799_0
CMS_3971270_1583447724.078240_0
CMS_3922219_1583442611.906543_0
CMS_3910580_1583441410.016243_0
CMS_113714_1583481402.844904_0
The start-up sequence of the above tasks was linear: tasks started one after the other, with approx. 60 sec from one start to the next.

CMS - 8 VMs started at the same time, 8 non failed - BOINC/slots/ on SATA3 SSD
SSD: Samsung SSD 850 EVO (SATA3, 6Gbps), write speed 500MBps, around 60-75s 100% active time
CMS_100680_1583480201.538717_0
CMS_132622_1583483204.854043_0
CMS_121061_1583482003.629244_0
CMS_139006_1583483805.668606_0
CMS_130340_1583482904.579675_0
CMS_136085_1583483505.295626_0
CMS_117122_1583481703.003336_0
CMS_95126_1583479600.843608_0
The tasks ran for around 13 min and then the VMs powered off.
After around 18 min from start there was a Computation Error in Boinc Manager.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,942,387
RAC: 137,359
Message 41856 - Posted: 9 Mar 2020, 12:42:44 UTC - in response to Message 41834.  

To avoid overloading your disk IO you should:

1. Not start lots of VMs concurrently.
Instead start only one at a time.

2. Not stop lots of VMs concurrently.
Instead stop only one at a time.

3. Avoid context switches that automatically lead to (1.) or (2.).


Since the BOINC client doesn't support it, (1.) and (2.) have to be done manually or by a (self made) script.
The client's behavior in case of (3.) might be better if multithreaded tasks, e.g. N-Body Simulation, are configured to use fewer cores.

A more complex solution would be to run separate BOINC clients for load critical projects.
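
A self made script for (1.) could look roughly like the sketch below. It assumes the tasks have first been suspended (e.g. in Boinc Manager) and then resumes them one at a time through `boinccmd --task <url> <task_name> resume`. The project URL, the 6 minute delay and the function names are assumptions for illustration; check them against your own client before use. The default is a dry run that only prints the planned commands.

```python
import subprocess
import time

# Assumptions for illustration: adjust to your own setup.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"  # assumed project URL
STAGGER_SECONDS = 360  # assumed delay; roughly one HDD image-copy burst


def resume_command(task_name, project_url=PROJECT_URL):
    """Build the boinccmd call that resumes one suspended task."""
    return ["boinccmd", "--task", project_url, task_name, "resume"]


def staggered_plan(task_names, delay=STAGGER_SECONDS):
    """Return (start_offset_seconds, command) pairs, one task per slot."""
    return [(i * delay, resume_command(name)) for i, name in enumerate(task_names)]


def run_plan(plan, dry_run=True):
    """Execute the plan; with dry_run=True only print what would be done."""
    previous = 0
    for offset, cmd in plan:
        if dry_run:
            print(f"t+{offset}s: {' '.join(cmd)}")
        else:
            time.sleep(offset - previous)  # wait for the previous VM to settle
            subprocess.run(cmd, check=True)
        previous = offset


if __name__ == "__main__":
    tasks = ["CMS_66148_1583476897.820370_0", "CMS_77654_1583478099.044842_0"]
    run_plan(staggered_plan(tasks))
```

The same pattern with `suspend` instead of `resume` would cover (2.). A fixed delay is the simplest choice; waiting until the host disk is idle again would be more robust but needs platform-specific monitoring.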



©2024 CERN