CMS&Atlas host disk problem

broz69 (Joined: 28 Nov 08, Posts: 30, Credit: 14,859,718, RAC: 3,169)
Message 41834 - Posted: 6 Mar 2020, 16:06:07 UTC

Hello,

I still have problems with LHC@Home, mainly with Atlas and CMS VBox tasks. The problem lies in how the combination of Boinc and LHC tasks works with different disks.

Computer ID: 10570926
8 processors allowed (meaning 8 simultaneous single-processor tasks)

Here's what I've found so far. I broke the whole process down into steps:

1. Boinc contacts the server and downloads tasks (in the case of LHC it downloads many tasks, 8 or so, at the same time).
2. Boinc starts the task or tasks (depending on whether they are multi-threaded or not).
3. The LHC task first copies the disk image to the BOINC/slots/ directory.
4. After the image is copied, it registers a VM in VBox Manager and sets up parameters (base memory, processors, attached disks, etc.).
5. The VM starts the boot-up process.
6. The VM starts and does its work.
7. The VM finishes the work and shuts down.
8. After the VM shuts down there is an extra 5-6 min during which I don't know exactly what's going on (there's very little CPU activity and no disk or Ethernet activity; I think some kind of result preparation?).
9. Then follows VM deregistration from VBox Manager, and a computation error comes up in Boinc Manager (this error is not so important right now).
10. The result is reported to the LHC server.

In my case:
Step 1 is not critical, as the internet connection is slower than the disk data speed.
Step 2: after the jobs were downloaded, Boinc Manager started 8 CMS tasks at the same time (see below for a detailed analysis).

The Atlas disk image is around 2.54 GB and the CMS disk image is around 2.8 GB. Starting eight Atlas or CMS jobs at the same time is not advisable in my case, as writing 8 VM disk images to the BOINC/slots/ directory completely overwhelms the disk for a long time. The disk cannot handle so many write requests. As seen below, different disks have different write queues.
SSDs and even SD cards can handle 8 write requests, but HDDs cannot.

Is it possible to do one of the following?

Increase the time-out during the VM boot-up process. When the VM starts (step 5 above) it looks for a boot disk. If the disk is not there, or for some reason is not yet ready (host disk still busy with write operations), the VM ends up in the rescue console with the error message "Unable to mount root device /dev/disk/by-label/UROOT!" A longer time-out would avoid this situation.

Or introduce a parameter and a mechanism in Boinc that would start VBox tasks with a certain delay (step 2 above). This would allow the host disk to finish its write operations, so that the boot disk is ready when the VM starts.

The situation described here occurs not only when starting new jobs but also when Boinc switches between jobs. It arose when Boinc switched from a Milkyway@home N-Body Simulation 1.76 (mt) 8 CPU task to 8 single-CPU CMS tasks. It also occurs when Boinc switches from 8 single-CPU CMS tasks to one 8 CPU task. In this case VBox needs to pause 8 VMs, and again the disk is 100% active for a long period.

The CMS tasks below are listed in the order in which Boinc started them.

CMS - 2 VMs started at the same time, 2 failed tasks - BOINC/slots/ on HDD
HDD: WD7500AADS-00M2B0 (SATA2, 3Gbps), write speed 60MBps
CMS_3945628_1583445017.758018_0
CMS_3945649_1583445017.880033_0

CMS - 8 VMs started at the same time, 5 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 5:30min 100% active time
CMS_66148_1583476897.820370_0 - run
CMS_77654_1583478099.044842_0 - failed
CMS_95128_1583479600.855668_0 - failed
CMS_136087_1583483505.320420_0 - failed
CMS_121063_1583482003.641115_0 - failed
CMS_153320_1583485307.708883_0 - failed
CMS_199719_1583489516.247794_0 - run
CMS_180480_1583487712.076585_0 - run
Failed VMs ended up in the rescue console. They were later reset in VBox Manager and then ran for around 18 min.
5 minutes later (23 min from start) there was a Computation Error in Boinc Manager.

CMS - 8 VMs started at the same time, 7 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 6:15min 100% active time
CMS_139026_1583483805.806430_0 - failed
CMS_196459_1583489215.997645_0 - failed
CMS_196461_1583489216.009555_0 - failed
CMS_124781_1583482304.049943_0 - failed
CMS_150047_1583485007.485152_0 - failed
CMS_139018_1583483805.751000_0 - failed
CMS_150051_1583485007.517990_0 - failed
CMS_171598_1583486810.747297_0 - run
Failed VMs ended up in the rescue console and were left in that state. 20 min after start they were cancelled (by Boinc? LHC?), and 25 min after start there were Computation Errors in Boinc Manager.

CMS - 8 VMs started at the same time, none failed - BOINC/slots/ on USB SD/MMC card
SD card: SanDisk Extreme PRO microSDXC UHS-I, 128GB (up to 90MBps write speed, up to 170MBps read speed), USB 3.0 (5Gbps)
The card didn't show up in Windows Task Manager, so I couldn't measure times or write speeds.
CMS_3974215_1583448025.207379_0
CMS_3974237_1583448025.327024_0
CMS_3922210_1583442611.886740_0
CMS_3922207_1583442611.871799_0
CMS_3971270_1583447724.078240_0
CMS_3922219_1583442611.906543_0
CMS_3910580_1583441410.016243_0
CMS_113714_1583481402.844904_0
The start-up sequence of the above tasks was linear: tasks started one after the other, with approx. 60 sec from one start to the next.

CMS - 8 VMs started at the same time, none failed - BOINC/slots/ on SATA3 SSD
SSD: Samsung SSD 850 EVO (SATA3, 6Gbps), write speed 500MBps, around 60-75s 100% active time
CMS_100680_1583480201.538717_0
CMS_132622_1583483204.854043_0
CMS_121061_1583482003.629244_0
CMS_139006_1583483805.668606_0
CMS_130340_1583482904.579675_0
CMS_136085_1583483505.295626_0
CMS_117122_1583481703.003336_0
CMS_95126_1583479600.843608_0
Tasks ran for around 13 min, and the VMs then powered off around 18 min from start.
Computation Error in Boinc Manager

ID: 41834
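
The "100% active time" figures in the post above are close to what simple arithmetic predicts from the image size and the write speed. The following is a minimal sketch of that estimate, not a measurement. It assumes 1 GB = 1000 MB, that all eight CMS images are written at the quoted sequential speed, and that seek overhead from interleaved writes is ignored. The function name is hypothetical.

```python
def copy_time_seconds(n_tasks: int, image_gb: float, write_mb_per_s: float) -> float:
    """Estimate the time to write n_tasks disk images at a given write speed."""
    total_mb = n_tasks * image_gb * 1000  # assumption: 1 GB = 1000 MB
    return total_mb / write_mb_per_s


CMS_IMAGE_GB = 2.8  # CMS image size quoted in the post

# Write speeds quoted in the post for the HDD and the SSD.
for label, speed in [("HDD at 60 MB/s", 60), ("HDD at 80 MB/s", 80), ("SSD at 500 MB/s", 500)]:
    secs = copy_time_seconds(8, CMS_IMAGE_GB, speed)
    print(f"{label}: {secs:.0f} s ({secs / 60:.1f} min)")
```

Under these assumptions the HDD needs roughly 4.7 to 6.2 minutes and the SSD about 45 seconds just to write the eight images, which is the same order as the 5:30 min, 6:15 min and 60-75 s of full disk activity reported above.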

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert; Joined: 15 Jun 08, Posts: 2413, Credit: 226,503,027, RAC: 131,834)
Message 41856 - Posted: 9 Mar 2020, 12:42:44 UTC - in response to Message 41834.

To avoid overloading your disk IO you should:

1. Not start lots of VMs concurrently. Instead start only one at a time.
2. Not stop lots of VMs concurrently. Instead stop only one at a time.
3. Avoid context switches that automatically lead to (1.) or (2.).

Since the BOINC client doesn't support it, (1.) and (2.) have to be done manually or by a (self-made) script.
The client's behavior in case of (3.) might be better if multithreaded tasks, e.g. N-Body Simulation, are configured to use fewer cores.

A more complex solution would be to run separate BOINC clients for load-critical projects.

ID: 41856
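
A self-made script for points (1.) and (2.) could look roughly like the following. This is a minimal sketch under stated assumptions, not a tested tool: it assumes `boinccmd` is on the PATH and can talk to the running client, that the tasks were suspended before they started, and that the project URL below matches the one the client lists for LHC@home. The helper names, the URL constant and the 120-second delay are illustrative choices, not part of BOINC.

```python
import subprocess
import time
from typing import Callable, Iterable, List

# Assumption: replace with the project URL your client reports for LHC@home.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def task_command(task_name: str, op: str, project_url: str = PROJECT_URL) -> List[str]:
    """Build the boinccmd call that applies op (suspend or resume) to one task."""
    if op not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def staggered(
    task_names: Iterable[str],
    op: str,
    delay_s: float = 120.0,  # hypothetical gap; tune to how long one image copy takes
    run: Callable[..., object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Apply op to each task in turn, waiting delay_s between consecutive tasks."""
    commands: List[List[str]] = []
    for index, name in enumerate(task_names):
        if index > 0:
            sleep(delay_s)  # give the disk time to finish the previous copy or save
        cmd = task_command(name, op)
        run(cmd, check=True)  # raises if boinccmd reports an error
        commands.append(cmd)
    return commands
```

For example, `staggered(["CMS_66148_1583476897.820370_0", "CMS_77654_1583478099.044842_0"], "resume")` would resume the two tasks two minutes apart, and the same call with `"suspend"` would stop them one at a time. Depending on the installation, `boinccmd` may need to be run from the BOINC data directory or be given the GUI RPC password.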

LHC@home