Message boards : CMS Application : CMS computation error in 30 seconds every time

Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 47648 - Posted: 6 Jan 2023, 14:40:56 UTC
Last modified: 6 Jan 2023, 14:44:43 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=375721684

Ryzen 9 3900XT, virtualisation on, VB 7.0.4 with extension pack.

I tried clearing out the VB environment (by deleting everything in the left column of the welcome screen). That didn't fix it.

I also tried deleting files in the LHC folder under projects (I thought they might be corrupt and that I'd be forcing fresh downloads), but that really annoyed it, so I put them back. The files were:

CMS_2022_09_07_prod.vdi
CMS_2022_09_07_prod.xml
vboxwrapper_26198ab7_windows_x86_64.exe
vboxwrapper_26206_windows_x86_64.exe
Mr P Hucker
Message 47649 - Posted: 6 Jan 2023, 14:47:50 UTC

It's happening on all my machines, which are otherwise very different, although they all have the same VB version installed (the latest as of a few weeks ago).

I've disabled CMS for now until someone can interpret the error message.

I was running them ok recently on most machines, and the only change I can think of is updating VB from 5 to 7.
Mr P Hucker
Message 47650 - Posted: 6 Jan 2023, 14:58:09 UTC

Actually, one of my eight machines runs them ok. I can't think why that one is any different.

This one works: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10819977

This one does not work: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10772281
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,190,361
RAC: 103,843
Message 47651 - Posted: 6 Jan 2023, 15:37:53 UTC - in response to Message 47650.  

2023-01-06 14:36:39 (28516): Adding virtual disk drive to VM. (CMS_2022_09_07_prod.vdi)
2023-01-06 14:36:46 (28516): Error in deregister parent vdi for VM: -2135228404
Command:
VBoxManage -q closemedium "C:\programdata\BOINC/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi"
Output:
VBoxManage.exe: error: Cannot close medium 'C:\programdata\BOINC\projects\lhcathome.cern.ch_lhcathome\CMS_2022_09_07_prod.vdi' because it has 4 child media
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "Close()" at line 1862 of file VBoxManageDisk.cpp
Have you checked the Virtual Media Manager in VirtualBox?
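If it helps, the same information is visible from the command line (this only lists things, it changes nothing):

VBoxManage list hdds

That prints every registered hard disk with its UUID, parent UUID, state and location, so the four child media hanging off CMS_2022_09_07_prod.vdi should show up there.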
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,034,005
RAC: 136,818
Message 47652 - Posted: 6 Jan 2023, 15:47:09 UTC - in response to Message 47650.  

None of your machines delivers valid CMS results.
Even this one doesn't (CPU times of just a few minutes are much too low):
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10819977

Unfortunately the reason is not reported back to higher script levels, hence BOINC gets a 'success'.



As for errors like these:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=375708953

1. Shut down BOINC
2. Clean the VirtualBox Medium Registry -> Remove all CMS disks (parent and all related children); there's a command-line sketch just after this list
3. Restart BOINC
4. Reset LHC@home to force fresh vdi downloads
5. Start only 1 CMS task (as a 'pilot') to allow VirtualBox to register the vdi correctly
6. Then (after 30-60 s) start other CMS tasks
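If you prefer the command line to the Media Manager GUI, step 2 can be done roughly like this (the UUIDs are placeholders you take from the list output, and BOINC must already be shut down):

VBoxManage list hdds
VBoxManage closemedium disk <child-uuid> --delete
VBoxManage closemedium disk <parent-uuid>

Close the children first; the parent refuses to close while child media are still registered, which is exactly the error shown in the log above.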

The more you start concurrently, the more pressure you put on disk I/O as well as on the LAN/internet.
Mr P Hucker
Message 47653 - Posted: 6 Jan 2023, 16:55:14 UTC
Last modified: 6 Jan 2023, 17:16:20 UTC

The good PC you mentioned is now doing them ok; the tasks it's running are at 1.5 hours of CPU time so far.

Disk I/O isn't a problem: half the machines have NVMe drives and the other half ordinary SSDs. I only use rotating disks for backups and TV/film/security-camera storage.

I think last time I ran CMS tasks I was maxing out the uplink on my internet connection, but shouldn't the tasks just be patient? Between them I was sending about 6.5 Mbit/s successfully (my uplink line speed).

I'll attempt to clean up VB on the dodgy PCs in a moment. Maybe upgrading from 5 to 7 left something corrupted?
Mr P Hucker
Message 47654 - Posted: 6 Jan 2023, 19:53:03 UTC

I've fixed one of the PCs following the steps you mentioned, and will now do the same with the others. Thanks. I must have corrupted the VB environment somehow, probably changing it from V6 to V5 to V7.
Mr P Hucker
Message 47655 - Posted: 6 Jan 2023, 20:11:21 UTC - in response to Message 47652.  

Unfortunately the reason is not reported back to higher script levels, hence BOINC gets a 'success'.
That's worrying, they're marked as valid in my list of tasks on the server. Does the system later notice a problem and resend those tasks? Or does it royally screw up the science?

I will of course be keeping an eye on the run time vs CPU time on the server list of tasks to make sure mine behave and are useful.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 47658 - Posted: 11 Jan 2023, 23:13:09 UTC - in response to Message 47655.  

Unfortunately the reason is not reported back to higher script levels, hence BOINC gets a 'success'.
That's worrying, they're marked as valid in my list of tasks on the server. Does the system later notice a problem and resend those tasks?
In general, yes. This runs at two levels -- WMAgent produces jobs and sends them to the HTCondor server. Your VM instances (i.e. BOINC tasks) ask the condor server for a job. If that job terminates with an error, or your VM goes out of contact with the server for too long (currently two hours), the condor server requeues it and sends it to a new VM when the queueing allows. If there are several errors for the one job (currently three, IIRC), the condor server notifies the WMAgent, which then requeues the job itself for future resubmission back to the condor server. If the job terminates without error, the VM will ask for another job, until 12+ hours have elapsed in total.
Or does it royally screw up the science?
No. If you look at the job graphs from the homepage, you will see 5-10% job failures, but those are only the primary failures seen by condor. In the (unfortunately non-public) monitoring we see that the ultimate failure rate is essentially zero for every 20,000-job workflow submission. We do tend to be generous and allow credit for the CPU time given even where we could detect a failure -- as alluded to above -- but egregious errors will not get any credit.
I will of course be keeping an eye on the run time vs CPU time on the server list of tasks to make sure mine behave and are useful.
You should, ideally, be seeing task logs that look like mine -- https://lhcathome.cern.ch/lhcathome/results.php?userid=14095 -- tasks running for 12 hours or so with CPU time slightly less. Each task (VM instance) is therefore running 5 or 6 2-hour CMS jobs before terminating to allow BOINC to start a new task.
Mr P Hucker
Message 47661 - Posted: 12 Jan 2023, 11:38:39 UTC - in response to Message 47658.  

All seems to be going well, except I notice I'm running short of internet bandwidth. Is CMS different to a year or so ago? It seems to max out my 7 Mbit/s upload, using only a third of my cores! I'm supposed to get 200 Mbit/s upload some time this year though - the engineers are digging up neighbouring streets as we speak!
computezrmle
Message 47663 - Posted: 12 Jan 2023, 12:24:40 UTC - in response to Message 47661.  

Rough estimation:
A 7 Mbit/s upload bandwidth will be fully saturated by 50 CMS VMs running concurrently.
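(As a rough average that is 7 Mbit/s ÷ 50 ≈ 0.14 Mbit/s, i.e. somewhere around 17 kB/s of sustained upload per VM. The real traffic comes in bursts, so fewer VMs can still hit the ceiling from time to time.)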
Mr P Hucker
Message 47664 - Posted: 12 Jan 2023, 13:31:48 UTC - in response to Message 47663.  
Last modified: 12 Jan 2023, 13:33:19 UTC

Rough estimation:
A 7 Mbit/s upload bandwidth will be fully saturated by 50 CMS VMs running concurrently.
More like 30 in my case. But I guess it depends on CPU speed. I have 126 cores :-/

Did CMS work like this before? I don't recall having a bandwidth problem.

Why can't it upload the results at the end through BOINC's normal upload queue, instead of stalling the processing?

Is there an easy way to get equal numbers of Theory and ATLAS as well? I've asked for anything and only get CMS. If I could do some of each, and the others don't have the same bandwidth requirements, I could run all cores on LHC.
ivan
Message 47665 - Posted: 12 Jan 2023, 14:53:01 UTC - in response to Message 47664.  

Rough estimation:
A 7 Mbit/s upload bandwidth will be fully saturated by 50 CMS VMs running concurrently.
More like 30 in my case. But I guess it depends on CPU speed. I have 126 cores :-/

Did CMS work like this before? I don't recall having a bandwidth problem.

Why can't it upload the results at the end through BOINC's normal upload queue, instead of stalling the processing?

Is there an easy way to get equal numbers of Theory and ATLAS as well? I've asked for anything and only get CMS. If I could do some of each, and the others don't have the same bandwidth requirements, I could run all cores on LHC.

From your latest good task:
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Could not find a local HTTP proxy
2023-01-11 20:04:50 (7092): Guest Log: [INFO] CVMFS and Frontier will have to use DIRECT connections
2023-01-11 20:04:50 (7092): Guest Log: [INFO] This makes the application less efficient
2023-01-11 20:04:50 (7092): Guest Log: [INFO] It also puts higher load on the project servers
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Setting up a local HTTP proxy is highly recommended
2023-01-11 20:04:51 (7092): Guest Log: [INFO] Advice can be found in the project forum
Remember that you download a lot of data as well. If you had a local squid proxy you could greatly reduce your downloads and the consequent time spent. I think the instructions are in the Number Crunching forum.
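Just to give a feel for it, a bare-bones squid.conf is only a handful of lines. This is an illustration, not the project's recommended configuration (use the Number Crunching instructions for that), and the subnet, cache sizes and cache path are assumptions you would adapt (the path shown is the Linux default; on Windows it would be wherever Squid is installed):

http_port 3128
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
cache_mem 256 MB
maximum_object_size 1 GB
cache_dir ufs /var/spool/squid 20000 16 256

Each BOINC client on the LAN can then be pointed at that machine's port 3128 via its HTTP proxy settings; the VM images check for a local proxy at startup, which is what the 'Could not find a local HTTP proxy' message in the log above refers to.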
Mr P Hucker
Message 47666 - Posted: 12 Jan 2023, 15:09:09 UTC - in response to Message 47665.  

Remember that you download a lot of data as well. If you had a local squid proxy you could greatly reduce your downloads and the consequent time spent. I think the instructions are in the Number Crunching forum.
I have heard of that but it sounds complicated. I'm a Windows user, not a Linux geek. I did set up a simple Apache server once but that was a decade or two ago.

However, it's my upload that's limiting how many I can run. I assume the proxy wouldn't help there.

Can I ask why common files that more than one task is going to need aren't retained in the BOINC folders for them all to access?
computezrmle
Message 47667 - Posted: 12 Jan 2023, 15:34:19 UTC - in response to Message 47666.  

I have heard of that but it sounds complicated. I'm a Windows user, not a Linux geek. I did set up a simple Apache server once but that was a decade or two ago.

Less complicated than setting up an Apache server.
And Ivan didn't mention Linux - Squid is also available on Windows.


However, it's my upload that's limiting how many I can run. I assume the proxy wouldn't help there.

Right, it doesn't help much with uploads; just a very little bit, since it reduces the number of TCP connections, which always send some packets back to the servers.


Can I ask why common files that more than one task is going to need aren't retained in the BOINC folders for them all to access?

1. The data is stored in huge repositories and databases (... think larger ... still much larger!).
2. Data on those repositories can easily be updated by the scientists, and the changes are quickly distributed worldwide.
computezrmle
Message 47668 - Posted: 12 Jan 2023, 15:56:03 UTC - in response to Message 47664.  

Did CMS work like this before?

Yes, for years.


Why can't it upload the results at the end through BOINC's normal upload queue, instead of stalling the processing?

Right now we are running 614 CMS jobs via BOINC, while CERN and affiliated datacenters are running more than 114,000 jobs (not via BOINC).
They just use different sets of input parameters.
This requires a process setup optimized for the 'big dogs'.


Is there an easy way to get equal numbers of Theory and ATLAS as well? I've asked for anything and only get CMS. If I could do some of each, and the others don't have the same bandwidth requirements, I could run all cores on LHC.

Since the standard BOINC server is used, it sends out whatever is next in the task queue and accepted by the requesting client.
The best option would be to run multiple BOINC clients on the same box and connect them to different venues.
Each venue can then be set to run only ATLAS, CMS or Theory.
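On Windows a second client instance can be started along these lines (the data directory and port number are just example values; 31416 is the default RPC port, so an extra instance needs its own):

boinc.exe --allow_multiple_clients --dir D:\BOINC2 --gui_rpc_port 31418

Attach that instance to LHC@home, give the host a different location (venue) in your project preferences, and it will only fetch the application you selected for that venue.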
Mr P Hucker
Message 47672 - Posted: 14 Jan 2023, 12:50:04 UTC

I've found a nice way of monitoring traffic. The internet router is now connected only to my main desktop in the house; a second network card has another Ethernet cable going to the garage.

Performance Monitor in Windows still monitors the two connections in the bridge separately (stupid Task Manager does not). It may be possible to graph it in Performance Monitor, but since I already use MSI Afterburner, I can import it into that instead. I now have a separate graph for each of:

Upload from any machine to the internet
Download from the internet to any machine
(These two are important so I can easily see whether a bottleneck in getting or sending data is stopping all the CPUs from staying busy.)

Upload from the garage machines to the internet (or to the house machine, which is rare unless I'm installing stuff or transferring files)
Download from the internet (or from the house machine, which is rare unless I'm installing stuff or transferring files) to the garage machines

I shall have a go with squid later and see how much difference it makes.
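For anyone wanting the same numbers without Afterburner, PowerShell can read the same counters, e.g.

Get-Counter -Counter '\Network Interface(*)\Bytes Sent/sec' -SampleInterval 1 -MaxSamples 5

which samples the upload rate of every adapter once a second for five seconds; swap in 'Bytes Received/sec' for downloads.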
computezrmle
Message 47674 - Posted: 15 Jan 2023, 9:28:58 UTC - in response to Message 47672.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268

Looks like the P.H. LAN as a whole is now misconfigured.
The publicly available logfiles don't say why, but there's no way a CMS subtask can genuinely complete in only 2:30 min of CPU time.
My guess would be that some internet data requested by deeper-level scripts can't be downloaded (=> timeout) and the error doesn't arrive at the BOINC level.
It would require a CERN expert to look through those deeper-level logs.


@P.H.
Since you have set up some very unusual packet redirections, you may have forgotten to forward all required ports in both directions.
Mr P Hucker
Message 47675 - Posted: 15 Jan 2023, 9:45:47 UTC - in response to Message 47674.  
Last modified: 15 Jan 2023, 10:03:43 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268

Looks like the P.H. LAN as a whole is now misconfigured.
The publicly available logfiles don't say why, but there's no way a CMS subtask can genuinely complete in only 2:30 min of CPU time.
My guess would be that some internet data requested by deeper-level scripts can't be downloaded (=> timeout) and the error doesn't arrive at the BOINC level.
It would require a CERN expert to look through those deeper-level logs.


@P.H.
Since you have set up some very unusual packet redirections, you may have forgotten to forward all required ports in both directions.
This is absurd. You need to fix your programming. The only thing wrong at my end is the lack of bandwidth. If the program cannot get or send what it needs to LHC, it should keep retrying or give up and say so. Or maybe just be more patient? I can download or upload anything with anything without a problem. Why can't your program cope with a slow connection?

I have not misconfigured anything. I have FTTC --> ISP router --> Windows 11 PC with 2 bridged ethernets --> unmanaged switch --> 7 Windows 11 PCs. Nothing unusual, nothing special.

There are no "very unusual packet redirections" unless that was when I turned on a VPN (on the main machine for a short time (an hour)). Can't CMS cope with that? I don't know if the other 7 PCs would end up going through the VPN or not. It's Ivacy if that means anything to you. "Ryzen", the computer which did the task you linked to, is the main PC with the bridge and occasionally a VPN. It's in Norway, which is sort of halfway between us anyway!

All 8 machines can access the internet just fine, I can view webpages, Boinc uploads and downloads just fine. It's your program or virtual box getting all confused.

I'm changing some machines over to Atlas until the bandwidth isn't a limitation, to get around the bad programming. I'm monitoring for weird stuff like not taking many hours of CPU time, and moving more PCs over to Atlas until that stops happening.
Mr P Hucker
Message 47676 - Posted: 15 Jan 2023, 9:54:45 UTC - in response to Message 47667.  

I have heard of that but it sounds complicated. I'm a Windows user, not a Linux geek. I did set up a simple Apache server once but that was a decade or two ago.
Less complicated than setting up an Apache server.
It doesn't seem necessary; hardly any of the data transfer is in the download direction.

And Ivan didn't mention Linux - Squid is also available on Windows.
I actually meant Linux users are more likely to know how to set up stuff like Squid.

Can I ask why common files that more than one task is going to need aren't retained in the BOINC folders for them all to access?
1. The data is stored in huge repositories and databases (... think larger ... still much larger!).
2. Data on those repositories can easily be updated by the scientists, and the changes are quickly distributed worldwide.
I don't see why Squid is any different from BOINC caching it itself (other than Squid serving all my PCs instead of just one). When a CMS task on my machine requests a file, why can't it keep a local copy from last time, ask the server which version is current, and skip the download if it hasn't changed? I assume that's what Squid is doing, and I assume it's what web browsers do with images.
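As an illustration of the mechanism I mean (the URL is just a placeholder), curl can make that kind of conditional request:

curl -z CMS_2022_09_07_prod.vdi -o CMS_2022_09_07_prod.vdi https://example.org/CMS_2022_09_07_prod.vdi

The -z option sends the local file's timestamp, and the server only sends the file back if it has a newer version; otherwise it replies '304 Not Modified' with no body, which is the saving a cache gives you.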