Message boards :
CMS Application :
CMS computation error in 30 seconds every time
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

https://lhcathome.cern.ch/lhcathome/result.php?resultid=375721684

Ryzen 9 3900XT, virtualisation on, VirtualBox 7.0.4 with extension pack. I tried clearing out the VirtualBox environment (by deleting everything in the left column of the welcome screen). Didn't fix it. I tried deleting files in the LHC folder under projects, but that really annoyed it, so I put them back (I thought they might be corrupt and I'd be forcing them to download again). The files were:
CMS_2022_09_07_prod.vdi
CMS_2022_09_07_prod.xml
vboxwrapper_26198ab7_windows_x86_64.exe
vboxwrapper_26206_windows_x86_64.exe
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

It's happening on all my very different machines, although they all have the same (latest as of a few weeks ago) VirtualBox installed. I've disabled CMS for now until someone can interpret the error message. I was running them OK recently on most machines, and the only change I can think of is updating VirtualBox from 5 to 7.
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

Actually, one of eight machines runs them OK. I can't think why that one is any different.

This one works:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10819977

This one does not work:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10772281
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456

2023-01-06 14:36:39 (28516): Adding virtual disk drive to VM. (CMS_2022_09_07_prod.vdi)
2023-01-06 14:36:46 (28516): Error in deregister parent vdi for VM: -2135228404
Command: VBoxManage -q closemedium "C:\programdata\BOINC/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi"
Output: VBoxManage.exe: error: Cannot close medium 'C:\programdata\BOINC\projects\lhcathome.cern.ch_lhcathome\CMS_2022_09_07_prod.vdi' because it has 4 child media
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "Close()" at line 1862 of file VBoxManageDisk.cpp

Have you checked the Virtual Media Manager in VirtualBox?
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

None of your machines delivers valid CMS results. Even this one doesn't (CPU times of just a few minutes are much too low):
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10819977
Unfortunately the reason is not reported back to higher script levels, hence BOINC gets a 'success'.

As for errors like these: https://lhcathome.cern.ch/lhcathome/result.php?resultid=375708953
1. Shut down BOINC
2. Clean the VirtualBox Medium Registry -> remove all CMS disks (parent and all related children)
3. Restart BOINC
4. Reset LHC@home to force fresh vdi downloads
5. Start only 1 CMS task (as a 'pilot') to allow VirtualBox a correct vdi registration
6. Then (after 30-60 s) start other CMS tasks

The more tasks you start concurrently, the more pressure you put on the disk IO as well as on the LAN/internet.
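Step 2 (cleaning the medium registry) can also be scripted. Below is a minimal sketch, assuming a standard `VBoxManage` on the PATH; the `list hdds` record format is standard VirtualBox output, but the UUIDs and paths are invented for illustration. The point is the ordering: child media must be closed before their parent, which is exactly what the VBOX_E_OBJECT_IN_USE error in the log above complains about.

```python
def parse_hdds(text):
    """Parse 'VBoxManage list hdds' output into one dict per disk record."""
    disks, current = [], {}
    for line in text.splitlines():
        if not line.strip():              # blank line separates disk records
            if current:
                disks.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        disks.append(current)
    return disks

def cms_close_order(disks):
    """Return UUIDs of CMS disks, children first, so each close succeeds."""
    cms = [d for d in disks if "CMS" in d.get("Location", "")]
    children = [d["UUID"] for d in cms if d.get("Parent UUID", "base") != "base"]
    parents = [d["UUID"] for d in cms if d.get("Parent UUID", "base") == "base"]
    return children + parents

# Illustrative 'VBoxManage list hdds' output (UUIDs are invented):
sample = """UUID:           11111111-aaaa-bbbb-cccc-000000000001
Parent UUID:    base
Location:       C:\\ProgramData\\BOINC\\projects\\lhcathome.cern.ch_lhcathome\\CMS_2022_09_07_prod.vdi

UUID:           11111111-aaaa-bbbb-cccc-000000000002
Parent UUID:    11111111-aaaa-bbbb-cccc-000000000001
Location:       C:\\VirtualBox VMs\\boinc_1\\CMS_child.vdi
"""

for uuid in cms_close_order(parse_hdds(sample)):
    # Real cleanup (with BOINC shut down) would run: VBoxManage closemedium disk <uuid>
    print(uuid)
```

Feeding the real `VBoxManage list hdds` output through `parse_hdds` and closing media in that order matches steps 1-2 above; the Virtual Media Manager GUI does the same thing interactively.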
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

The good PC you mentioned is now doing them OK. The ones it's running are at 1.5 hours CPU time so far. Disk IO isn't a problem: half my machines have NVMe drives, the other half SSDs. I don't use rotary disks except for backups and TV/film/security camera storage. I think last time I ran CMS tasks I was maxing out the uplink on my internet, but shouldn't the tasks just be patient? Between them I was sending about 6.5 Mbit/s successfully (my uplink line speed). I'll attempt to clean the dodgy PCs' VirtualBox environments in a moment. Maybe upgrading from 5 to 7 left corruptions?
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

I've fixed one of the PCs following the steps you mentioned, and will now do the same with the others. Thanks. I must have corrupted the VirtualBox environment somehow, probably changing it from V6 to V5 to V7.
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

"Unfortunately the reason is not reported back to higher script levels, hence BOINC gets a 'success'."

That's worrying; they're marked as valid in my list of tasks on the server. Does the system later notice a problem and resend those tasks? Or does it royally screw up the science? I will of course be keeping an eye on the run time vs CPU time on the server list of tasks to make sure mine behave and are useful.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298

"Does the system later notice a problem and resend those tasks?"

In general, yes. This runs at two levels: WMAgent produces jobs and sends them to the HTCondor server. Your VM instances (i.e. BOINC tasks) ask the condor server for a job. If that job terminates with an error, or your VM goes out of contact with the server for too long (currently two hours), the condor server requeues it and sends it to a new VM when the queueing allows. If there are several errors for the one job (currently three, IIRC) the condor server notifies the WMAgent, which then itself requeues the job for future resubmission back to the condor server. If the job terminates without error, then the VM will ask for another job, up until 12+ hours have elapsed in total.

"Or does it royally screw up the science?"

No. If you look at the job graphs from the homepage, you will see 5-10% job failures, but these are the primary failures seen by condor. In the (unfortunately non-public) monitoring we see the ultimate failure rate is essentially zero for every 20,000-job workflow submission. We do tend to be generous and allow credit for CPU time given even when we could detect a failure, as alluded to above, but egregious errors will not get any credit.

"I will of course be keeping an eye on the run time vs CPU time on the server list of tasks to make sure mine behave and are useful."

You should, ideally, be seeing task logs that look like mine (https://lhcathome.cern.ch/lhcathome/results.php?userid=14095): tasks running for 12 hours or so with CPU time slightly less. Each task (VM instance) is therefore running 5 or 6 two-hour CMS jobs before terminating to allow BOINC to start a new task.
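The two-level scheme just described can be sketched as a toy simulation. The 2-hour job length, 12-hour task budget, and requeue-on-error behaviour are taken from the post above; the function and variable names are invented for illustration.

```python
MAX_ERRORS = 3        # condor-side errors before the job goes back to WMAgent
TASK_BUDGET_H = 12    # a VM keeps asking for jobs until ~12 hours have elapsed
JOB_LENGTH_H = 2      # typical CMS job length from the post

def run_vm(job_outcomes):
    """Simulate one VM (BOINC task): pull jobs until the 12-hour budget is spent.

    job_outcomes: iterable of booleans, True = the job succeeded.
    Returns (jobs_completed, jobs_requeued).
    """
    elapsed, completed, requeued = 0.0, 0, 0
    for ok in job_outcomes:
        if elapsed >= TASK_BUDGET_H:
            break                 # budget spent: the VM terminates, BOINC starts a new task
        elapsed += JOB_LENGTH_H
        if ok:
            completed += 1
        else:
            requeued += 1         # condor requeues; after MAX_ERRORS it goes to WMAgent
    return completed, requeued

# Six successful 2-hour jobs fill the 12-hour budget:
print(run_vm([True] * 10))        # -> (6, 0)
```

This is only a model of the scheduling logic, not of the actual HTCondor/WMAgent protocol.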
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

All seems to be going well, except I notice I'm running short of internet bandwidth. Is CMS different to a year or so ago? It seems to max out my 7 Mbit/s upload using only a third of my cores! I'm supposed to get 200 Mbit/s upload some time this year though - the engineers are digging up neighbouring streets as we speak!
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

Rough estimation: a 7 Mbit/s upload bandwidth will be fully saturated by 50 CMS VMs running concurrently.
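Taking that estimate at face value, the implied per-VM upload rate, and what a 126-core host would need, is simple arithmetic (back-of-the-envelope numbers only, using the figures quoted in this thread):

```python
LINK_MBIT = 7.0          # upload bandwidth from the post
VMS_AT_SATURATION = 50   # rough estimate quoted above

per_vm = LINK_MBIT / VMS_AT_SATURATION   # average Mbit/s each CMS VM uploads
print(per_vm)                            # 0.14

cores = 126                              # cores available in this thread
print(cores * per_vm)                    # ~17.6 Mbit/s needed to keep them all busy
```

So a 7 Mbit/s uplink covers only about 50 of the 126 cores, which matches the "a third of my cores" observation above.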
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

"Rough estimation: a 7 Mbit/s upload bandwidth will be fully saturated by 50 CMS VMs running concurrently."

30. But I guess it depends on CPU speed. I have 126 cores :-/ Did CMS used to work like this? I don't recall having a bandwidth problem before. Why can't it upload results at the end through the normal BOINC queue instead of stalling the processing? Is there an easy way to get equal numbers of Theory and Atlas as well? I've asked for anything and only get CMS. If I could do some of each, and the others don't have the same bandwidth requirements, I could run all cores on LHC.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298

"But I guess it depends on CPU speed. I have 126 cores :-/"

From your latest good task:
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Could not find a local HTTP proxy
2023-01-11 20:04:50 (7092): Guest Log: [INFO] CVMFS and Frontier will have to use DIRECT connections
2023-01-11 20:04:50 (7092): Guest Log: [INFO] This makes the application less efficient
2023-01-11 20:04:50 (7092): Guest Log: [INFO] It also puts higher load on the project servers
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Setting up a local HTTP proxy is highly recommended
2023-01-11 20:04:51 (7092): Guest Log: [INFO] Advice can be found in the project forum

Remember that you download a lot of data as well. If you had a local squid proxy you could greatly reduce your downloads and the consequent time spent. I think the instructions are in the Number Crunching forum.
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

"Remember that you download a lot of data as well. If you had a local squid proxy you could greatly reduce your downloads and the consequent time spent. I think the instructions are in the Number Crunching forum."

I have heard of that, but it sounds complicated. I'm a Windows user, not a Linux geek. I did set up a simple Apache server once, but that was a decade or two ago. However, it's my upload that's limiting how many I can run. I assume the proxy wouldn't help there. Can I ask why common files which more than one task is going to need aren't retained in the BOINC folders for them all to access?
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

"I have heard of that, but it sounds complicated. I'm a Windows user, not a Linux geek. I did set up a simple Apache server once, but that was a decade or two ago."

Less complicated than setting up an Apache server. And Ivan didn't mention Linux: Squid is also available on Windows.

"However, it's my upload that's limiting how many I can run. I assume the proxy wouldn't help there."

Right, it doesn't help much for uploads, just a very little bit, since it avoids TCP connections which always send packets back to the servers.

"Can I ask why common files which more than one task is going to need aren't retained in the BOINC folders for them all to access?"

1. Data is stored on huge repositories and DBs. (... think larger ... still much larger!)
2. Data on those repositories can easily be modified by the scientists, and the updates quickly distributed worldwide, so nobody works from outdated copies.
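For reference, a caching Squid instance needs only a handful of directives. The fragment below is an illustrative sketch, not the project's recommended configuration; the Number Crunching forum guide mentioned above should take precedence, including Windows specifics such as the cache directory path. The subnet, port, and cache sizes here are assumptions.

```
# Illustrative squid.conf sketch -- adjust addresses and paths to your LAN.
http_port 3128
acl localnet src 192.168.0.0/16        # the subnet your crunchers live on
http_access allow localnet
http_access deny all
cache_mem 256 MB
maximum_object_size 1 GB               # CVMFS/Frontier objects can be sizeable
cache_dir ufs /var/spool/squid 20000 16 256
```

The VMs also have to be told where the proxy is; the forum guide covers that part.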
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

"Did CMS used to work like this?"

Yes, for years.

"Why can't it upload results at the end through the normal BOINC queue instead of stalling the processing?"

We are currently running 614 CMS jobs via BOINC while CERN and affiliated datacenters are running more than 114,000 jobs (not via BOINC). They just use different sets of input parameters. This requires a process setup optimized for the 'big dogs'.

"Is there an easy way to get equal numbers of Theory and Atlas as well? I've asked for anything and only get CMS. If I could do some of each, and the others don't have the same bandwidth requirements, I could run all cores on LHC."

Since the standard BOINC server is used, it sends out whatever is next in the task queue and accepted by the requesting client. Best would be to run multiple BOINC clients on the same box and connect them to different venues. Each venue can then be set to run either ATLAS/CMS/Theory.
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

Found a nice way of monitoring traffic. I have the internet router now connected to only my main desktop in the house. A second network card has another ethernet cable going to the garage. Performance Monitor in Windows still monitors the two connections in the bridge separately (stupid Task Manager does not). It may be possible to display it in Performance Monitor, but since I already use MSI Afterburner, I can import it into there. I now have a separate graph for each of:
Upload from any machine to the internet
Download from the internet to any machine
(These two are important so I can easily see if a bottleneck is stopping the CPUs from getting or sending the data they need to stay busy.)
Upload from garage machines to the internet (or to the house machine, which is rare unless I'm installing stuff or transferring files)
Download from the internet (or from the house machine, likewise rare) to garage machines
I shall have a go with squid later and see how much difference it makes.
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268

Looks like the P.H. LAN as a whole is now misconfigured. The publicly available logfiles don't tell why, but there's no way to complete a CMS subtask within only 2:30 min of CPU time. My guess would be that some internet data requested by deeper-level scripts can't be downloaded (=> timeout) and the error doesn't arrive at the BOINC level. It would require a CERN expert to look through those deeper-level logs.

@P.H.
Since you set up very unusual packet redirections, you may have forgotten to forward all required ports in both directions.
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

"https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268"

This is absurd. You need to fix your programming. The only thing wrong at my end is the lack of bandwidth. If the program cannot get or send what it needs to LHC, it should keep retrying, or give up and say so. Or maybe just be more patient? I can download or upload anything with anything without a problem. Why can't your program cope with a slow connection? I have not misconfigured anything. I have FTTC --> ISP router --> Windows 11 PC with 2 bridged ethernets --> unmanaged switch --> 7 Windows 11 PCs. Nothing unusual, nothing special. There are no "very unusual packet redirections", unless that was when I turned on a VPN (on the main machine, for about an hour). Can't CMS cope with that? I don't know if the other 7 PCs would end up going through the VPN or not. It's Ivacy, if that means anything to you. "Ryzen", the computer which did the task you linked to, is the main PC with the bridge and occasionally a VPN. The VPN endpoint is in Norway, which is sort of halfway between us anyway! All 8 machines can access the internet just fine; I can view webpages, and BOINC uploads and downloads just fine. It's your program or VirtualBox getting all confused. I'm changing some machines over to Atlas until the bandwidth isn't a limitation, to get around the bad programming. I'm monitoring for weird stuff like tasks not taking many hours of CPU time, and moving more PCs over to Atlas until that stops happening.
Joined: 12 Aug 06 Posts: 429 Credit: 10,591,167 RAC: 702

"Less complicated than setting up an Apache server. And Ivan didn't mention Linux: Squid is also available on Windows."

I actually meant that Linux users are more likely to know how to set up stuff like Squid. Anyway, it doesn't seem necessary: hardly any of the data transfer is in the download direction.

"1. Data is stored on huge repositories and DBs. (... think larger ... still much larger!)"

I don't see why Squid is any different to BOINC caching it itself (other than Squid being for all my PCs instead of just one). When a CMS task on my machine requests a file, why can't it have a local copy here from last time, ask the server what the latest one is, and if it's dated the same, not bother downloading it? I assume that's what Squid is doing. And I assume it's what web browsers do with images.
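That intuition matches how HTTP revalidation works: both Squid and web browsers keep a local copy plus its Last-Modified date and only re-download when the server's copy is newer (a real client sends If-Modified-Since and gets a 304 back when nothing changed). A minimal sketch of the decision logic, with an invented helper name:

```python
from typing import Optional
from email.utils import parsedate_to_datetime

def need_download(cached_last_modified: Optional[str], server_last_modified: str) -> bool:
    """Decide whether a cached copy must be refreshed.

    Both arguments are HTTP-date strings as carried in the Last-Modified header.
    Returns True when there is no cached copy or the server's copy is newer.
    """
    if cached_last_modified is None:
        return True
    return parsedate_to_datetime(server_last_modified) > parsedate_to_datetime(cached_last_modified)

# No local copy yet -> download; unchanged on the server -> reuse the cache.
print(need_download(None, "Sat, 07 Jan 2023 10:00:00 GMT"))                             # True
print(need_download("Sat, 07 Jan 2023 10:00:00 GMT", "Sat, 07 Jan 2023 10:00:00 GMT"))  # False
```

The difference from per-client BOINC caching is only scope: one Squid instance revalidates once for the whole LAN instead of once per machine.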
©2024 CERN