CMS@Home difficulties in attempts to prepare for multi-core jobs
Author | Message |
---|---|
Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
We've been having some problems lately as we prepare to allow multi-core jobs to be run in CMS@Home (you've probably noticed...). Unfortunately some of the configurations are beyond our control, and we have to request changes as we find problems and determine a potential fix for them. We ask for your patience at this time while we work through the difficulties, and would fully understand if you chose to pause your participation in the project while we try to get on top of things. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
We did have multi-core CMS-tasks in the past. However, at that time a 4-core VM (for example) ran 4 CMS jobs at the same time, and that is not very useful. The only useful multi-core application would be 1 CMS job using more cores/threads to speed up that single job. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
This is an example of a CMS task running now: stderr: INFO:root:RUNNING SCRAM SCRIPTS INFO:root:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_11_0_0_pre1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', ''] INFO:root:PSS: 701620; RSS: 700240; PCPU: 85.6; PMEM: 34.4 INFO:root:PSS: 541572; RSS: 540820; PCPU: 90.8; PMEM: 26.5 INFO:root:PSS: 612008; RSS: 610932; PCPU: 92.8; PMEM: 30.0 INFO:root:PSS: 618280; RSS: 617404; PCPU: 93.8; PMEM: 30.3 INFO:root:PSS: 642996; RSS: 642180; PCPU: 94.4; PMEM: 31.5 INFO:root:PSS: 670308; RSS: 669424; PCPU: 94.9; PMEM: 32.8 INFO:root:PSS: 671120; RSS: 670496; PCPU: 95.1; PMEM: 32.9 INFO:root:PSS: 680184; RSS: 679308; PCPU: 95.4; PMEM: 33.3 INFO:root:PSS: 680608; RSS: 679832; PCPU: 95.6; PMEM: 33.4 INFO:root:PSS: 680972; RSS: 680132; PCPU: 95.7; PMEM: 33.4 INFO:root:PSS: 682196; RSS: 681960; PCPU: 95.9; PMEM: 33.5 INFO:root:PSS: 683192; RSS: 681560; PCPU: 96.0; PMEM: 33.4 INFO:root:PSS: 682448; RSS: 681656; PCPU: 96.0; PMEM: 33.4 ATLAS has multicore. |
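The monitoring lines in a stderr excerpt like the one above have a regular shape, so they are easy to summarise. Here is a minimal sketch in Python, assuming that PCPU is a CPU percentage in which 100 corresponds to one fully busy core (the log itself does not say so). `parse_monitor_lines` and `cores_in_use` are hypothetical helper names, not part of BOINC or CMSSW.

```python
import re

# Matches samples like "PSS: 701620; RSS: 700240; PCPU: 85.6; PMEM: 34.4"
MONITOR_RE = re.compile(
    r"PSS: (?P<pss>\d+); RSS: (?P<rss>\d+); "
    r"PCPU: (?P<pcpu>[\d.]+); PMEM: (?P<pmem>[\d.]+)"
)


def parse_monitor_lines(text):
    """Return one dict per monitoring sample found in the text."""
    samples = []
    for match in MONITOR_RE.finditer(text):
        samples.append({
            "pss": int(match.group("pss")),
            "rss": int(match.group("rss")),
            "pcpu": float(match.group("pcpu")),
            "pmem": float(match.group("pmem")),
        })
    return samples


def cores_in_use(samples):
    """Estimate busy cores from the latest sample (ASSUMES 100 % == one core)."""
    return samples[-1]["pcpu"] / 100.0


if __name__ == "__main__":
    # First and last samples copied from the excerpt in the post.
    log = (
        "INFO:root:PSS: 701620; RSS: 700240; PCPU: 85.6; PMEM: 34.4 "
        "INFO:root:PSS: 682448; RSS: 681656; PCPU: 96.0; PMEM: 33.4"
    )
    samples = parse_monitor_lines(log)
    print(len(samples), "samples; estimated busy cores:", cores_in_use(samples))
```

Under that assumption, a value that stays just below 100 points to a single-threaded cmsRun, whereas a true multi-threaded job would be expected to climb well above 100.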
Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
"We did have multi-core CMS-tasks in the past." Yes, that wasn't very efficient -- I found that running just a 2-core VM with two individual cmsRun jobs was the most efficient. CMS tends to want to run 4-thread jobs in a 4-core VM these days (maybe even x8) so that's why we want to try to get that running. Coordinating with all the different levels of configuration is the main problem. |
Send message Joined: 28 Dec 08 Posts: 339 Credit: 4,865,275 RAC: 129 |
What is this all about? VBoxManage -q closemedium "D:\data/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi" Output: VBoxManage.exe: error: Cannot close medium 'D:\data\projects\lhcathome.cern.ch_lhcathome\CMS_2022_09_07_prod.vdi' because it has 1 child media VBoxManage.exe: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee IUnknown VBoxManage.exe: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp 2024-02-15 18:53:38 (25156): Could not create VM 2024-02-15 18:53:38 (25156): ERROR: VM failed to start 2024-02-15 18:53:38 (25156): Powering off VM. 2024-02-15 18:53:38 (25156): Deregistering VM. (boinc_418a52e6b5534c75, slot#26) 2024-02-15 18:53:38 (25156): Removing network bandwidth throttle group from VM. 2024-02-15 18:53:39 (25156): Removing VM from VirtualBox. Every single CMS task I get bombs like this. I think it's time to take a break from this project until you guys can figure out what's going on. My RAC has tanked to almost nothing and I have over 60 errors so far between CMS and Theory. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
Does your VirtualBox Manager show a yellow triangle? This child-media error normally comes from that. |
Send message Joined: 28 Dec 08 Posts: 339 Credit: 4,865,275 RAC: 129 |
"Does your VirtualBox Manager show a yellow triangle?" No, I just reinstalled it this morning (EU time), as well as the extension pack. The last successful task was 5 days ago and then everything went to hell. But that was a Theory task. The last CMS task to complete OK was on 9 February, and nothing has completed since then. At around 0830 CET I reinstalled Vbox. I let it delete the previous copy and install a fresh copy. After Vbox had finished installing, I ran the extension manager. Nothing changed. I will try a test overnight. I will use Revo Uninstaller to remove Vbox from the system and registry. I will use Wise365 to clean my system. I will reinstall Vbox and restart BOINC. If in the morning there are still problems with CMS, then I don't know what's going on. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
This was a clear hint, but you did not look in the right place: "Does your VirtualBox Manager show a yellow triangle?" Use VirtualBox Manager. To the right of Tools you will see a pin and three small lines. Select Media and remove CMS_2022_09_07_prod.vdi from the list, but don't delete the disk file itself. |
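For anyone who prefers to inspect the media registry from the command line, the sketch below may help. It is Python that only parses text in the layout printed by `VBoxManage list hdds` (blocks of `Key: value` lines separated by blank lines) and lists the registered media whose `Parent UUID` points at a given base disk. It does not call VirtualBox itself. The helper names, the sample UUIDs and the child path are made up for illustration, and the field layout is assumed from typical output, so check it against your own installation.

```python
# Hypothetical sample in the assumed layout of `VBoxManage list hdds`.
SAMPLE = """\
UUID:           11111111-1111-1111-1111-111111111111
Parent UUID:    base
Location:       D:\\data\\projects\\lhcathome.cern.ch_lhcathome\\CMS_2022_09_07_prod.vdi

UUID:           22222222-2222-2222-2222-222222222222
Parent UUID:    11111111-1111-1111-1111-111111111111
Location:       C:\\example\\child.vdi
"""


def parse_hdds(listing):
    """Split the listing into one dict per medium."""
    media = []
    for block in listing.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            # Split on the first colon only, so Windows paths stay intact.
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            media.append(entry)
    return media


def children_of(media, base_name):
    """Return media whose parent is a disk whose Location ends with base_name."""
    parents = {
        m["UUID"] for m in media
        if m.get("Location", "").endswith(base_name) and "UUID" in m
    }
    return [m for m in media if m.get("Parent UUID") in parents]


if __name__ == "__main__":
    for child in children_of(parse_hdds(SAMPLE), "CMS_2022_09_07_prod.vdi"):
        # Only print a candidate command for review; nothing is executed.
        print("VBoxManage closemedium disk", child["UUID"])
```

The script prints candidate commands rather than running them, on the understanding that a child medium has to be unregistered before its parent can be closed, which is what the "has 1 child media" error in the earlier post suggests. As in the advice above, nothing here deletes the disk file itself.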
Send message Joined: 27 Sep 08 Posts: 850 Credit: 692,823,409 RAC: 77,584 |
I had one on one of my PCs |
Send message Joined: 28 Dec 08 Posts: 339 Credit: 4,865,275 RAC: 129 |
"This was a clear hint, but you did not look in the right place: Does your VirtualBox Manager show a yellow triangle?" Thanks Crystal, that's something I did not know how to do. Even Atlas was all lit up in there. Not much of CMS was lit up, but I cleared it out. I'm off to work, so I'll check when I get home. |
Send message Joined: 28 Dec 08 Posts: 339 Credit: 4,865,275 RAC: 129 |
Question: What caused the problem with Vbox in the first place? --- Queue is full still from other projects. Maybe tonight there will be something from here. I'll keep an eye on it. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
"Question: What caused the problem with Vbox in the first place?" Could be several reasons. Most common with BOINC: - Suspend all VBox tasks at once with 'keep in memory' ticked off in BOINC preferences. - Stop BOINC client with several VBox tasks running. - Reboot the system without stopping BOINC properly. - Start several VBox tasks at once. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,165 RAC: 22,029 |
Ivan wrote on Feb. 14: "We've been having some problems lately as we prepare to allow multi-core jobs to be run in CMS@Home (you've probably noticed...). Unfortunately some of the configurations are beyond our control, and we have to request changes as we find problems and determine a potential fix for them." Ivan, any progress yet? |
Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
Ivan wrote on Feb. 14: "We've been having some problems lately as we prepare to allow multi-core jobs to be run in CMS@Home (you've probably noticed...). Unfortunately some of the configurations are beyond our control, and we have to request changes as we find problems and determine a potential fix for them." "Ivan, any progress yet?" We've got a workflow lined up for 4-core jobs, but it hasn't progressed to running status yet. I suspect the WMAgent is waiting for my current single-core jobs to run down, so I'm holding on to see if it does start later on today. If it does, people with LHC@Home-dev access can try to enable 4-core jobs in their computing preferences -- this option is not yet available for mainstream LHC@Home volunteers, so they will continue to run just single-core jobs if they are available. If you do have -dev membership and enable 4-core jobs, you will (at the moment) start a 4-core VM but it will only run a single-thread job. I have my home PC already set up to run 4-core -dev jobs; when (if...) I see 4-core jobs in the queue I will try to acquire one and let you know if it runs. If that doesn't fly, I'll have to submit a new batch of single-core jobs -- there may be a period with no jobs available if I don't juggle the submissions just so. Ah, Daniele has just submitted another 4-core workflow. It's currently in "staging" so it's just a matter of hurry up and wait. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
I think multi-core has not yet arrived. I'm not sure what I could see, because the consoles do not display useful info. I created a dual-core VM (not 4, because of other duties on that laptop), but I see only 1 cmsRun using 100% CPU and some other CPU usage from other processes. Total: 102% CPU after 24 minutes. |
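One rough way to read those numbers: if 100% corresponds to one fully busy core, the total CPU percentage divided by 100 estimates how many cores are in use, and dividing that by the VM's core count gives a utilisation fraction. A minimal Python sketch under that assumption follows; the function name is hypothetical.

```python
def vm_utilisation(total_cpu_percent, vm_cores):
    """Return (busy_cores, fraction_of_vm_used), ASSUMING 100 % == one core."""
    if vm_cores <= 0:
        raise ValueError("vm_cores must be positive")
    busy = total_cpu_percent / 100.0
    return busy, busy / vm_cores


if __name__ == "__main__":
    # Figures from the post: 102 % total CPU on a dual-core VM.
    busy, fraction = vm_utilisation(102.0, 2)
    print(f"{busy:.2f} cores busy, {fraction:.0%} of the VM")
```

With the figures from the post this comes to roughly one busy core, about half of the dual-core VM, which fits a single-threaded cmsRun plus a little overhead.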
Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
"I think multi-core has not yet arrived. I'm not sure what I could see, because the consoles do not display useful info." Yes, the 4-core batch didn't make it past "acquired" and into "running". I'm submitting smaller single-core job batches for the next few days while we work out why the multicore jobs didn't start. There may be disruptions if I don't arrange my waking hours to coincide with the need to submit new workflows... |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
This explains why I got only single-core jobs here and on -dev this afternoon, although the VMs were all configured to run 4 cores. |
Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
I've submitted a new batch of jobs, specifying Multicore=4 instead of 1. Unlike the "true" 4-core workflow that never got into "running" status, this batch has progressed that far and has 500 jobs "pending". If you have a CMS@Home-dev setup specifying 4-core VMs, please see if you get a multicore job while we investigate how HTCondor is coping with the new batch. Thanks. |
©2024 CERN