Message boards : CMS Application : CMS@Home difficulties in attempts to prepare for multi-core jobs

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49551 - Posted: 14 Feb 2024, 10:09:07 UTC

We've been having some problems lately as we prepare to allow multi-core jobs to be run in CMS@Home (you've probably noticed...). Unfortunately some of the configurations are beyond our control, and we have to request changes as we find problems and determine a potential fix for them.
We ask for your patience at this time while we work through the difficulties, and would fully understand if you chose to pause your participation in the project while we try to get on top of things.
ID: 49551
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 49552 - Posted: 14 Feb 2024, 12:36:01 UTC

We did have multi-core CMS tasks in the past.
However, at that time a 4-core VM (for example) ran 4 CMS jobs at the same time, and that is not very useful.
The only useful multi-core application would be 1 CMS job using more cores/threads to speed up that single job.
ID: 49552
maeax

Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49553 - Posted: 14 Feb 2024, 13:51:21 UTC - in response to Message 49552.  

This is an example of a CMS task running now:
stderr:
INFO:root:RUNNING SCRAM SCRIPTS
INFO:root:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_11_0_0_pre1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
INFO:root:PSS: 701620; RSS: 700240; PCPU: 85.6; PMEM: 34.4
INFO:root:PSS: 541572; RSS: 540820; PCPU: 90.8; PMEM: 26.5
INFO:root:PSS: 612008; RSS: 610932; PCPU: 92.8; PMEM: 30.0
INFO:root:PSS: 618280; RSS: 617404; PCPU: 93.8; PMEM: 30.3
INFO:root:PSS: 642996; RSS: 642180; PCPU: 94.4; PMEM: 31.5
INFO:root:PSS: 670308; RSS: 669424; PCPU: 94.9; PMEM: 32.8
INFO:root:PSS: 671120; RSS: 670496; PCPU: 95.1; PMEM: 32.9
INFO:root:PSS: 680184; RSS: 679308; PCPU: 95.4; PMEM: 33.3
INFO:root:PSS: 680608; RSS: 679832; PCPU: 95.6; PMEM: 33.4
INFO:root:PSS: 680972; RSS: 680132; PCPU: 95.7; PMEM: 33.4
INFO:root:PSS: 682196; RSS: 681960; PCPU: 95.9; PMEM: 33.5
INFO:root:PSS: 683192; RSS: 681560; PCPU: 96.0; PMEM: 33.4
INFO:root:PSS: 682448; RSS: 681656; PCPU: 96.0; PMEM: 33.4
ATLAS has multi-core.
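For anyone who wants to check a log like this without reading every line, here is a minimal Python sketch. It assumes the monitor lines always have the `PSS: ...; RSS: ...; PCPU: ...; PMEM: ...` layout quoted above, and that PCPU is a per-process CPU percentage where one fully used core is about 100. The function names and the 100% threshold are my own choices, not anything from the CMS software.

```python
import re

# Pattern for the monitor lines quoted above, e.g.
# "INFO:root:PSS: 701620; RSS: 700240; PCPU: 85.6; PMEM: 34.4"
LINE_RE = re.compile(
    r"PSS:\s*(\d+);\s*RSS:\s*(\d+);\s*PCPU:\s*([\d.]+);\s*PMEM:\s*([\d.]+)"
)


def parse_monitor_lines(text):
    """Return one dict per monitor line found in the text."""
    samples = []
    for match in LINE_RE.finditer(text):
        pss, rss, pcpu, pmem = match.groups()
        samples.append(
            {"pss": int(pss), "rss": int(rss),
             "pcpu": float(pcpu), "pmem": float(pmem)}
        )
    return samples


def looks_single_threaded(samples, threshold=100.0):
    """Heuristic (assumption): a job that never exceeds ~100% CPU uses at most one core."""
    return bool(samples) and max(s["pcpu"] for s in samples) <= threshold


# Three of the lines from the stderr excerpt above.
LOG = """\
INFO:root:PSS: 701620; RSS: 700240; PCPU: 85.6; PMEM: 34.4
INFO:root:PSS: 541572; RSS: 540820; PCPU: 90.8; PMEM: 26.5
INFO:root:PSS: 682448; RSS: 681656; PCPU: 96.0; PMEM: 33.4
"""

samples = parse_monitor_lines(LOG)
peak_cpu = max(s["pcpu"] for s in samples)
print(f"samples={len(samples)} peak PCPU={peak_cpu}% "
      f"single-threaded={looks_single_threaded(samples)}")
```

On the excerpt above the peak stays below 100%, which is what you would expect from a single-threaded cmsRun. A genuinely 4-thread job should climb well past 100% if the assumption about PCPU holds.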
ID: 49553
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49554 - Posted: 14 Feb 2024, 14:17:54 UTC - in response to Message 49552.  

We did have multi-core CMS tasks in the past.
However, at that time a 4-core VM (for example) ran 4 CMS jobs at the same time, and that is not very useful.
The only useful multi-core application would be 1 CMS job using more cores/threads to speed up that single job.

Yes, that wasn't very efficient -- I found that running just a 2-core VM with two individual cmsRun jobs was the most efficient. CMS tends to want to run 4-thread jobs in a 4-core VM these days (maybe even x8) so that's why we want to try to get that running. Coordinating with all the different levels of configuration is the main problem.
ID: 49554
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,879
Message 49571 - Posted: 15 Feb 2024, 18:06:35 UTC

What is this all about?

VBoxManage -q closemedium "D:\data/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi"
Output:
VBoxManage.exe: error: Cannot close medium 'D:\data\projects\lhcathome.cern.ch_lhcathome\CMS_2022_09_07_prod.vdi' because it has 1 child media
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp

2024-02-15 18:53:38 (25156): Could not create VM
2024-02-15 18:53:38 (25156): ERROR: VM failed to start
2024-02-15 18:53:38 (25156): Powering off VM.
2024-02-15 18:53:38 (25156): Deregistering VM. (boinc_418a52e6b5534c75, slot#26)
2024-02-15 18:53:38 (25156): Removing network bandwidth throttle group from VM.
2024-02-15 18:53:39 (25156): Removing VM from VirtualBox.

Every single CMS task I get bombs like this.
I think it's time to take a break from this project until you guys can figure out what's going on.
My RAC has tanked to almost nothing and I have over 60 errors so far between CMS and Theory.
ID: 49571
maeax

Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49572 - Posted: 15 Feb 2024, 18:13:45 UTC - in response to Message 49571.  

Does your VirtualBox Manager show a yellow triangle?
This child media error normally comes from that.
ID: 49572
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,879
Message 49573 - Posted: 15 Feb 2024, 22:42:55 UTC - in response to Message 49572.  

Does your VirtualBox Manager show a yellow triangle?
This child media error normally comes from that.


No. I just reinstalled it this morning (EU time), as well as the extension pack.
The last successful task was 5 days ago and then everything went to hell.
But that was a Theory task.
Last CMS to complete ok was 9 February and nothing since then has completed.

At around 0830 CET I reinstalled Vbox. I let it delete the previous copy and install a fresh copy.
After Vbox was finished with the install I ran extension manager.
Nothing changed.

I will try a test overnight. I will use Revo Uninstaller to remove Vbox from the system and registry.
I will use Wise365 to clean my system.
I will reinstall Vbox and restart BOINC.

If there are still problems with CMS in the morning, then I don't know what's going on.
ID: 49573
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 49574 - Posted: 16 Feb 2024, 7:08:00 UTC - in response to Message 49573.  

Does your VirtualBox Manager show a yellow triangle?
This child media error normally comes from that.
This was a clear hint, but you did not look in the right place.
Use VirtualBox Manager. To the right of Tools you see a pin and three small lines.
Click the three lines, select Media and remove CMS_2022_09_07_prod.vdi from the list, but don't delete the disk file itself.
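If you prefer the command line, the same check can be sketched in Python. This is only a rough sketch: it assumes `VBoxManage list hdds` prints blocks of `Key: value` lines separated by blank lines, with `UUID`, `Parent UUID` and `Location` fields. The sample text below is trimmed and illustrative, and its UUIDs and the child disk path are made up. The script only prints the commands it would suggest and never runs them.

```python
import subprocess


def parse_hdds(text):
    """Parse `VBoxManage list hdds` style output into a list of dicts."""
    media = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            # Split on the first colon only, so Windows paths like D:\... survive.
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            media.append(entry)
    return media


def children_of(media, base_filename):
    """Return the media whose parent is the disk stored as base_filename."""
    base_ids = {m["UUID"] for m in media
                if m.get("Location", "").endswith(base_filename)}
    return [m for m in media if m.get("Parent UUID") in base_ids]


def read_hdds():
    """Run VBoxManage (must be on PATH). Not called automatically here."""
    result = subprocess.run(["VBoxManage", "list", "hdds"],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative sample only: UUIDs and the child path are hypothetical.
SAMPLE = r"""
UUID:           11111111-1111-1111-1111-111111111111
Parent UUID:    base
State:          created
Location:       D:\data\projects\lhcathome.cern.ch_lhcathome\CMS_2022_09_07_prod.vdi

UUID:           22222222-2222-2222-2222-222222222222
Parent UUID:    11111111-1111-1111-1111-111111111111
State:          inaccessible
Location:       D:\hypothetical\slot\child.vdi
"""

children = children_of(parse_hdds(SAMPLE), "CMS_2022_09_07_prod.vdi")
for child in children:
    # Child media have to be closed before the parent can be closed.
    print(f"VBoxManage closemedium disk {child['UUID']}")
```

On a real host you would call `children_of(parse_hdds(read_hdds()), "CMS_2022_09_07_prod.vdi")` instead of using the sample. Without the `--delete` option, `closemedium` only removes the entry from the media registry and leaves the file on disk, which matches the advice above.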
ID: 49574
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 808
Credit: 652,863,129
RAC: 280,216
Message 49575 - Posted: 16 Feb 2024, 7:09:31 UTC - in response to Message 49574.  

I had one on one of my PCs.
ID: 49575
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,879
Message 49576 - Posted: 16 Feb 2024, 7:14:44 UTC - in response to Message 49574.  

Does your VirtualBox Manager show a yellow triangle?
This child media error normally comes from that.
This was a clear hint, but you did not look in the right place.
Use VirtualBox Manager. To the right of Tools you see a pin and three small lines.
Click the three lines, select Media and remove CMS_2022_09_07_prod.vdi from the list, but don't delete the disk file itself.



Thanks Crystal, that's something I did not know how to do.
Even Atlas was all lit up in there.
Not much of CMS was lit up, but I cleared it out.
I'm off to work, so I'll check when I get home.
ID: 49576
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,879
Message 49583 - Posted: 16 Feb 2024, 13:54:04 UTC - in response to Message 49576.  

Question: What caused the problem with Vbox in the first place?
---

The queue is still full from other projects. Maybe tonight there will be something from here.
I'll keep an eye on it.
ID: 49583
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 49584 - Posted: 16 Feb 2024, 14:32:40 UTC - in response to Message 49583.  

Question: What caused the problem with Vbox in the first place?
Could be several reasons. Most common with BOINC:

- Suspend all VBox tasks at once with 'keep in memory' ticked off in BOINC preferences.
- Stop BOINC client with several VBox tasks running.
- Reboot the system without stopping BOINC properly.
- Start several VBox tasks at once.
ID: 49584
Erich56

Joined: 18 Dec 15
Posts: 1690
Credit: 104,044,175
RAC: 121,972
Message 49664 - Posted: 27 Feb 2024, 7:28:08 UTC - in response to Message 49551.  

Ivan wrote on Feb. 14:
We've been having some problems lately as we prepare to allow multi-core jobs to be run in CMS@Home (you've probably noticed...). Unfortunately some of the configurations are beyond our control, and we have to request changes as we find problems and determine a potential fix for them.
We ask for your patience at this time while we work through the difficulties, and would fully understand if you chose to pause your participation in the project while we try to get on top of things.
Ivan, any progress yet?
ID: 49664
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49774 - Posted: 15 Mar 2024, 15:43:57 UTC - in response to Message 49664.  

Ivan wrote on Feb. 14:
We've been having some problems lately as we prepare to allow multi-core jobs to be run in CMS@Home (you've probably noticed...). Unfortunately some of the configurations are beyond our control, and we have to request changes as we find problems and determine a potential fix for them.
We ask for your patience at this time while we work through the difficulties, and would fully understand if you chose to pause your participation in the project while we try to get on top of things.
Ivan, any progress yet?

We've got a workflow lined up for 4-core jobs, but it hasn't progressed to running status yet. I suspect the WMAgent is waiting for my current single-core jobs to run down, so I'm holding on to see if it does start later on today. If it does, people with LHC@Home-dev access can try to enable 4-core jobs in their computing preferences -- this option is not yet available for mainstream LHC@Home volunteers, so they will continue to run just single-core jobs if they are available. If you do have -dev membership and enable 4-core jobs, you will (at the moment) start a 4-core VM but it will only run a single-thread job.
I have my home PC already set up to run 4-core -dev jobs; when (if...) I see 4-core jobs in the queue I will try to acquire one and let you know if it runs. If that doesn't fly, I'll have to submit a new batch of single-core jobs -- there may be a period with no jobs available if I don't juggle the submissions just so.
Ah, Daniele has just submitted another 4-core workflow. It's currently in "staging" so it's just a matter of hurry up and wait.
ID: 49774
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 49775 - Posted: 15 Mar 2024, 18:27:35 UTC

I think multi-core has not arrived yet. I'm not sure what I could see, because the consoles do not display useful info.
I created a dual-core VM (not 4, because of other duties on that laptop), but I see only 1 cmsRun using 100% CPU and some other CPU usage from other processes.
Total 102% CPU after 24 minutes.
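As a quick sanity check on those numbers, here is a small Python sketch. It assumes the total CPU figure is reported the way `top` does it, where one fully busy core counts as 100%. The function name is just for illustration.

```python
def core_utilisation(total_cpu_percent, allocated_cores):
    """Return (busy cores, fraction of the VM's cores in use)."""
    busy_cores = total_cpu_percent / 100.0
    return busy_cores, busy_cores / allocated_cores


# Figures from the post above: 102% total CPU on a dual-core VM.
busy, fraction = core_utilisation(102.0, 2)
print(f"{busy:.2f} cores busy, {fraction:.0%} of the VM")
```

Roughly one core busy out of two allocated is what a single-threaded job plus a little background activity would look like, so the second core is essentially idle.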
ID: 49775
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49780 - Posted: 16 Mar 2024, 15:10:22 UTC - in response to Message 49775.  

I think multi-core has not arrived yet. I'm not sure what I could see, because the consoles do not display useful info.
I created a dual-core VM (not 4, because of other duties on that laptop), but I see only 1 cmsRun using 100% CPU and some other CPU usage from other processes.
Total 102% CPU after 24 minutes.

Yes, the 4-core batch didn't make it past "acquired" and into "running". I'm submitting smaller single-core job batches for the next few days while we work out why the multicore jobs didn't start. There may be disruptions if I don't arrange my waking hours to coincide with the need to submit new workflows...
ID: 49780
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,631,014
RAC: 130,464
Message 49781 - Posted: 16 Mar 2024, 15:50:40 UTC - in response to Message 49780.  

This explains why I got only single-core jobs here and on -dev this afternoon, although the VMs were all configured to run 4 cores.
ID: 49781
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49787 - Posted: 17 Mar 2024, 18:52:19 UTC

...the best-laid plans...
New single-core jobs on the way.
ID: 49787
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49814 - Posted: 22 Mar 2024, 15:37:35 UTC

I've submitted a new batch of jobs, specifying Multicore=4 instead of 1. Unlike the "true" 4-core workflow that never got into "running" status, this batch has progressed that far and has 500 jobs "pending". If you have a CMS@Home-dev setup specifying 4-core VMs, please see if you get a multicore job while we investigate how HTCondor is coping with the new batch. Thanks.
ID: 49814
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49815 - Posted: 22 Mar 2024, 16:01:20 UTC - in response to Message 49814.  

OK, the new batch is "running" with 500 jobs pending. My -dev machine is running a 4-core VM but is only running a single-core job. There are ~400 single-core jobs still pending so we'll have to see what happens when that queue dries up.
ID: 49815


©2024 CERN