CMS@Home difficulties in attempts to prepare for multi-core jobs
Author | Message |
---|---|
Send message Joined: 7 Aug 11 Posts: 104 Credit: 25,221,969 RAC: 25,711 |
I'm still only getting single-core work units at this stage, though my profile is set for four cores (for Atlas jobs originally). I have a few to get through, so I'll just watch and see what pops up. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
I'll give this multi-core on the production server a try. The first three tasks had an error because the newly downloaded CMS_2022_09_07.vdi had the same UUID as the one from the dev system. I reset the dev project on my PC and removed the hard disks from the VirtualBox media. I also removed my app_config.xml to see what comes from the server without intervention. I had set 1 task and no limit on CPUs in my project preferences. Now the task started OK and after a while started processing internal jobs. A 24-core VM was created (no limit) and I see 2 processes cmsRun (each ~14% CPU) and 8 processes cmsExternalGene each consuming ~96% CPU. |
Send message Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
I see that one WU allocates 32 cores and then inside there are 6 cmsExternalGene processes using 1 core each. What is the expected maximum number inside? As CP says, it seems like maybe 8? |
Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 443 |
Quote: "I see that one WU allocates 32 cores and then inside there are 6 cmsExternalGene processes using 1 core each." As far as I know, the tasks that run the new 4-core jobs should run on 4 cores no matter how many above that number you have allowed in your locale preferences. My experience is that the main process, cmsRun, spawns four threads, each running cmsExternalGenerator, so in your "top" display (Alt-F3) you should see four cmsExternalGenerator processes running at nearly 100% each, with the occasional appearance of the cmsRun master process as it gets its share of the resources. |
Send message Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
OK, I've locked it down to 4 cores and 4.5 GB of memory. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
Quote: "My experience is that the main process, cmsRun, spawns four threads, each running cmsExternalGenerator, so in your "top" display (Alt-F3) you should see four cmsExternalGenerator processes running at nearly 100% each, with the occasional appearance of the cmsRun master process as it gets its share of the resources." Did you read my post? Especially the last sentence: "A 24-core VM was created (no limit) and I see 2 processes cmsRun (each ~14% CPU) and 8 processes cmsExternalGene each consuming ~96% CPU." The 2 cmsRun processes run constantly, using ~13-15% CPU, during the whole run of the 8 cmsExternalGenerator processes. |
Send message Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
@CP yes, I'm not sure why mine had 6 processes. 8 at 100% seems like it would need 8 cores, which is what I set it to initially; Ivan seemed to say that 4 was good. I also see that each WU allocates 30 GB of working set, so I have to think about how to get the scheduler to be OK with that. |
Send message Joined: 24 Oct 04 Posts: 1174 Credit: 54,887,670 RAC: 8,563 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=409980539 https://lhcathome.cern.ch/lhcathome/result.php?resultid=409923306 https://lhcathome.cern.ch/lhcathome/result.php?resultid=409860352 (all same host). I keep trying a clean install of the CMS multi here on another host that is exactly the same as the one that works, and it keeps giving me: Application: CMS Simulation 70.20 (vbox64); Name: CMS_607434_1713506482.423690; State: Downloading; Received: 4/18/2024 11:16:04 PM; Report deadline: 5/18/2024 11:16:02 PM; Estimated computation size: 1,000,000 GFLOPs; Executable: vboxwrapper_26206_windows_x86_64.exe |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
For the non believers: |
Send message Joined: 24 Oct 04 Posts: 1174 Credit: 54,887,670 RAC: 8,563 |
We have non-believers, CP? |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,923,727 RAC: 31,866 |
No jobs for several hours, but the automatic stop of task distribution does not seem to work :-( This causes thousands of useless tasks to be uploaded after about half an hour of runtime, without results for the science :-( |
Send message Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
I believe you; my observation was different. My question is: since the WUs don't actually use 24 or 32 cores, what is a good number to set to correct the misconfiguration? |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
@Toby: Thanks for your image. I think you have 8 cmsExternalGenerator processes too. I've also sometimes seen fewer than 8, but always when other processes are eating a lot of CPU, like cvmfs2 in your image. I've seen cvmfs2 processes using up to 500% CPU. The job processes are pushed lower on the 'top' list under that circumstance. Maybe we should set the number of CPUs to 4 in the preferences and, to be sure, use app_config.xml. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
The single-core tasks run for half an hour and then stop without having done anything useful: https://lhcathome.cern.ch/lhcathome/result.php?resultid=410051723 https://lhcathome.cern.ch/lhcathome/result.php?resultid=410055559 |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
At the moment there are only 4-core jobs in the backend queue. The single-core backend queue is empty. |
Send message Joined: 7 Aug 11 Posts: 104 Credit: 25,221,969 RAC: 25,711 |
The single core back end that's cached now at CERN Is completely gone they said The single core back end that's cached now at CERN Is completely gone ... And still They come! <to the tune of The Eve of the War - Jeff Wayne's War of the Worlds> |
Send message Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
Makes sense. Maybe the other processes are getting the next batch of work, and then it will go to 8. I set it to 8 cores, which seems to load up the CPU OK. |
Send message Joined: 24 Oct 04 Posts: 1174 Credit: 54,887,670 RAC: 8,563 |
It sure would be nice if single-core and multi-core were separate from each other in the settings. I had one host set to run 8 cores and it does this: https://lhcathome.cern.ch/lhcathome/result.php?resultid=409980514 And if I try 4 cores it switches back to CMS Simulation v70.20 (vbox64) windows_x86_64 (again just now). And then over at -dev they run what I want them to run, with CMS Simulation v60.70 (vbox64_mt_mcore_cms) windows_x86_64. (Another problem: I have three matching 8-core hosts, and some will run here and not at -dev, and others the exact opposite. I have tried complete clean reinstalls of everything; they will download the vdi and then the tasks just crash, so I have to keep track of which Theory or CMS will run on them from here and -dev.) |
Send message Joined: 7 Aug 11 Posts: 104 Credit: 25,221,969 RAC: 25,711 |
Reset the project, made sure it's set to use 4 cores, Atlas native is running ok on four cores (been playing with HDDs after I had a failure so there's some errored and aborted tasks in my records), Theory is as reliable as ever <sarcasm>, but CMS just won't grab any of the multi-core work but keeps getting single core jobs that supposedly aren't even in the queue. |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
This is (as of now) your most recently returned CMS task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=410078311 The VM was a 1-core VM: "2024-04-21 16:17:27 (3178473): Setting CPU Count for VM. (1)" The task ran the envelope but didn't get a CMS job, since the 1-core job queue is still dry. Be aware that the envelope queue and the job queue are different. The latter is much deeper in the process and has no direct connection to BOINC. A good indicator is to compare runtime with CPU time. Here: 33 min 40 sec vs. 2 min 9 sec. This means the VM tried a couple of times without success to get a job and finally gave up. Since the short runtimes confuse BOINC's work fetch algorithm, you will now get (in connection with a large work buffer) far too many CMS envelopes. Once the job queue starts sending jobs again, this may lead to a situation where your computer can't return all envelopes before the deadline. Hence, keep your work buffer as small as possible. |
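
The runtime versus CPU time comparison in the last post can be turned into a quick check. The following is only a sketch: the helper names and the 25% cut-off are hypothetical choices for illustration, not anything BOINC or the CMS project defines. It uses the two figures quoted above (33 min 40 sec of runtime, 2 min 9 sec of CPU time).

```python
def parse_duration(text):
    """Convert a string such as '33 min 40 sec' into seconds."""
    units = {"h": 3600, "min": 60, "sec": 1}
    tokens = text.split()
    total = 0
    # Tokens alternate between a number and its unit.
    for value, unit in zip(tokens[0::2], tokens[1::2]):
        total += int(value) * units[unit]
    return total


def cpu_efficiency(runtime_s, cpu_time_s, ncpus=1):
    """Fraction of the available core time that was spent computing."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return cpu_time_s / (runtime_s * ncpus)


# Hypothetical cut-off: below this, assume the envelope never got a job.
IDLE_THRESHOLD = 0.25


def looks_idle(runtime_s, cpu_time_s, ncpus=1, threshold=IDLE_THRESHOLD):
    """True when the task used so little CPU that it probably sat waiting."""
    return cpu_efficiency(runtime_s, cpu_time_s, ncpus) < threshold


runtime = parse_duration("33 min 40 sec")
cpu_time = parse_duration("2 min 9 sec")
print(f"efficiency: {cpu_efficiency(runtime, cpu_time):.1%}")
print("idle envelope?", looks_idle(runtime, cpu_time))
```

For a multi-core VM, pass the VM's core count as `ncpus` so that the CPU time is compared against the core time that was actually available.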
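
Several posts above suggest pinning CMS tasks to 4 cores and about 4.5 GB of memory through app_config.xml. As a rough sketch of what such a file might contain, the script below generates one. The app name `CMS`, the 4500 MB figure, and the choice of plan class are assumptions: the thread shows `vbox64` on the production server and `vbox64_mt_mcore_cms` at -dev, so the value must match whatever your client actually lists for the task. The `--nthreads` and `--memory_size_mb` options are passed to vboxwrapper through `<cmdline>`.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus, memory_mb):
    """Return the text of an app_config.xml limiting one app version."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Tells the BOINC client how many cores to budget for each task.
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Tells vboxwrapper how large to make the VM.
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {ncpus} --memory_size_mb {memory_mb}"
    )
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed values; adjust the app name and plan class to your own task list.
config_xml = build_app_config("CMS", "vbox64_mt_mcore_cms", ncpus=4, memory_mb=4500)
print(config_xml)
```

The output would be saved as app_config.xml in the project's directory under the BOINC data folder, after which the client has to re-read its config files (or be restarted) before the limits apply to newly started tasks.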
©2024 CERN