CMS@Home difficulties in attempts to prepare for multi-core jobs

Author	Message
Dark Angel Send message Joined: 7 Aug 11 Posts: 93 Credit: 21,875,393 RAC: 8,932	Message 49963 - Posted: 18 Apr 2024, 2:50:32 UTC I'm still only getting single-core work units at this stage, though my profile is set for four cores (for Atlas jobs originally). I have a few to get through, so I'll just watch and see what pops up. ID: 49963 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1291 Credit: 8,544,599 RAC: 3,954	Message 49965 - Posted: 18 Apr 2024, 9:33:14 UTC Last modified: 18 Apr 2024, 9:34:27 UTC I'll give multi-core on the production server a try. The first three tasks errored out because the newly downloaded CMS_2022_09_07.vdi had the same UUID as the one from the dev system. I reset the dev project on my PC and removed the hard disks from the VirtualBox media. I also removed my app_config.xml to see what comes from the server without intervention. I had set 1 task and no limit on CPUs in my project preferences. Now the task started OK and after a while started processing internal jobs. A 24-core VM was created (no limit) and I see 2 processes cmsRun (each ~14% CPU) and 8 processes cmsExternalGene each consuming ~96% CPU. ID: 49965 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 810 Credit: 655,376,811 RAC: 213,017	Message 49968 - Posted: 18 Apr 2024, 16:11:01 UTC I see that one WU allocates 32 cores and then inside there are 6 cmsExternalGene processes using 1 core each. What is the expected maximum number inside? As CP says, it seems like maybe 8? ID: 49968 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1009 Credit: 6,290,161 RAC: 1,282	Message 49969 - Posted: 18 Apr 2024, 17:42:38 UTC - in response to Message 49968. I see that one WU allocates 32 cores and then inside there are 6 cmsExternalGene processes using 1 core each. What is the expected maximum number inside? As CP says, it seems like maybe 8? As far as I know, the tasks that run the new 4-core jobs should run on 4 cores no matter how many above that number you have allowed in your locale preferences. My experience is that the main process, cmsRun, spawns four threads, each running cmsExternalGenerator, so in your "top" display (Alt-F3) you should see four cmsExternalGenerator processes running at nearly 100% each, with the occasional appearance of the cmsRun master process as it gets its share of the resources. ID: 49969 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 810 Credit: 655,376,811 RAC: 213,017	Message 49970 - Posted: 18 Apr 2024, 18:04:40 UTC - in response to Message 49969. OK, I lock it down to 4 cores and 4.5 GB of memory ID: 49970 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1291 Credit: 8,544,599 RAC: 3,954	Message 49971 - Posted: 18 Apr 2024, 18:52:11 UTC - in response to Message 49969. Last modified: 18 Apr 2024, 18:56:36 UTC My experience is that the main process, cmsRun, spawns four threads, each running cmsExternalGenerator, so in your "top" display (Alt-F3) you should see four cmsExternalGenerator processes running at nearly 100% each, with the occasional appearance of the cmsRun master process as it gets its share of the resources. Did you read my post? Especially the last sentence: A 24-core VM was created (no limit) and I see 2 processes cmsRun (each ~14% CPU) and 8 processes cmsExternalGene each consuming ~96% CPU. The 2 cmsRun processes run constantly at ~13-15% CPU during the whole run of the 8 cmsExternalGenerator processes. ID: 49971 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 810 Credit: 655,376,811 RAC: 213,017	Message 49972 - Posted: 19 Apr 2024, 5:39:21 UTC - in response to Message 49971. @CP yes, I'm not sure why mine had 6 processes. 8 at 100% seems like it would need 8 cores, which is what I set it to initially; Ivan seemed to say that 4 was good. I additionally see that each WU allocates 30 GB of working set, so I have to think about how to get the scheduler to be OK with that. ID: 49972 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1130 Credit: 49,839,863 RAC: 8,537	Message 49973 - Posted: 19 Apr 2024, 5:55:04 UTC Last modified: 19 Apr 2024, 6:25:57 UTC https://lhcathome.cern.ch/lhcathome/result.php?resultid=409980539 https://lhcathome.cern.ch/lhcathome/result.php?resultid=409923306 https://lhcathome.cern.ch/lhcathome/result.php?resultid=409860352 (all same host) I keep trying a clean install of the CMS multi here on another host that is exactly the same as the one that works, and it keeps giving me: Application: CMS Simulation 70.20 (vbox64); Name: CMS_607434_1713506482.423690; State: Downloading; Received: 4/18/2024 11:16:04 PM; Report deadline: 5/18/2024 11:16:02 PM; Estimated computation size: 1,000,000 GFLOPs; Executable: vboxwrapper_26206_windows_x86_64.exe ID: 49973 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1291 Credit: 8,544,599 RAC: 3,954	Message 49974 - Posted: 19 Apr 2024, 7:46:33 UTC Last modified: 19 Apr 2024, 8:26:29 UTC For the non believers: ID: 49974 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1130 Credit: 49,839,863 RAC: 8,537	Message 49975 - Posted: 19 Apr 2024, 10:21:00 UTC - in response to Message 49974. We have non believers CP ? ID: 49975 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,860,854 RAC: 76,036	Message 49976 - Posted: 19 Apr 2024, 16:09:39 UTC No jobs for several hours, but the automatic stop of task distribution does not seem to work :-( This is causing thousands of useless tasks to be uploaded after about half an hour of runtime, without results for the science :-( ID: 49976 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 810 Credit: 655,376,811 RAC: 213,017	Message 49977 - Posted: 19 Apr 2024, 18:36:52 UTC - in response to Message 49974. Last modified: 20 Apr 2024, 7:30:23 UTC I believe you; my observation was different. My question is: since the WUs don't actually use 24 or 32 cores, what is a good number to correct the misconfiguration? ID: 49977 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1291 Credit: 8,544,599 RAC: 3,954	Message 49979 - Posted: 20 Apr 2024, 9:02:11 UTC - in response to Message 49977. @Toby: Thanks for your image. I think you have 8 cmsExternalGenerator processes too. I've also sometimes seen fewer than 8, but always when other processes are eating a lot of CPU, like cvmfs2 in your image. I've seen cvmfs2 processes using up to 500% CPU. The job processes are pushed lower on the 'top' list in that case. Maybe we should set the number of CPUs to 4 in preferences and, to be sure, use app_config.xml. ID: 49979 · Reply Quote
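
A minimal sketch of the app_config.xml route suggested above, written as a small Python script that generates the file. Several things here are assumptions and not confirmed in this thread: the app name "CMS", the plan class "vbox64_mt_mcore_cms" (this is the one reported from -dev further down; the production plan class may differ), and passing "--nthreads" to the wrapper through the cmdline element. Check client_state.xml on your own host for the real names before using anything like this.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int) -> str:
    """Return the text of an app_config.xml that pins one app version to ncpus cores.

    app_name and plan_class are placeholders supplied by the caller;
    they must match what the project actually sends to your host.
    """
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Number of cores the BOINC client budgets for each task.
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumption: the wrapper accepts --nthreads to size the VM.
    ET.SubElement(version, "cmdline").text = f"--nthreads {ncpus}"
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values; 4 cores as discussed in this thread.
    print(build_app_config("CMS", "vbox64_mt_mcore_cms", 4))
```

The output would go into app_config.xml in the project's directory under the BOINC data folder, followed by re-reading the config files from the BOINC Manager. Running tasks usually keep the core count they started with.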

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1291 Credit: 8,544,599 RAC: 3,954	Message 49980 - Posted: 20 Apr 2024, 9:54:12 UTC The single-core tasks run for half an hour and then stop without having done anything useful: https://lhcathome.cern.ch/lhcathome/result.php?resultid=410051723 https://lhcathome.cern.ch/lhcathome/result.php?resultid=410055559 ID: 49980 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2435 Credit: 228,142,882 RAC: 122,954	Message 49981 - Posted: 20 Apr 2024, 10:20:31 UTC - in response to Message 49980. ATM there are only 4-core jobs in the backend queue. The singlecore backend queue is empty. ID: 49981 · Reply Quote

Dark Angel Send message Joined: 7 Aug 11 Posts: 93 Credit: 21,875,393 RAC: 8,932	Message 49982 - Posted: 20 Apr 2024, 11:46:24 UTC The single core back end that's cached now at CERN Is completely gone they said The single core back end that's cached now at CERN Is completely gone ... And still They come! <to the tune of The Eve of the War - Jeff Wayne's War of the Worlds> ID: 49982 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 810 Credit: 655,376,811 RAC: 213,017	Message 49983 - Posted: 20 Apr 2024, 13:57:53 UTC - in response to Message 49979. Makes sense; maybe the other processes are getting the next batch of work, and then it will go to 8. I set it to 8 cores, which seems to load up the CPU OK. ID: 49983 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1130 Credit: 49,839,863 RAC: 8,537	Message 49984 - Posted: 21 Apr 2024, 0:04:39 UTC Last modified: 21 Apr 2024, 0:06:09 UTC It sure would be nice if single-core and multi-core were made separate from each other in the settings. I had one host set to run 8 cores and it does this: https://lhcathome.cern.ch/lhcathome/result.php?resultid=409980514 And if I try 4 cores it switches back to CMS Simulation v70.20 (vbox64) windows_x86_64 (again just now). Then over at -dev they run what I want them to run, with CMS Simulation v60.70 (vbox64_mt_mcore_cms) windows_x86_64. (Another problem is that I have three matching 8-core hosts, and some will run here and not at -dev, or the exact opposite. I have tried complete clean reinstalls of everything, and they will download the vdi and then the tasks just crash... so I have to keep track of which Theory or CMS will run on them from here and -dev.) ID: 49984 · Reply Quote

Dark Angel Send message Joined: 7 Aug 11 Posts: 93 Credit: 21,875,393 RAC: 8,932	Message 49987 - Posted: 21 Apr 2024, 6:16:24 UTC Reset the project and made sure it's set to use 4 cores. Atlas native is running OK on four cores (I've been playing with HDDs after I had a failure, so there are some errored and aborted tasks in my records), Theory is as reliable as ever <sarcasm>, but CMS just won't grab any of the multi-core work and keeps getting single-core jobs that supposedly aren't even in the queue. ID: 49987 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2435 Credit: 228,142,882 RAC: 122,954	Message 49988 - Posted: 21 Apr 2024, 7:09:57 UTC - in response to Message 49987. This is (as of now) your most recently returned CMS task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=410078311 The VM was a 1-core VM: 2024-04-21 16:17:27 (3178473): Setting CPU Count for VM. (1) The task ran the envelope but didn't get a CMS job, since the 1-core job queue is still dry. Be aware that the envelope queue and the job queue are different. The latter is much deeper in the process and has no direct connection to BOINC. A good indicator is to compare runtime with CPU time. Here: 33 min 40 sec vs. 2 min 9 sec. This means the VM tried a couple of times without success to get a job and finally gave up. Since the short runtimes confuse BOINC's work fetch algorithm, you will now get (in connection with a large work buffer) far too many CMS envelopes. Once the job queue starts sending jobs again, this may lead to a situation where your computer can't return all envelopes before the deadline. Hence, keep your work buffer as small as possible. ID: 49988 · Reply Quote
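
The runtime versus CPU time comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names and the 20% threshold are hypothetical choices made for the example, not anything the project or BOINC defines. The two durations are the ones quoted in the post (33 min 40 sec runtime, 2 min 9 sec CPU time).

```python
def cpu_efficiency(runtime_s: float, cpu_time_s: float, ncpus: int = 1) -> float:
    """Fraction of the available core time that the task actually used."""
    if runtime_s <= 0 or ncpus <= 0:
        raise ValueError("runtime_s and ncpus must be positive")
    return cpu_time_s / (runtime_s * ncpus)


def looks_idle(runtime_s: float, cpu_time_s: float, ncpus: int = 1,
               threshold: float = 0.2) -> bool:
    """Guess whether an envelope ran without getting a job.

    The threshold is an arbitrary value for this sketch; adjust to taste.
    """
    return cpu_efficiency(runtime_s, cpu_time_s, ncpus) < threshold


if __name__ == "__main__":
    runtime = 33 * 60 + 40   # 33 min 40 sec
    cpu_time = 2 * 60 + 9    # 2 min 9 sec
    print(f"efficiency: {cpu_efficiency(runtime, cpu_time):.1%}")
    print("probably got no job" if looks_idle(runtime, cpu_time) else "did real work")
```

For the task above the ratio comes out at roughly 6%, which fits the description of a VM that waited for a job and gave up. For a multi-core task, pass the VM's core count as ncpus so that a busy 4-core job is compared against four cores' worth of time.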
