Message boards : CMS Application : Multithreading/Multicore?

microchip

Joined: 27 Jun 06
Posts: 8
Credit: 2,592,725
RAC: 2,536
Message 48981 - Posted: 6 Dec 2023, 0:56:51 UTC

Why is CMS not threaded/using multiple cores like ATLAS does? Running on a single core, it sometimes takes a whole day to complete a single CMS WU. Can it be made to run on multiple cores?
ID: 48981
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 48982 - Posted: 6 Dec 2023, 1:20:38 UTC - in response to Message 48981.  

60.70 (vbox64_mt_mcore_cms)
This is the test version on -dev, for Windows and Linux.
ID: 48982
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48983 - Posted: 6 Dec 2023, 7:31:43 UTC - in response to Message 48981.  

Multithreading/Multicore is not necessarily more efficient.
The CMS scientific app runs in singlecore mode by design and each subtask typically takes 3-4 h.
CMS VMs usually run a couple of them consecutively until the target time of 12 h is reached.
This is a soft target, meaning a VM sometimes runs only 1 (or 2 ...) subtask(s) for whatever reason, and sometimes the runtime extends beyond 12 h to allow a subtask in progress to finish.

The VMs sent out by -dev can be multicore VMs.
In the past there were tests to run n singlecore CMS subtasks concurrently on the same n-core VM (each subtask in a separate slot).
This was stopped long ago, and since then the backend queues have been configured to send only 1 subtask at a time to a VM, regardless of the VM's core count.
Hence, even though it is possible to run multicore CMS VMs on -dev, n-1 (with n>1) cores will just sit there doing nothing (except for a small fraction of a core used by the OS and peripheral helper processes).
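
Roughly, the loop inside the VM looks like this (an illustrative Python sketch only, not the actual VM code; the function names and numbers are made up):

TARGET_H = 12.0                      # soft runtime target per VM

def run_vm(fetch_subtask, run_subtask):
    elapsed_h = 0.0
    while elapsed_h < TARGET_H:      # target checked only between subtasks
        subtask = fetch_subtask()    # the backend sends one subtask at a time
        if subtask is None:          # no work available -> the VM finishes early
            break
        elapsed_h += run_subtask(subtask)  # an in-progress subtask may push
                                           # the total runtime past 12 h
    return elapsed_h

So with subtasks of 3-4 h a VM typically completes a few of them before it shuts down.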
ID: 48983
microchip

Joined: 27 Jun 06
Posts: 8
Credit: 2,592,725
RAC: 2,536
Message 48986 - Posted: 6 Dec 2023, 18:20:16 UTC

@computezrmle I have no idea what you just said. English?

@maeax Good to know! I don't run test apps, though. Will wait.
ID: 48986
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48987 - Posted: 6 Dec 2023, 19:54:24 UTC - in response to Message 48986.  

I have no idea what you just said.

OK, in short.
It is not planned to publish a multithreading/multicore CMS application.
ID: 48987
fastbunny

Joined: 7 Nov 23
Posts: 2
Credit: 37,618
RAC: 0
Message 49047 - Posted: 17 Dec 2023, 13:11:47 UTC

What is the reason for filling one task with multiple subtasks? I would very much prefer the work units to be shorter.

The reason I'm asking is that CMS tasks are problematic because they cannot be suspended. If you resume them after suspending, they will almost always fail, either with a generic compute error or with a small pop-up window saying 'breakpoint reached'. Do you know why this is so difficult for CMS tasks?
I had been wondering why I would sometimes still get credit for these tasks, but now I understand that some of the subtasks may still have been successfully completed earlier.

The result of this is that I only allow LHC@home to get new work if I know I can leave my computer running for 12 hours or more, which is not often. Otherwise I cannot reliably complete the tasks, because I cannot shut down and resume. I have tried suspending BOINC first, then checking VirtualBox to see whether the VMs have paused and saved correctly, but I'd still guess >95% of work units fail when resumed after a suspend.

I guess you could consider this post a friendly request either to make the work units shorter or to make them handle pausing and resuming properly.
ID: 49047
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 49048 - Posted: 17 Dec 2023, 18:56:13 UTC - in response to Message 49047.  

What is the reason for filling one task with multiple subtasks?

Mainly to update the CVMFS cache only once, at the beginning of the task, and reuse the data for a couple of subtasks.
In addition, each task runs some benchmarks at the beginning to get an idea of how many subtasks can be run within the 12 h target.
That target plus a 6 h grace period results in the 18 h max task runtime.
At the end of the 18 h runtime a task is cancelled forcefully, which prevents tasks that got stuck from running forever.

In the end, this is a compromise between the scientific apps, which were developed to run 24/7 in a datacenter environment, and the BOINC environment.
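
As a back-of-the-envelope example (my own numbers, just to illustrate the arithmetic in Python):

target_h = 12                       # soft runtime target
grace_h = 6                         # grace period for a subtask in progress
max_runtime_h = target_h + grace_h  # = 18 h, after that the task is killed

benchmarked_subtask_h = 3.5         # assumed benchmark result for this host
est_subtasks = int(target_h // benchmarked_subtask_h)
print(max_runtime_h, est_subtasks)  # -> 18 3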


I would very much prefer the work units to be shorter.

The scientists would prefer subtasks with even more events, which would lead to longer runtimes.
ATM the average core time per subtask is 2-4 h.


CMS tasks are problematic because they cannot be suspended

CMS allows tasks to be suspended for up to 2 h.
After this period the subtask is marked as lost by the backend systems.


If you resume them after suspending, they will almost always fail, either with a generic compute error, or with a small pop-up window saying 'breakpoint reached'.

Very unusual.
Since BOINC suspends the whole VM, it should continue from the same point.
You may check whether "leave non GPU apps in RAM if suspended" is enabled.
In addition, the task's runtime should already be longer than the global checkpoint interval set for your BOINC client.
Also a good idea: make your computers visible and post links to example tasks that are marked as failed.


but still I guess >95% of work units fail when you resume after suspend

You don't mean 95 % of all tasks sent out by the server, do you?
ID: 49048
fastbunny

Joined: 7 Nov 23
Posts: 2
Credit: 37,618
RAC: 0
Message 49049 - Posted: 17 Dec 2023, 19:59:40 UTC - in response to Message 49048.  

Thank you for your reply. It is useful to know that a task can be suspended for up to two hours but not any longer. My 95% (guesstimate) failure rate with suspended tasks is indeed with tasks I have suspended for a whole night, for example. I did not mean that 95% of tasks in general fail, only that when they're suspended for a while, 95% of those will fail after resuming. It is now clear why that happens.

I have enabled the 'leave in RAM' option now; thanks for the advice. The checkpoint interval was already set to 60 seconds, which should be fine, I think.
ID: 49049
microchip

Joined: 27 Jun 06
Posts: 8
Credit: 2,592,725
RAC: 2,536
Message 49050 - Posted: 18 Dec 2023, 17:24:22 UTC

I have no problem with suspending CMS units. My desktop is usually up 24/7 and I only occasionally reboot due to some update that requires it (e.g., systemd) or a new kernel. CMS happily resumes after the system is back up. I've never had a task error out because it was suspended and then resumed. This is on 2 different Linux systems.
ID: 49050
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,941,165
RAC: 22,029
Message 49052 - Posted: 19 Dec 2023, 6:34:46 UTC - in response to Message 49050.  

I have no problem with suspending CMS units. My desktop is usually up 24/7 and I only occasionally reboot due to some update that requires it
I assume that by doing this, your tasks are suspended only for a short time.
As computezrmle wrote above, suspending for up to 2 hours should not be a problem anyway.
ID: 49052
