Multithreading/Multicore?

microchip Joined: 27 Jun 06 Posts: 10 Credit: 3,216,130 RAC: 4,551	Message 48981 - Posted: 6 Dec 2023, 0:56:51 UTC
Why is CMS not multithreaded, using multiple cores like ATLAS does? On a single core it sometimes takes a whole day to complete a single CMS WU. Can it be made to run on multiple cores?

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 30	Message 48982 - Posted: 6 Dec 2023, 1:20:38 UTC - in response to Message 48981.
60.70 (vbox64_mt_mcore_cms)
This is the test version in -dev, for Windows and Linux.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 277,013,883 RAC: 145,659	Message 48983 - Posted: 6 Dec 2023, 7:31:43 UTC - in response to Message 48981.
Multithreading/multicore is not necessarily more efficient. The CMS scientific app runs in single-core mode by design, and each subtask typically takes 3-4 h. CMS VMs usually run a few of them consecutively until the target time of 12 h is reached. This is a weak target: sometimes a VM runs only 1 (or 2 ...) subtask(s) for whatever reason, and sometimes the runtime extends beyond 12 h to allow a subtask in progress to finish.
The VMs sent out by -dev can be multicore VMs. In the past there were tests to run n single-core CMS subtasks concurrently on the same n-core VM (each subtask in a separate slot). These were stopped long ago, and since then the backend queues have been configured to send only 1 subtask to a VM regardless of the VM's core count. Hence, even though it is possible to run multicore CMS VMs on -dev, n-1 of the n CPUs (with n>1) will just sit there doing nothing (except for a small fraction of a core used by the OS and peripheral helper processes).
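
A minimal sketch of the behaviour described in the post above, for readers who prefer numbers. The function names and the rule "start a new subtask only while elapsed time is below the target" are assumptions made for illustration; the real CMS backend logic is not shown in this thread and may differ.

```python
def idle_core_fraction(vm_cores: int) -> float:
    """Fraction of a VM's cores left idle when only one single-core subtask runs.

    Ignores the small share of a core used by the OS and helper processes.
    """
    if vm_cores < 1:
        raise ValueError("vm_cores must be >= 1")
    return (vm_cores - 1) / vm_cores


def subtasks_run(subtask_hours: list[float], target_hours: float = 12.0) -> int:
    """Count subtasks a VM starts under a weak runtime target.

    Assumption: a new subtask starts only while elapsed time is below the
    target, and a subtask already in progress may finish past the target.
    """
    elapsed = 0.0
    count = 0
    for hours in subtask_hours:
        if elapsed >= target_hours:
            break  # target reached, do not start another subtask
        elapsed += hours
        count += 1
    return count


if __name__ == "__main__":
    # A 4-core VM running one single-core subtask leaves 3 of 4 cores idle.
    print(idle_core_fraction(4))
    # Subtasks of 3.5 h each: the fourth starts at 10.5 h and ends at 14 h.
    print(subtasks_run([3.5] * 10))
```

With 3.5 h subtasks the VM overshoots the 12 h target, which matches the "weak target" wording: the last subtask is allowed to finish rather than being cut off at 12 h.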

microchip Joined: 27 Jun 06 Posts: 10 Credit: 3,216,130 RAC: 4,551	Message 48986 - Posted: 6 Dec 2023, 18:20:16 UTC
@computezrmle I have no idea what you just said. English?
@maeax Good to know! I don't run test apps, though. Will wait.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 277,013,883 RAC: 145,659	Message 48987 - Posted: 6 Dec 2023, 19:54:24 UTC - in response to Message 48986.
"I have no idea what you just said."
OK, in short: it is not planned to publish a multithreading/multicore CMS application.

fastbunny Joined: 7 Nov 23 Posts: 2 Credit: 37,618 RAC: 0	Message 49047 - Posted: 17 Dec 2023, 13:11:47 UTC
What is the reason for filling one task with multiple subtasks? I would very much prefer the work units to be shorter.
I'm asking because the CMS tasks are problematic: they cannot be suspended. If you resume them after suspending, they will almost always fail, either with a generic compute error or with a small pop-up window saying 'breakpoint reached'. Do you know why this is so difficult for CMS tasks? I had been wondering why I would sometimes still get credit for these tasks, but now I understand that some of the subtasks may have been successfully completed earlier.
As a result, I only allow LHC@home to get new work if I know I can leave my computer running for 12 hours or more, which is not often. Otherwise I cannot reliably complete the tasks, because I cannot shut down and resume. I have tried suspending BOINC first and then checking VirtualBox to see whether the VMs have paused and saved correctly, but I still guess >95% of work units fail when you resume after suspend.
Please consider this post a friendly request to either make the work units shorter or make them handle pausing and resuming properly.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 277,013,883 RAC: 145,659	Message 49048 - Posted: 17 Dec 2023, 18:56:13 UTC - in response to Message 49047.
"What is the reason for filling one task with multiple subtasks?"
Mainly to update the CVMFS cache only once at the beginning of the task and reuse the data for several subtasks. In addition, each task runs some benchmarks at the beginning to estimate how many subtasks can be run within the 12 h target. That target plus a 6 h grace period results in the 18 h maximum task runtime. At the end of the 18 h runtime a task is forcibly cancelled, which prevents stuck tasks from running forever. In the end this is already a compromise between the scientific apps, which are developed to run 24/7 in a datacenter environment, and the BOINC environment.
"I would very much prefer the work units to be shorter."
The scientists prefer subtasks with even more events, which would lead to longer runtimes. ATM the average core time per subtask is 2-4 h.
"CMS tasks are problematic because they cannot be suspended"
CMS allows tasks to be suspended for up to 2 h. After this period the subtask is marked as lost by the backend systems.
"If you resume them after suspending, they will almost always fail, either with a generic compute error, or with a small pop-up window saying 'breakpoint reached'."
Very unusual. Since BOINC suspends the whole VM, it should continue from the same point. You may check whether "leave non GPU apps in RAM if suspended" is enabled. In addition, the task's runtime should already be longer than the global checkpoint interval set for your BOINC client. Also a good idea: make your computers visible and post links to example tasks that are marked as failed.
"but still I guess >95% of work units fail when you resume after suspend"
You don't mean 95 % of all tasks sent out by the server, do you?
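
The time limits in the post above can be put into a short sketch. The 12 h target, 6 h grace period, 18 h maximum and 2 h suspend limit come from the post; the constant and function names, the state labels, and the simple floor-division estimate are assumptions for illustration only, not the actual benchmark or backend logic.

```python
TARGET_HOURS = 12.0       # weak runtime target per task
GRACE_HOURS = 6.0         # extra time allowed for a subtask in progress
MAX_RUNTIME_HOURS = TARGET_HOURS + GRACE_HOURS  # task is cancelled here
MAX_SUSPEND_HOURS = 2.0   # longer suspends: subtask marked as lost


def estimated_subtasks(avg_subtask_hours: float) -> int:
    """Rough count of subtasks that fit entirely within the target."""
    if avg_subtask_hours <= 0:
        raise ValueError("avg_subtask_hours must be positive")
    return int(TARGET_HOURS // avg_subtask_hours)


def task_state(elapsed_hours: float) -> str:
    """Classify a running task by its elapsed runtime (hypothetical labels)."""
    if elapsed_hours < TARGET_HOURS:
        return "within target"
    if elapsed_hours < MAX_RUNTIME_HOURS:
        return "grace period"
    return "cancelled"


def subtask_survives_suspend(suspend_hours: float) -> bool:
    """True if the suspend is short enough for the backend to keep the subtask."""
    return suspend_hours <= MAX_SUSPEND_HOURS


if __name__ == "__main__":
    print(MAX_RUNTIME_HOURS)
    print(estimated_subtasks(3.0))
    print(task_state(14.0))
    print(subtask_survives_suspend(8.0))
```

Under these assumptions an overnight suspend of about 8 h exceeds the 2 h limit, so the subtask in progress would be marked as lost even if the VM itself resumes cleanly.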

fastbunny Joined: 7 Nov 23 Posts: 2 Credit: 37,618 RAC: 0	Message 49049 - Posted: 17 Dec 2023, 19:59:40 UTC - in response to Message 49048.
Thank you for your reply. It is useful to know that a task can be suspended for up to two hours but not longer. My 95% (guesstimate) fail rate for suspended tasks does indeed apply to tasks I have suspended for a whole night, for example. I did not mean that 95% of tasks fail in general, only that when they are suspended for a while, 95% of those fail after resuming. It is now clear why that happens.
I have now enabled the 'leave in RAM' option; thanks for the advice. The checkpoint interval was already set to 60 seconds, which should be fine, I think.

microchip Joined: 27 Jun 06 Posts: 10 Credit: 3,216,130 RAC: 4,551	Message 49050 - Posted: 18 Dec 2023, 17:24:22 UTC
I have no problem with suspending CMS units. My desktop is usually up 24/7 and I only occasionally reboot for an update that requires it (e.g., systemd) or for a new kernel. CMS happily resumes after the system is back up. I've never had a task error out because it was suspended and then resumed. This is on 2 different Linux systems.

Erich56 Joined: 18 Dec 15 Posts: 1875 Credit: 138,796,021 RAC: 64,129	Message 49052 - Posted: 19 Dec 2023, 6:34:46 UTC - in response to Message 49050.
"I have no problem with suspending CMS units. My desktop is usually up 24/7 and I only occasionally reboot due to some update that requires it"
I assume that this way your tasks are suspended only for a short time. As computezrmle wrote above, suspending for up to 2 hours should not be a problem anyway.

LHC@home