Thread 'Why Such Varied Runtimes?'

Author	Message
rbpeake Send message Joined: 17 Sep 04 Posts: 106 Credit: 36,686,048 RAC: 6,030	Message 45477 - Posted: 20 Oct 2021, 13:13:14 UTC I'm just curious why work unit runtimes vary so much for CMS (on one consistent machine)? Thanks. Regards, Bob P. ID: 45477 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 952 Credit: 785,101,604 RAC: 120,393	Message 45481 - Posted: 20 Oct 2021, 17:30:52 UTC - in response to Message 45477. I assume they are somewhat similar to SixTrack, in that if the particles crash into the wall then that's the end of the WU, but I can't say for sure if that's true. For me they also vary by +/- 2 hours on the same computer. ID: 45481 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 45482 - Posted: 20 Oct 2021, 18:48:51 UTC - in response to Message 45477. Last modified: 20 Oct 2021, 18:50:11 UTC I'm just curious why work unit runtimes vary so much for CMS (on one consistent machine)? Thanks. Well, you are getting errors on your one machine that runs CMS tasks. I can't be much more specific at the moment as I'm at home with poor connectivity -- and my uni has hobbled my laptop and desktop. A CMS "task" is an instance of a virtual machine -- that 1.5 GB file that you downloaded when you started up -- running inside the VirtualBox supervisor (hypervisor). If the VM doesn't start up cleanly, then the task will fail. When the VM _does_ start up, it requests a CMS "job" from an HTCondor batch job server. This job may fail for a number of reasons, most commonly network problems. Unfortunately these failures aren't always relayed to the BOINC task, so one task may suffer multiple job failures in its lifetime. Now, on most modern machines, each properly-running job takes about two hours to complete, generating 10,000 simulated proton-proton collisions within the CMS detector system. At this point, the job is terminated and its result and log files are sent back to CERN for storage (and later merging into larger files on dedicated storage). Jobs can also fail in this "stage-out" step. Once the job is finished, if the VM is less than 12 hours old, it requests another job from the condor pool; otherwise it terminates and your BOINC task ends. An easy way to see if your tasks are misbehaving is to look at your work-unit result page on the LHC@Home site. If you see a large discrepancy between the CPU time and running time, you probably have a problem. If your tasks consistently take less than 12 hours to run, you probably also have a problem. ID: 45482 · Reply Quote
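
The lifecycle described in the post above (the VM keeps requesting jobs until it is 12 hours old, then lets the current job finish) can be sketched in a few lines of Python. This is only an illustrative model, not the project's actual scheduling code: the function name `simulate_task` and the idea of passing in a list of job durations are assumptions made for the example, and VM boot time, stage-out time and idle waits are ignored.

```python
# Illustrative model only: the 12-hour cut-off and the ~2-hour job length
# come from the post; everything else here is a simplifying assumption.
MAX_VM_AGE_H = 12.0   # no new job is requested once the VM is this old


def simulate_task(job_hours, max_vm_age_h=MAX_VM_AGE_H):
    """Return (jobs_run, task_runtime_hours) for one BOINC task.

    job_hours is a sequence of per-job durations in hours. A very short
    duration can stand in for a job that failed early.
    """
    age = 0.0
    jobs = 0
    for duration in job_hours:
        if age >= max_vm_age_h:
            break            # VM is too old: stop asking for work, task ends
        age += duration      # the job runs to completion, even past 12 h
        jobs += 1
    return jobs, age


if __name__ == "__main__":
    print(simulate_task([2.0] * 10))    # six 2-hour jobs
    print(simulate_task([2.25] * 10))   # slightly slower machine
```

Under this model a healthy task always ends between 12 hours and 12 hours plus one job length, so a task that finishes well short of 12 hours suggests that something (for example, no follow-up job being delivered) cut it short.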

rbpeake Send message Joined: 17 Sep 04 Posts: 106 Credit: 36,686,048 RAC: 6,030	Message 45488 - Posted: 20 Oct 2021, 22:56:14 UTC - in response to Message 45482. Thank you for the detailed reply. It's interesting to know that jobs will keep coming up to a 12-hour age for the VM. Those recent CMS failures were due to a reboot. I'll be more careful next time!👍 Regards, Bob P. ID: 45488 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 45492 - Posted: 21 Oct 2021, 7:44:09 UTC - in response to Message 45488. Thank you for the detailed reply. It's interesting to know that jobs will keep coming up to a 12-hour age for the VM. Those recent CMS failures were due to a reboot. I'll be more careful next time!👍 Yes. It's important that you shut BOINC down carefully with boincmgr or boinccmd, and give the VM time to store its state (2 minutes should be enough) before you switch off or reboot. Then, as long as the interruption isn't too long, the task will resume from where it was when BOINC is restarted. ID: 45492 · Reply Quote
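
For anyone who wants to script the advice above, here is a minimal sketch. It assumes `boinccmd` is on the PATH and is allowed to talk to the local client. The helper name `stop_boinc_cleanly` is made up for this example, and the 120-second wait is simply the "2 minutes should be enough" figure from the post, not a documented guarantee.

```python
import subprocess
import time


def stop_boinc_cleanly(grace_seconds=120, run=subprocess.run, sleep=time.sleep):
    """Ask the local BOINC client to quit, then wait before power-off.

    run and sleep are injectable so the function can be dry-run or tested
    without a BOINC installation.
    """
    # `boinccmd --quit` tells the running client to exit.
    run(["boinccmd", "--quit"], check=True)
    # Give the VM time to save its state before the machine goes down.
    sleep(grace_seconds)
```

Calling `stop_boinc_cleanly()` from a pre-shutdown script and only then rebooting should avoid the kind of failures a hard reboot causes, as long as the machine comes back reasonably soon.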

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45493 - Posted: 21 Oct 2021, 9:50:32 UTC - in response to Message 45492. Could you take a look at mine? I see a large variation too, as compared to what it used to be with 50.00, when it used to run for 13 1/2 hours and was done with it. Now they vary from 1/2 hour to 12 1/2 hours, though it does not seem to be due to errors to any great extent. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10697859&offset=0&show_names=0&state=4&appid= I assumed the new app was allowing the experimenters to do different things, though what they were doing all along is a mystery to me anyway. ID: 45493 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,211,257 RAC: 117,038	Message 45495 - Posted: 21 Oct 2021, 10:28:49 UTC - in response to Message 45493. I also see those short runtimes (<12 h). The BOINC clients in question are running only CMS 24/7 without any interruption. Uploads (stage-out) work fine. The logfiles do not show any hints that point out any errors nor do my other logs. @Ivan Be so kind as to investigate whether a script deeper in the process or a configuration independent from BOINC (Condor/WMAgent) causes a task not to get follow-up jobs (=subtasks). ID: 45495 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 45601 - Posted: 3 Nov 2021, 14:39:24 UTC - in response to Message 45495. Last modified: 3 Nov 2021, 14:41:48 UTC I also see those short runtimes (<12 h). The BOINC clients in question are running only CMS 24/7 without any interruption. Uploads (stage-out) work fine. The logfiles do not show any hints that point out any errors nor do my other logs. @Ivan Be so kind as to investigate whether a script deeper in the process or a configuration independent from BOINC (Condor/WMAgent) causes a task not to get follow-up jobs (=subtasks). The 21st Oct was when we were updating the WMAgent, so no jobs would have been available then. [Edit] Oops, no, I misread the chart... [/Edit] Jim seems to have had task times >12 hrs in recent times. Yours are still a little short, I hope your 28 machines / 430 cores aren't taxing your network bandwidth. 🙂 ID: 45601 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,211,257 RAC: 117,038	Message 45602 - Posted: 3 Nov 2021, 15:27:53 UTC - in response to Message 45601. ... I hope your 28 machines / 430 cores aren't taxing your network bandwidth. 🙂 Since I run multiple BOINC clients on the same box those numbers are not real. I'm usually running around 35-37 CMS and 35-37 ATLAS tasks concurrently which results in a download saturation of <5% and an upload saturation of <15%. Unfortunately this afternoon my ISP had a major DSLAM outage that affected a couple of households. 3rd time within a few weeks :-( May cause some tasks to fail. ID: 45602 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 45603 - Posted: 3 Nov 2021, 15:57:23 UTC Yes, they have been well-behaved for me the last week or so, averaging a little over 12 hours. But I am learning new things about running on 23 cores of a Ryzen 3900X (reserving one core for a GPU). I received an "out of disk" space error message, and saw that my 250 GB SSD was in fact a little low, which was a surprise. So I upgraded to 500 GB, which should be plenty, but recently got the error message again. You have to set the BOINC "Disk usage" parameters manually, and not rely on the defaults. So I am using 500 GB max, and 100% of total disk space. That seems to have satisfied it, for the moment. ID: 45603 · Reply Quote
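
A rough way to see why the "Disk usage" settings matter: BOINC's disk preferences ("use no more than X GB", "leave at least X GB free", "use no more than X% of total") are, as far as I understand it, applied together, with the most restrictive one winning. The sketch below assumes exactly that behaviour. The function name and all the numbers in the example are hypothetical and are not taken from the BOINC source.

```python
def boinc_disk_limit_gb(total_gb, free_gb, boinc_used_gb,
                        max_used_gb, max_used_pct, min_free_gb):
    """Estimate how much disk BOINC may use, in GB.

    Assumption: the client takes the smallest of the three limits.
    """
    by_absolute = max_used_gb                         # "use no more than X GB"
    by_percent = total_gb * max_used_pct / 100.0      # "use no more than X% of total"
    by_free = boinc_used_gb + free_gb - min_free_gb   # "leave at least X GB free"
    return max(0.0, min(by_absolute, by_percent, by_free))


if __name__ == "__main__":
    # Hypothetical 500 GB disk with 400 GB free, BOINC already using 50 GB.
    print(boinc_disk_limit_gb(500, 400, 50, max_used_gb=500,
                              max_used_pct=100, min_free_gb=0))
    # Same disk, but with a 50% cap: the percentage limit now dominates.
    print(boinc_disk_limit_gb(500, 400, 50, max_used_gb=500,
                              max_used_pct=50, min_free_gb=0))
```

If that assumption holds, upgrading the SSD alone would not help while one of the other limits stays low, which would fit what is described in the post above.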

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 45605 - Posted: 3 Nov 2021, 18:56:01 UTC - in response to Message 45603. Last modified: 3 Nov 2021, 19:01:36 UTC Yes, they have been well-behaved for me the last week or so, averaging a little over 12 hours. But I am learning new things about running on 23 cores of a Ryzen 3900X (reserving one core for a GPU). I received an "out of disk" space error message, and saw that my 250 GB SSD was in fact a little low, which was a surprise. So I upgraded to 500 GB, which should be plenty, but recently got the error message again. You have to set the BOINC "Disk usage" parameters manually, and not rely on the defaults. So I am using 500 GB max, and 100% of total disk space. That seems to have satisfied it, for the moment. Good to hear. In the early days of BOINC (which mostly meant SETI@Home) you could let your install "fly blind", but with much more sophisticated applications these days (including such as ours, using VMs and/or GPUs), a little bit of oversight is often necessary. As you've found out, this is especially true in terms of file sizes and network bandwidth. Within CMS we now routinely deal with data files of at least 2 GB (in fact, our sister site, T3_CH_CMSAtHome, which runs at CERN, does all the "post-production" tasks for CMS@Home, and its major task is JobMerge, where it takes your 60-80 MB result files, gathers them into files of ~2 GB and stores them in our central storage system). In case you didn't know, our production pseudo-site is known in CMS as T3_CH_Volunteer, which translates as "Tier 3" (Tier 0 is CERN, Tier 1 sites are national facilities, Tier 2 sites are regional centres such as London, and Tier 3 sites are seen as institutional set-ups), and CH means Switzerland (obviously...). ID: 45605 · Reply Quote
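
To put the JobMerge figures above in perspective, a quick back-of-the-envelope calculation. It treats "~2 GB" as 2000 MB, which is an assumption made for round numbers, and the helper name is invented for this example.

```python
import math


def results_per_merged_file(merged_mb, result_mb):
    """Number of result files needed to reach at least merged_mb."""
    return math.ceil(merged_mb / result_mb)


if __name__ == "__main__":
    # 60-80 MB result files gathered into a file of roughly 2000 MB.
    print(results_per_merged_file(2000, 80))  # larger result files
    print(results_per_merged_file(2000, 60))  # smaller result files
```

So each merged file holds the output of roughly 25 to 34 volunteer jobs, or around 50 to 70 hours of job time at about two hours per job.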