Message boards : CMS Application : Why Such Varied Runtimes?

rbpeake

Joined: 17 Sep 04
Posts: 99
Credit: 30,618,118
RAC: 3,938
Message 45477 - Posted: 20 Oct 2021, 13:13:14 UTC

I'm just curious why work unit runtimes vary so much for CMS (on one consistent machine)?
Thanks.
Regards,
Bob P.
ID: 45477
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,678,398
RAC: 235,495
Message 45481 - Posted: 20 Oct 2021, 17:30:52 UTC - in response to Message 45477.  

I assume they are somewhat similar to SixTrack, in that if the particles crash into the wall then that's the end of the WU, but I can't say for sure if that's true.

For me they are also +/- 2 hours for the same computer.
ID: 45481
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 45482 - Posted: 20 Oct 2021, 18:48:51 UTC - in response to Message 45477.  
Last modified: 20 Oct 2021, 18:50:11 UTC

I'm just curious why work unit runtimes vary so much for CMS (on one consistent machine)?
Thanks.

Well, you are getting errors on your one machine that runs CMS tasks. I can't be much more specific at the moment as I'm at home with poor connectivity -- and my uni has hobbled my laptop and desktop.
A CMS "task" is an instance of a virtual machine -- that 1.5 GB file that you downloaded when you started up -- running inside the VirtualBox supervisor (hypervisor). If the VM doesn't start up cleanly, then the task will fail.
When the VM _does_ start up, it requests a CMS "job" from an HTCondor batch job server. This job may fail for a number of reasons, most commonly network problems. Unfortunately these failures aren't always relayed to the BOINC task, so one task may suffer multiple job failures in its lifetime.
Now, on most modern machines, each properly-running job takes about two hours to complete, generating 10,000 simulated proton-proton collisions within the CMS detector system. At this point, the job is terminated and its result and log files are sent back to CERN for storage (and later merging into larger files on dedicated storage). Jobs can also fail in this "stage-out" step. Once the job is finished, if the VM is less than 12 hours old, it requests another job from the condor pool, otherwise it terminates and your BOINC task ends.
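The job loop described above can be sketched in a few lines. This is only an illustration of the rule "request another job while the VM is less than 12 hours old", not the actual VM or HTCondor logic; the function name, the `startup_hours` parameter, and the idea that the task simply ends when no further job is supplied are all assumptions made for the sketch.

```python
JOB_HOURS = 2.0           # typical job length quoted in the post
VM_MAX_AGE_HOURS = 12.0   # the VM stops asking for jobs once it is this old


def simulate_task(job_durations, startup_hours=0.0):
    """Return (task_runtime_hours, jobs_run) for one VM lifetime.

    job_durations lists the wall-clock hours of each job the VM would be
    given. A new job is started only while the VM is younger than
    VM_MAX_AGE_HOURS; a job that has started always runs to completion.
    Running out of entries models "no follow-up job was available"
    (a simplification assumed here).
    """
    age = startup_hours
    jobs = 0
    for duration in job_durations:
        if age >= VM_MAX_AGE_HOURS:
            break  # VM is old enough: terminate instead of asking again
        age += duration
        jobs += 1
    return age, jobs
```

With exactly 2-hour jobs, `simulate_task([2.0] * 10)` gives `(12.0, 6)`. With 2.25-hour jobs the sixth job starts at 11.25 hours and the task ends at 13.5 hours, and a task that only ever receives two jobs ends after 4 hours, so modest changes in job length or job availability are enough to spread the task runtimes widely.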
An easy way to see if your tasks are misbehaving is to look at your work-unit result page on the LHC@Home site. If you see a large discrepancy between the CPU time and running time, you probably have a problem. If your tasks consistently take less than 12 hours to run, you probably also have a problem.
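Those two rules of thumb could be written down roughly as follows. The function name and both thresholds (`min_cpu_ratio`, `min_run_hours`) are assumptions chosen for illustration; the post only says "large discrepancy" and "less than 12 hours", and the numbers would have to be copied by hand from the result page.

```python
def check_task(run_time_s, cpu_time_s, min_cpu_ratio=0.5, min_run_hours=12.0):
    """Return a list of warnings for one task, given times in seconds.

    min_cpu_ratio is a hypothetical cut-off for "large discrepancy":
    the task is flagged when CPU time / run time falls below it.
    """
    if run_time_s <= 0:
        return ["no run time recorded"]
    warnings = []
    if cpu_time_s / run_time_s < min_cpu_ratio:
        warnings.append("CPU time is much lower than run time")
    if run_time_s < min_run_hours * 3600:
        warnings.append(f"task ran for less than {min_run_hours:g} hours")
    return warnings
```

A single short task is not conclusive on its own; the post talks about tasks that are consistently short, so a check like this is more useful applied across several recent results.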
ID: 45482
rbpeake

Joined: 17 Sep 04
Posts: 99
Credit: 30,618,118
RAC: 3,938
Message 45488 - Posted: 20 Oct 2021, 22:56:14 UTC - in response to Message 45482.  

Thank you for the detailed reply. It’s interesting to know that jobs will keep coming up to a 12-hour age for the VM.
Those recent CMS failures were due to a reboot. I’ll be more careful next time!👍
Regards,
Bob P.
ID: 45488
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 45492 - Posted: 21 Oct 2021, 7:44:09 UTC - in response to Message 45488.  

Thank you for the detailed reply. It’s interesting to know that jobs will keep coming up to a 12-hour age for the VM.
Those recent CMS failures were due to a reboot. I’ll be more careful next time!👍

Yes. It's important that you shut BOINC down carefully with boincmgr or boinccmd, and give the VM time to store its state (2 minutes should be enough) before you switch off or reboot. Then, as long as the interruption isn't too long, the task will resume from where it was when BOINC is restarted.
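As a rough sketch of that shutdown routine, the wrapper below asks the client to quit via `boinccmd --quit` and then waits the suggested two minutes before anything else happens. The function name and the injectable `run` and `sleep` parameters are hypothetical conveniences; depending on the installation, boinccmd may also need to be run from the BOINC data directory or given RPC credentials, which is not handled here.

```python
import subprocess
import time


def stop_boinc_gracefully(grace_seconds=120, run=subprocess.run, sleep=time.sleep):
    """Ask the local BOINC client to quit, then pause so VMs can save state.

    grace_seconds defaults to the 2 minutes suggested in the post.
    run and sleep are parameters only so the function can be tried out
    without a BOINC installation.
    """
    cmd = ["boinccmd", "--quit"]
    run(cmd, check=True)   # raises CalledProcessError if boinccmd fails
    sleep(grace_seconds)   # give VirtualBox time to store the VM state
    return cmd
```

Calling `stop_boinc_gracefully()` before a reboot, and only rebooting once it returns, follows the order described above.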
ID: 45492
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45493 - Posted: 21 Oct 2021, 9:50:32 UTC - in response to Message 45492.  

Could you take a look at mine? I see a large variation too, as compared to what it used to be with 50.00, when it used to run for 13 1/2 hours and was done with it.
Now they vary from 1/2 hour to 12 1/2 hours, though it does not seem to be due to errors to any great extent.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10697859&offset=0&show_names=0&state=4&appid=

I assumed the new app was allowing the experimenters to do different things, though what they were doing all along is a mystery to me anyway.
ID: 45493
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,123
RAC: 138,176
Message 45495 - Posted: 21 Oct 2021, 10:28:49 UTC - in response to Message 45493.  

I also see those short runtimes (<12 h).

The BOINC clients in question are running only CMS 24/7 without any interruption.
Uploads (stage-out) work fine.
The logfiles do not show any hints that point out any errors nor do my other logs.


@Ivan
Be so kind as to investigate whether a script deeper in the process or a configuration independent from BOINC (Condor/WMAgent) causes a task not to get follow-up jobs (=subtasks).
ID: 45495
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 45601 - Posted: 3 Nov 2021, 14:39:24 UTC - in response to Message 45495.  
Last modified: 3 Nov 2021, 14:41:48 UTC

I also see those short runtimes (<12 h).

The BOINC clients in question are running only CMS 24/7 without any interruption.
Uploads (stage-out) work fine.
The logfiles do not show any hints that point out any errors nor do my other logs.


@Ivan
Be so kind as to investigate whether a script deeper in the process or a configuration independent from BOINC (Condor/WMAgent) causes a task not to get follow-up jobs (=subtasks).

The 21st Oct was when we were updating the WMAgent, so no jobs would have been available then. [Edit] Oops, no, I misread the chart... [/Edit]
Jim seems to have had task times >12 hrs in recent times. Yours are still a little short; I hope your 28 machines /430 cores aren't taxing your network bandwidth. 🙂
ID: 45601
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,123
RAC: 138,176
Message 45602 - Posted: 3 Nov 2021, 15:27:53 UTC - in response to Message 45601.  

... I hope your 28 machines /430 cores aren't taxing your network bandwidth. 🙂

Since I run multiple BOINC clients on the same box those numbers are not real.
I'm usually running around 35-37 CMS and 35-37 ATLAS tasks concurrently which results in a download saturation of <5% and an upload saturation of <15%.

Unfortunately this afternoon my ISP had a major DSLAM outage that affected a couple of households.
3rd time within a few weeks :-(
May cause some tasks to fail.
ID: 45602
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45603 - Posted: 3 Nov 2021, 15:57:23 UTC

Yes, they have been well-behaved for me the last week or so, averaging a little over 12 hours.

But I am learning new things about running on 23 cores of a Ryzen 3900X (reserving one core for a GPU).
I received an "out of disk" space error message, and saw that my 250 GB SSD was in fact a little low, which was a surprise.
So I upgraded to 500 GB, which should be plenty, but recently got the error message again.

You have to set the BOINC "Disk usage" parameters manually, and not rely on the defaults. So I am using 500 GB max, and 100% of total disk space.
That seems to have satisfied it, for the moment.
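To see why the defaults can bite, here is a rough sketch of how the three BOINC disk preferences (maximum GB used, maximum percentage of the disk, minimum GB left free) might combine. It assumes the client honours the most restrictive of the three, and that space BOINC already occupies counts as available to it; the function name and all example numbers other than the 500 GB / 100% settings mentioned above are hypothetical.

```python
def boinc_disk_allowance_gb(total_gb, free_gb, boinc_used_gb,
                            max_used_gb, max_used_pct, min_free_gb):
    """Estimate how many GB BOINC may use under the three disk preferences.

    Assumption: the most restrictive limit wins. Space BOINC already uses
    is added back to the free space before the "leave free" limit is
    applied, since BOINC could reuse it.
    """
    limits = [
        max_used_gb,                             # "use no more than X GB"
        total_gb * max_used_pct / 100.0,         # "use no more than X% of total"
        free_gb + boinc_used_gb - min_free_gb,   # "leave at least X GB free"
    ]
    return max(0.0, min(limits))
```

On a hypothetical 500 GB disk with 400 GB free and 50 GB already used by BOINC, conservative settings such as 100 GB and 50% would cap BOINC at 100 GB, while 500 GB and 100% with 1 GB kept free would allow about 449 GB. With many VM tasks running at once, a low cap could plausibly be reached long before the disk itself is full.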
ID: 45603
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 45605 - Posted: 3 Nov 2021, 18:56:01 UTC - in response to Message 45603.  
Last modified: 3 Nov 2021, 19:01:36 UTC

Yes, they have been well-behaved for me the last week or so, averaging a little over 12 hours.

But I am learning new things about running on 23 cores of a Ryzen 3900X (reserving one core for a GPU).
I received an "out of disk" space error message, and saw that my 250 GB SSD was in fact a little low, which was a surprise.
So I upgraded to 500 GB, which should be plenty, but recently got the error message again.

You have to set the BOINC "Disk usage" parameters manually, and not rely on the defaults. So I am using 500 GB max, and 100% of total disk space.
That seems to have satisfied it, for the moment.

Good to hear. In the early days of BOINC (which mostly meant SETI@Home) you could let your install "fly blind", but with much more sophisticated applications these days (including such as ours, using VMs and/or GPUs), a little bit of oversight is often necessary. As you've found out, this is especially true in terms of file sizes and network bandwidth. Within CMS we now routinely deal with data files of at least 2 GB. In fact, our sister site*, T3_CH_CMSAtHome, which runs at CERN, does all the "post-production" tasks for CMS@Home; its major task is JobMerge, where it takes your 60-80 MB result files, gathers them into files of ~2 GB, and stores them in our central storage system.
* In case you didn't know, our production pseudo-site is known in CMS as T3_CH_Volunteer, which translates as "Tier 3" (Tier 0 is CERN, Tier 1 are national facilities, Tier 2 are regional centres such as London, and Tier 3 are seen as institutional set-ups), and CH means Switzerland (obviously...).
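To give a feel for the merge arithmetic, the sketch below packs result files into batches of at most about 2 GB. It is a plain greedy grouping written for illustration, not the actual JobMerge algorithm; the function name and the 2000 MB target (taking 2 GB as 2000 MB) are assumptions.

```python
def group_for_merge(sizes_mb, target_mb=2000):
    """Greedily pack result-file sizes (in MB) into batches of <= target_mb.

    Files are taken in the order given; a new batch is started whenever
    adding the next file would push the current batch over target_mb.
    """
    batches, current, current_size = [], [], 0
    for size in sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With 70 MB files, each batch holds 28 results (1960 MB), so one merged file corresponds to roughly 28 volunteer jobs under these assumptions.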
ID: 45605



©2024 CERN