Message boards : CMS Application : CMS 47.90 WU runs 18 hours but does nothing after 12 hours of runtime. What can I do to use the CPU for CMS more effectively?
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38222 - Posted: 11 Mar 2019, 21:15:05 UTC

CMS 47.90 WU runs 18 hours but does nothing after 12 hours of runtime. What can I do to use the CPU for CMS more effectively?
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38231 - Posted: 12 Mar 2019, 21:24:19 UTC - in response to Message 38222.  
Last modified: 12 Mar 2019, 21:25:58 UTC

CMS 47.90 WU runs 18 hours but does nothing after 12 hours of runtime. What can I do to use the CPU for CMS more effectively?

That is something we need to address urgently. As far as CMS is concerned, the new system is running fairly well: in 6 days you guys have generated 600 GB of Monte-Carlo data, with an error rate of only a few percent (jobs failed over jobs submitted).
However, and this may be related to the actual workflow that a colleague provided me for testing, there are large periods of minimal activity on the part of the VM running the simulations. It appears to me that this is mainly related to the first job that a BOINC task runs, suggesting to me that the workflow is trying to access a resource with limited network connectivity (or otherwise limited throughput). Subsequent tasks appear to start up much faster.
But, there are other potential pitfalls that need to be considered. We have in the past had volunteers who simply overestimated their communications bandwidth (remember that the A in ADSL stands for "asymmetric"; you may well have 10 Mbps download speed but only 1 Mbps upload) and ran so many jobs that their pipes were saturated. I aim, at the moment, for jobs taking 60-90 minutes and generating ~60 MB of results to upload. (There is no guarantee that these limits will be an overarching concern when we hand the project over to production runs...)
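As a rough rule of thumb (a back-of-the-envelope sketch only, treating the ~60 MB result per 60-90 minute job above as typical and ignoring that uploads are bursty rather than evenly spread), you can estimate how many concurrent jobs a given upload link can sustain in steady state:

# Steady-state estimate: how many concurrent CMS jobs an upload link sustains,
# assuming ~60 MB of results per job and ~60 minutes per job (figures from above).
def max_concurrent_jobs(upload_mbit_s, result_mb=60, job_minutes=60):
    upload_mb_per_min = upload_mbit_s / 8 * 60      # Mbit/s -> MB/min
    mb_per_min_per_job = result_mb / job_minutes    # upload demand of one job
    return int(upload_mb_per_min / mb_per_min_per_job)

print(max_concurrent_jobs(1))    # 1 Mbit/s upload  -> about 7 jobs
print(max_concurrent_jobs(10))   # 10 Mbit/s upload -> about 75 jobs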
Then, there was the little glitch this morning when I underestimated how many jobs we had in the queue and blissfully remained in bed while the gales raged around me and the rain lashed the windows. The lack of jobs takes a while to recover from, unfortunately. When we do get to production mode, the number of queued jobs is likely to increase dramatically, so this should be less of a problem.
At the moment, I'm in a data-collecting mode. Feel free to post your efficiencies, etc. I may not be able to reply to every input.
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38233 - Posted: 13 Mar 2019, 1:02:33 UTC - in response to Message 38231.  

At the moment, I'm in a data-collecting mode. Feel free to post your efficiencies, etc. I may not be able to reply to every input.

I saw an average 71% efficiency for 4 CMS tasks on my old Athlon64 X2, but on my old Xeon the efficiency for these 7 tasks ranges from 6% to 39%.
Note that the Xeon reports abysmal IOPS compared to even the old Athlon64, which means it should process events abysmally slowly (as it does on ATLAS native), and its reported FLOPS is about equal to the Athlon64's, but... that doesn't explain such a low CPU-time-to-run-time efficiency. Or does it?
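For anyone comparing numbers: by efficiency I simply mean CPU time divided by elapsed run time, e.g.:

# CPU efficiency = CPU time / elapsed run time (values below are just examples).
def efficiency(cpu_time_s, run_time_s):
    return 100.0 * cpu_time_s / run_time_s

print(f"{efficiency(40000, 64000):.0f}%")   # ~62%
print(f"{efficiency(4000, 64000):.0f}%")    # ~6%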
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,929,849
RAC: 137,649
Message 38236 - Posted: 13 Mar 2019, 8:46:54 UTC - in response to Message 38231.  

... there are large periods of minimal activity on the part of the VM running the simulations. It appears to me that this is mainly related to the first job that a BOINC task runs, suggesting to me that the workflow is trying to access a resource with limited network connectivity (or otherwise limited throughput). Subsequent tasks appear to start up much faster.

As far as I can see in my proxy log, the initial delay is mainly caused by the huge number (many thousands) of internet requests to different servers, e.g. cvmfs-stratum-one.cern.ch and cms-frontier.openhtc.io.
Even if the proxy serves nearly 98% of those requests from its cache to subsequent VMs that start after the first VM has finished its setup, those subsequent VMs still need roughly 20 min to start job processing.
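
If you run a local Squid proxy and want to check your own cache hit rate, a minimal sketch like the one below works with Squid's default native access.log format (the log path is an assumption; adjust it to your setup):

# Count cache hits vs. misses in a Squid access.log (native log format:
# the 4th field looks like "TCP_HIT/200", "TCP_MEM_HIT/200", "TCP_MISS/200", ...).
from collections import Counter

counts = Counter()
with open("/var/log/squid/access.log") as log:      # path is an assumption
    for line in log:
        fields = line.split()
        if len(fields) > 3:
            result = fields[3].split("/")[0]
            counts["hit" if "HIT" in result else "miss"] += 1

total = sum(counts.values())
if total:
    print(f"cache hit rate: {100.0 * counts['hit'] / total:.1f}% of {total} requests")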


The idle gap between two subsequent jobs seems to be caused by the time needed to upload the result plus a fixed 10-minute Condor delay.
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,369,407
RAC: 101,957
Message 38239 - Posted: 13 Mar 2019, 11:13:07 UTC

My latest task (finished a few minutes ago):
total time: 64,861.61 secs; CPU time: 40,221.78 secs.
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38240 - Posted: 13 Mar 2019, 11:39:10 UTC - in response to Message 38239.  

That is good.
But I have a lot of tasks with about 64,000 secs of total time and only about 4,000 secs of CPU time :-(
My internet connection is stable at 100 Mbit per sec.
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38241 - Posted: 13 Mar 2019, 11:45:38 UTC - in response to Message 38240.  

My best is 32,000 secs of CPU time; total time is always about 64,000 secs.
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 38243 - Posted: 14 Mar 2019, 11:09:27 UTC - in response to Message 38231.  

It appears to me that this is mainly related to the first job that a BOINC task runs, suggesting to me that the workflow is trying to access a resource with limited network connectivity (or otherwise limited throughput).
Maybe using openhtc would help overcome this issue?
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,369,407
RAC: 101,957
Message 38253 - Posted: 16 Mar 2019, 10:14:31 UTC

With my most recent task, it got even worse:

total time: 64,863.64 secs; CPU time: 38,658.58 secs
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38269 - Posted: 18 Mar 2019, 21:14:40 UTC - in response to Message 38253.  

It's running that way for me too, both with single-core VMs for LHC@Home and with single- and dual-core jobs in -dev. At some point the tasks stop running jobs and sit idle until the 18-hour time-out occurs. Unfortunately they are still occupying BOINC slots, so other projects don't take up the slack.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,929,849
RAC: 137,649
Message 38289 - Posted: 19 Mar 2019, 12:18:28 UTC

CMS tasks currently show the following work pattern.

Phase 1:
Download and VM setup

Phase 2:
Process subtask (=job) #1

Phase 3:
Process subtasks #2 - #n

Phase 4:
VM shutdown



Comments

During phase 1 and (partly) phase 2 the VM downloads roughly 17,000 files (about 800 MB).
Most of them are requested from cvmfs-stratum-one.cern.ch (Europe; it may be another stratum-1 CVMFS server depending on the geolocation).
About 10% are requested from cms-frontier.openhtc.io.


Even on a fast (100 Mbit/s) connection this setup takes about 20 min, during which the CPU load is only occasionally above idle.
A local proxy can deliver roughly 98% of these downloads, in which case the startup only needs 5-8 min.
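
A quick sanity check with the rough figures above shows that raw transfer time is not the bottleneck; the setup time is dominated by the sheer number of requests:

# Back-of-the-envelope check using the approximate figures above.
volume_mb  = 800           # data downloaded during setup
files      = 17000         # number of individual requests
link_mb_s  = 100 / 8       # 100 Mbit/s ~= 12.5 MB/s
setup_s    = 20 * 60       # observed setup time without a local proxy

transfer_s     = volume_mb / link_mb_s     # ~64 s of pure transfer
ms_per_request = setup_s / files * 1000    # ~70 ms average per request

print(f"pure transfer ~{transfer_s:.0f} s, average ~{ms_per_request:.0f} ms per request")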


The setup includes a few downloads from the EPEL repositories.
At least at my home location, the VM first tries the slowest possible EPEL mirror (155 Mbit/s) instead of other mirrors that are just as close and provide up to 10,000 Mbit/s.
A workaround is to reject requests to the slow server at the firewall; the VM will then immediately try other mirrors.


Another issue is the access to gitlab.cern.ch, from which the VM requests parts of Singularity.
I sometimes notice VMs that do not correctly resolve the IPs of the gitlab servers.
As a result, the Singularity setup remains incomplete.
Those VMs never start a subtask. Instead they remain idle until the 18h watchdog shuts them down.
I suspect this is caused by a misconfigured DNS at CERN (lame DNS servers; missing/wrong CNAME entries).
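
A quick way to check from a console whether the name resolves at all is something like the following (this only tests name resolution, not whether the returned addresses are the correct ones):

# Check whether gitlab.cern.ch resolves from the current environment.
import socket

host = "gitlab.cern.ch"
try:
    addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443)})
    print(f"{host} resolves to: {', '.join(addrs)}")
except socket.gaierror as err:
    print(f"{host} does not resolve: {err}")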



Phase 2 and each subtask in phase 3 end with an upload of (currently) 40-60 MB, followed by an idle time of usually 10 min.
Each subtask in phase 3 starts with additional downloads of >1,700 files, mainly from cms-frontier.openhtc.io.


The subtask #n that finishes after the 12 h limit should shut down the VM.
Instead, the VM remains idle until the 18 h watchdog shuts it down.
This should be investigated by the CMS team.
A temporary workaround could be to lower the limit in CMS_2016_03_22.xml -> <job_duration>xxxxx</job_duration>.
It must not be set too low, though, so that subtask #n still has enough time to finish.
Changes there become active with the next VM that starts after the change.
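
If you want to see the current value before changing anything, a small sketch like this reads it out (the project directory path below is an assumption for a typical Linux BOINC install; only the <job_duration> element name is taken from above):

# Print the current <job_duration> value from CMS_2016_03_22.xml.
# The path below is an assumption; adjust it to your BOINC data directory.
import xml.etree.ElementTree as ET

path = "/var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome/CMS_2016_03_22.xml"
elem = ET.parse(path).getroot().find(".//job_duration")
if elem is not None:
    seconds = int(elem.text)
    print(f"job_duration: {seconds} s ({seconds / 3600:.1f} h)")
else:
    print("no <job_duration> element found")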
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38341 - Posted: 20 Mar 2019, 10:48:26 UTC - in response to Message 38289.  

Thanks for that analysis -- I was going to ask for some of your logs, but you've saved me the trouble. I'll pass your comments on. There are still several days left in this workflow; if there's no resolution by then I'll submit a batch of the previous workflow to see if it has the same problems.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38345 - Posted: 20 Mar 2019, 13:45:35 UTC
Last modified: 20 Mar 2019, 13:46:53 UTC

We've got a couple more clues as to when some problems may have started (Mar 15 15:14 CERN time) but no solution yet. Curiously that was when we'd written just a very small fraction more than 1 TiB of merged results to central storage...
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38359 - Posted: 21 Mar 2019, 11:20:10 UTC - in response to Message 38289.  

CMS tasks currently show the following work pattern.
[...]


Theory Simulation multi-core tasks have the same problem (idle cores between 12 and 18 hours of runtime), but Theory Simulation single-core tasks are fine.
Maybe the experience of the Theory Simulation team can be useful for the CMS team.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,929,849
RAC: 137,649
Message 38362 - Posted: 21 Mar 2019, 11:53:47 UTC - in response to Message 38359.  

Theory Simulation multi-core tasks have the same problem (idle cores between 12 and 18 hours of runtime), but Theory Simulation single-core tasks are fine.
Maybe the experience of the Theory Simulation team can be useful for the CMS team.

The multi-core idle problem at Theory is caused by long-runners like Sherpa.
While other cores may have finished their jobs, the long-runner keeps the whole VM active.
Worst case is the forced 18 h watchdog shutdown.

From the BOINC perspective, the idle cores remain allocated until the Theory task is completely shut down.
This is one reason to prefer a single-core setup, as long as the host has enough RAM to run a couple of them concurrently.
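
As a trivial sizing aid (the RAM-per-VM figure below is a placeholder assumption, not a project number; check the actual memory setting of the VMs on your host):

# How many single-core VMs fit into the RAM you are willing to give BOINC?
def max_concurrent_vms(available_ram_mb, ram_per_vm_mb=2048):   # 2048 is a placeholder
    return available_ram_mb // ram_per_vm_mb

print(max_concurrent_vms(16384))   # e.g. 16 GB available -> 8 VMs at the assumed size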

CMS jobs have comparable runtimes. Hence the final idle period in a multicore setup shouldn't be that long.
You may test it at the -dev project.

Be aware that ATLAS behaves completely differently, as it really uses all configured cores per job.
