Message boards : CMS Application : CMS 47.90 WU runs 18 hours but does nothing after 12 hours of runtime. What can I do to use the CPU for CMS more effectively?
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38222 - Posted: 11 Mar 2019, 21:15:05 UTC

CMS 47.90 WU runs 18 hours but does nothing after 12 hours of runtime. What can I do to use the CPU for CMS more effectively?
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38231 - Posted: 12 Mar 2019, 21:24:19 UTC - in response to Message 38222.  
Last modified: 12 Mar 2019, 21:25:58 UTC

CMS 47.90 WU runs 18 hours but does nothing after 12 hours of runtime. What can I do to use the CPU for CMS more effectively?

That is something we need to address urgently. As far as CMS is concerned, the new system is running fairly well: in 6 days you guys have generated 600 GB of Monte-Carlo data, with an error rate of only a few percent (jobs failed over jobs submitted).
However, and this may be related to the actual workflow that a colleague provided me for testing, there are large periods of minimal activity on the part of the VM running the simulations. It appears to me that this is mainly related to the first job that a BOINC task runs, suggesting to me that the workflow is trying to access a resource with limited network connectivity (or otherwise limited throughput). Subsequent tasks appear to start up much faster.
But, there are other potential pitfalls that need to be considered. We have in the past had volunteers who simply overestimated their communications bandwidth (remember that the A in ADSL stands for "asymmetric"; you may well have 10 Mbps download speed but only 1 Mbps upload) and ran so many jobs that their pipes were saturated. I aim, at the moment, for jobs taking 60-90 minutes and generating ~60 MB of results to upload. (There is no guarantee that these limits will be an overarching concern when we hand the project over to production runs...)
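As a rough rule of thumb (a back-of-the-envelope sketch only, treating the ~60 MB result per 60-90 minute job above as typical and ignoring that uploads are bursty rather than evenly spread), you can estimate how many concurrent jobs a given upload link can sustain in steady state:

# Steady-state estimate: how many concurrent CMS jobs an upload link sustains,
# assuming ~60 MB of results per job and ~60 minutes per job (figures from above).
def max_concurrent_jobs(upload_mbit_s, result_mb=60, job_minutes=60):
    upload_mb_per_min = upload_mbit_s / 8 * 60      # Mbit/s -> MB/min
    mb_per_min_per_job = result_mb / job_minutes    # upload demand of one job
    return int(upload_mb_per_min / mb_per_min_per_job)

print(max_concurrent_jobs(1))    # 1 Mbit/s upload  -> about 7 jobs
print(max_concurrent_jobs(10))   # 10 Mbit/s upload -> about 75 jobs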
Then, there was the little glitch this morning when I underestimated how many jobs we had in the queue and blissfully remained in bed while the gales raged around me and the rain lashed the windows. The lack of jobs takes a while to recover from, unfortunately. When we do get to production mode, the number of queued jobs is likely to increase dramatically, so this should be less of a problem.
At the moment, I'm in a data-collecting mode. Feel free to post your efficiencies, etc. I may not be able to reply to every input.
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38233 - Posted: 13 Mar 2019, 1:02:33 UTC - in response to Message 38231.  

At the moment, I'm in a data-collecting mode. Feel free to post your efficiencies, etc. I may not be able to reply to every input.

I saw an average 71% efficiency for 4 CMS tasks on my old Athlon64 X2, but on my old Xeon the efficiency for these 7 tasks ranges from 6% to 39%.
Note that the Xeon reports abysmal IOPS compared to even the old Athlon64, which means it should process events abysmally slowly (as it does on ATLAS native), and its reported FLOPS is about equal to the Athlon64's, but... that doesn't explain such a low CPU-time-to-run-time efficiency. Or does it?
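For anyone comparing numbers: by efficiency I simply mean CPU time divided by elapsed run time, e.g.:

# CPU efficiency = CPU time / elapsed run time (values below are just examples).
def efficiency(cpu_time_s, run_time_s):
    return 100.0 * cpu_time_s / run_time_s

print(f"{efficiency(40000, 64000):.0f}%")   # ~62%
print(f"{efficiency(4000, 64000):.0f}%")    # ~6%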
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,929,849
RAC: 137,649
Message 38236 - Posted: 13 Mar 2019, 8:46:54 UTC - in response to Message 38231.  

... there are large periods of minimal activity on the part of the VM running the simulations. It appears to me that this is mainly related to the first job that a BOINC task runs, suggesting to me that the workflow is trying to access a resource with limited network connectivity (or otherwise limited throughput). Subsequent tasks appear to start up much faster.

As far as I can see in my proxy log, the initial delay is mainly caused by the huge number (many thousands) of internet requests to different servers, e.g. cvmfs-stratum-one.cern.ch and cms-frontier.openhtc.io.
Even if the proxy serves nearly 98% of those requests from its cache to subsequent VMs that start after the first VM has finished its setup, those subsequent VMs still need roughly 20 min to start job processing.
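
If you run a local Squid proxy and want to check your own cache hit rate, a minimal sketch like the one below works with Squid's default native access.log format (the log path is an assumption; adjust it to your setup):

# Count cache hits vs. misses in a Squid access.log (native log format:
# the 4th field looks like "TCP_HIT/200", "TCP_MEM_HIT/200", "TCP_MISS/200", ...).
from collections import Counter

counts = Counter()
with open("/var/log/squid/access.log") as log:      # path is an assumption
    for line in log:
        fields = line.split()
        if len(fields) > 3:
            result = fields[3].split("/")[0]
            counts["hit" if "HIT" in result else "miss"] += 1

total = sum(counts.values())
if total:
    print(f"cache hit rate: {100.0 * counts['hit'] / total:.1f}% of {total} requests")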


The idle gap between two subsequent jobs seems to be caused by the time needed to upload the result plus a fixed 10-minute Condor delay.
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,369,407
RAC: 101,957
Message 38239 - Posted: 13 Mar 2019, 11:13:07 UTC

My latest task (finished a few minutes ago):
total time: 64,861.61 secs; CPU time: 40,221.78 secs.
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38240 - Posted: 13 Mar 2019, 11:39:10 UTC - in response to Message 38239.  

That is good.
But I have a lot of tasks with about 64,000 secs of total time and only about 4,000 secs of CPU time :-(
My internet connection is stable at 100 Mbit per sec.
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38241 - Posted: 13 Mar 2019, 11:45:38 UTC - in response to Message 38240.  

My best is 32,000 secs of CPU time; total time is always about 64,000 secs.
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 38243 - Posted: 14 Mar 2019, 11:09:27 UTC - in response to Message 38231.  

It appears to me that this is mainly related to the first job that a BOINC task runs, suggesting to me that the workflow is trying to access a resource with limited network connectivity (or otherwise limited throughput).
Maybe using openhtc would help overcome this issue?
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,369,407
RAC: 101,957
Message 38253 - Posted: 16 Mar 2019, 10:14:31 UTC

With my most recent task, it got even worse:

total time: 64,863.64 secs; CPU time: 38,658.58 secs
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38269 - Posted: 18 Mar 2019, 21:14:40 UTC - in response to Message 38253.  

It's running that way for me too, both with single-core VMs for LHC@Home and with single- and dual-core jobs in -dev. At some point the tasks stop running jobs and sit idle until the 18-hour time-out occurs. Unfortunately they are still occupying BOINC slots, so other projects don't take up the slack.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,929,849
RAC: 137,649
Message 38289 - Posted: 19 Mar 2019, 12:18:28 UTC

CMS tasks currently show the following work pattern.

Phase 1:
Download and VM setup

Phase 2:
Process subtask (=job) #1

Phase 3:
Process subtasks #2 - #n

Phase 4:
VM shutdown



Comments

During phase 1 and (partly) phase 2 the VM downloads roughly 17,000 files (about 800 MB).
Most of them are requested from cvmfs-stratum-one.cern.ch (Europe; it may be another stratum-1 CVMFS server depending on the geolocation).
About 10% are requested from cms-frontier.openhtc.io.


Even on a fast (100 Mbit/s) connection this setup takes about 20 min, during which the CPU load is only occasionally above idle.
A local proxy can deliver roughly 98% of these downloads, in which case the startup only needs 5-8 min.
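
A quick sanity check with the rough figures above shows that raw transfer time is not the bottleneck; the setup time is dominated by the sheer number of requests:

# Back-of-the-envelope check using the approximate figures above.
volume_mb  = 800           # data downloaded during setup
files      = 17000         # number of individual requests
link_mb_s  = 100 / 8       # 100 Mbit/s ~= 12.5 MB/s
setup_s    = 20 * 60       # observed setup time without a local proxy

transfer_s     = volume_mb / link_mb_s     # ~64 s of pure transfer
ms_per_request = setup_s / files * 1000    # ~70 ms average per request

print(f"pure transfer ~{transfer_s:.0f} s, average ~{ms_per_request:.0f} ms per request")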


The setup includes a few downloads from the EPEL repositories.
At least at my home location, the VM first tries the slowest possible EPEL mirror (155 Mbit/s) instead of other mirrors that are just as close and provide up to 10,000 Mbit/s.
A workaround is to reject requests to the slow server at the firewall; the VM will then immediately try other mirrors.


Another issue is the access to gitlab.cern.ch, from which the VM requests parts of Singularity.
I sometimes notice VMs that do not correctly resolve the IPs of the gitlab servers.
As a result, the Singularity setup remains incomplete.
Those VMs never start a subtask. Instead they remain idle until the 18h watchdog shuts them down.
I suspect this is caused by a misconfigured DNS at CERN (lame DNS servers; missing/wrong CNAME entries).
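
A quick way to check from a console whether the name resolves at all is something like the following (this only tests name resolution, not whether the returned addresses are the correct ones):

# Check whether gitlab.cern.ch resolves from the current environment.
import socket

host = "gitlab.cern.ch"
try:
    addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443)})
    print(f"{host} resolves to: {', '.join(addrs)}")
except socket.gaierror as err:
    print(f"{host} does not resolve: {err}")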



Phase 2 and each subtask in phase 3 end with an upload of (currently) 40-60 MB, followed by an idle time of usually 10 min.
Each subtask in phase 3 starts with additional downloads of >1,700 files, mainly from cms-frontier.openhtc.io.


The subtask #n that finishes after the 12 h limit should shut down the VM.
Instead, the VM remains idle until the 18 h watchdog shuts it down.
This should be investigated by the CMS team.
A temporary workaround could be to lower the limit in CMS_2016_03_22.xml -> <job_duration>xxxxx</job_duration>.
It must not be set too low, though, so that subtask #n still has enough time to finish.
Changes there become active with the next VM that starts after the change.
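
If you want to see the current value before changing anything, a small sketch like this reads it out (the project directory path below is an assumption for a typical Linux BOINC install; only the <job_duration> element name is taken from above):

# Print the current <job_duration> value from CMS_2016_03_22.xml.
# The path below is an assumption; adjust it to your BOINC data directory.
import xml.etree.ElementTree as ET

path = "/var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome/CMS_2016_03_22.xml"
elem = ET.parse(path).getroot().find(".//job_duration")
if elem is not None:
    seconds = int(elem.text)
    print(f"job_duration: {seconds} s ({seconds / 3600:.1f} h)")
else:
    print("no <job_duration> element found")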
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38341 - Posted: 20 Mar 2019, 10:48:26 UTC - in response to Message 38289.  

Thanks for that analysis -- I was going to ask for some of your logs, but you've saved me the trouble. I'll pass your comments on. There are still several days left in this workflow; if there's no resolution by then I'll submit a batch of the previous workflow to see if it has the same problems.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 38345 - Posted: 20 Mar 2019, 13:45:35 UTC
Last modified: 20 Mar 2019, 13:46:53 UTC

We've got a couple more clues as to when some problems may have started (Mar 15 15:14 CERN time) but no solution yet. Curiously that was when we'd written just a very small fraction more than 1 TiB of merged results to central storage...
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 119
Credit: 51,287,231
RAC: 20,565
Message 38359 - Posted: 21 Mar 2019, 11:20:10 UTC - in response to Message 38289.  

CMS tasks currently show the following work pattern.
[...]


Theory Simulation multi-core tasks have the same problem (idle cores between 12 and 18 hours of runtime), but Theory Simulation single-core tasks are fine.
Maybe the experience of the Theory Simulation team can be useful for the CMS team.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,929,849
RAC: 137,649
Message 38362 - Posted: 21 Mar 2019, 11:53:47 UTC - in response to Message 38359.  

Theory Simulation multi-core tasks have the same problem (idle cores between 12 and 18 hours of runtime), but Theory Simulation single-core tasks are fine.
Maybe the experience of the Theory Simulation team can be useful for the CMS team.

The multi-core idle problem at Theory is caused by long-runners like Sherpa.
While other cores may have finished their jobs, the long-runner keeps the whole VM active.
Worst case is the forced 18 h watchdog shutdown.

From the BOINC perspective, the idle cores remain allocated until the Theory task is completely shut down.
This is one reason to prefer a single-core setup, as long as the host has enough RAM to run a couple of them concurrently.
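
As a trivial sizing aid (the RAM-per-VM figure below is a placeholder assumption, not a project number; check the actual memory setting of the VMs on your host):

# How many single-core VMs fit into the RAM you are willing to give BOINC?
def max_concurrent_vms(available_ram_mb, ram_per_vm_mb=2048):   # 2048 is a placeholder
    return available_ram_mb // ram_per_vm_mb

print(max_concurrent_vms(16384))   # e.g. 16 GB available -> 8 VMs at the assumed size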

CMS jobs have comparable runtimes. Hence the final idle period in a multicore setup shouldn't be that long.
You may test it at the -dev project.

Be aware that ATLAS behaves completely differently, as it really uses all configured cores per job.
