Message boards : CMS Application : Larger jobs in the pipeline
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46091 - Posted: 18 Jan 2022, 19:30:35 UTC
Last modified: 19 Jan 2022, 0:34:49 UTC

The next set of jobs I submit will be twice the size of our previous batches. They should start being distributed late tomorrow. Please let me know if you see any significant effects, good or bad, on your throughput, efficiency, bandwidth, etc.
ID: 46091
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,113
RAC: 136,137
Message 46093 - Posted: 20 Jan 2022, 6:03:35 UTC - in response to Message 46091.  

The current jobs seem to have a problem.
Although glidein returns "0" (which causes BOINC to mark them valid), most tasks "finish" in less than 20 minutes.

The Grafana pages show a steady decrease since 1:48 UTC:
https://monit-grafana.cern.ch/d/o3dI49GMz/cms-job-monitoring-12m?orgId=11&from=now-12h&to=now-12m&var-group_by=CMS_JobType&var-Tier=All&var-CMS_WMTool=All&var-CMS_SubmissionTool=All&var-CMS_CampaignType=All&var-Site=T3_CH_Volunteer&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-adhoc=data.RecordTime%7C%3E%7Cnow-12h&viewPanel=15

The BOINC server is still sending out tasks.
ID: 46093
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46094 - Posted: 20 Jan 2022, 9:46:31 UTC - in response to Message 46093.  

Hmm, that wasn't supposed to happen...
I've resubmitted a workflow with the earlier parameters. The "new" batch has 2,000 jobs created and pending, but condor isn't sending any jobs on request. I'll be at work in an hour or so; if the situation hasn't improved I'll kill the workflow with the larger jobs.
ID: 46094
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46096 - Posted: 20 Jan 2022, 12:04:48 UTC - in response to Message 46094.  

There's a problem with the WMAgent. Whether that's related to my workflow submission, I don't know. The responsible people have been e-mailed.
ID: 46096
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46097 - Posted: 20 Jan 2022, 15:10:11 UTC - in response to Message 46096.  

Despite the problem with WMAgent (since corrected) it seems that the calculated disk size of the larger jobs exceeded the "requested" size. Since I've yet to find out where that requested size is set, I've reverted to the usual parameters.
ID: 46097
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,113
RAC: 136,137
Message 46099 - Posted: 20 Jan 2022, 15:51:50 UTC - in response to Message 46097.  

... calculated disk size of the larger jobs exceeded the "requested" size.

The disk image used for CMS has an upper limit of 20 GB.
Recent tasks start around 2.7 GB and grow to nearly 4 GB.
What size do you expect the vdi to reach by the end of a new task (12 h)?

If the additional data is taken from CVMFS and is not yet included in the existing vdi (CVMFS cache), a fresh vdi should be prepared.


In addition, the app template on the BOINC server might need a higher <rsc_disk_bound> value.
ID: 46099
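[For context: the <rsc_disk_bound> value computezrmle mentions lives in the workunit's input template on the BOINC server. A minimal sketch of such a template follows; the element names are standard BOINC, but the 20 GB figure merely mirrors the vdi limit mentioned above and is an assumption, not the project's actual setting.]

```xml
<input_template>
    <workunit>
        <!-- assumed example: 20 GB upper bound on disk usage, in bytes -->
        <rsc_disk_bound>20000000000</rsc_disk_bound>
    </workunit>
</input_template>
```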
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,182,852
RAC: 104,928
Message 46104 - Posted: 24 Jan 2022, 10:04:30 UTC - in response to Message 46097.  

My second CMS task has been running since yesterday.
If it's a long-runner, I saw a 2 GByte download at the beginning.
No problems so far: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10567798
Curiously, it's the computer used for WSL2 testing before. Now hardware acceleration is accepted.
ID: 46104
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,829,500
RAC: 228,358
Message 46107 - Posted: 24 Jan 2022, 17:48:28 UTC
Last modified: 24 Jan 2022, 17:48:52 UTC

How would I even know if they are bigger?

I don't see any increase in failures for CMS
ID: 46107
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,182,852
RAC: 104,928
Message 46108 - Posted: 24 Jan 2022, 18:45:53 UTC - in response to Message 46107.  

There are users aborting hundreds of CMS tasks because of faulty LHC prefs, instead of deselecting CMS.
I don't know whether "bigger" refers to the number of tasks sent to us, or to the individual CMS tasks.
ID: 46108
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,182,852
RAC: 104,928
Message 46112 - Posted: 26 Jan 2022, 17:55:05 UTC - in response to Message 46108.  

Today these lines showed up in a task; no new work inside for 3 or 4 hours:
ERROR:root:Attempt 1 to stage out failed.
Automatically retrying in 300 secs
Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:110
Output: stdout: Wed Jan 26 18:08:42 CET 2022
Copying 120410657 bytes file:///srv/job/WMTaskSpace/cmsRun1/FEVTDEBUGoutput.root => https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root
gfal-copy exit status: 110
ERROR: gfal-copy exited with 110
Cleaning up failed file:
Wed Jan 26 18:49:14 CET 2022
https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root MISSING

stderr: /srv/startup_environment.sh: line 2: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 28: EUID: readonly variable
/srv/startup_environment.sh: line 137: PPID: readonly variable
/srv/startup_environment.sh: line 145: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 159: UID: readonly variable
/srv/startup_environment.sh: line 190: syntax error near unexpected token `('
/srv/startup_environment.sh: line 190: `export probe_cvmfs_repos () '
Command timed out after 2400 seconds!
/srv/startup_environment.sh: line 2: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 28: EUID: readonly variable
/srv/startup_environment.sh: line 137: PPID: readonly variable
/srv/startup_environment.sh: line 145: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 159: UID: readonly variable
/srv/startup_environment.sh: line 190: syntax error near unexpected token `('
/srv/startup_environment.sh: line 190: `export probe_cvmfs_repos () '

ClassName : None
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
LineNumber : 32
ErrorNr : 0
Command : #!/bin/bash
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p file:///srv/job/WMTaskSpace/cmsRun1/FEVTDEBUGoutput.root https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root'
EXIT_STATUS=$?
echo "gfal-copy exit status: $EXIT_STATUS"
if [[ $EXIT_STATUS != 0 ]]; then
echo "ERROR: gfal-copy exited with $EXIT_STATUS"
echo "Cleaning up failed file:"
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root '
fi
exit $EXIT_STATUS

ExitCode : 110
ErrorCode : 60311
ErrorType : GeneralStageOutFailure

Traceback:

<@---------- WMException End ----------@>
ID: 46112
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46113 - Posted: 26 Jan 2022, 18:00:08 UTC - in response to Message 46107.  

How would I even know if they are bigger?

I don't see any increase in failures for CMS

You'd have to do the forensic analysis some of our volunteers carry out. Like seeing that jobs take twice as long, or that data transfers are twice as big but half as often. As it is, I've rolled back on it because of the disk problem, and because my health has been a little more delicate than usual this week.
ID: 46113
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,182,852
RAC: 104,928
Message 46114 - Posted: 26 Jan 2022, 18:03:50 UTC - in response to Message 46113.  

All the best for your health, Ivan.
ID: 46114
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,829,500
RAC: 228,358
Message 46115 - Posted: 26 Jan 2022, 21:45:40 UTC - in response to Message 46113.  

Hope you recover quickly; thanks for the comments.
ID: 46115
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46127 - Posted: 30 Jan 2022, 20:48:52 UTC

I tried a new workflow at the weekend, but you may not have noticed it (I only ran 100 jobs). Jobs with a 30 MB result output only took about 5 minutes.
Not the CPU/MB ratio we prefer...
ID: 46127
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46142 - Posted: 1 Feb 2022, 19:06:24 UTC
Last modified: 1 Feb 2022, 19:06:54 UTC

Well, after some delay, our updated WMAgent is running, so I can return to this topic. After some review (grepping through my mail archive...), I discovered that I faced exactly this problem almost three years ago (05/02/2019). The fix is simple.
I'm running some shorter jobs first, from another project, to get some statistics to feed back to the Monte-Carlo production team, but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
ID: 46142
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,637
RAC: 1,939
Message 46148 - Posted: 2 Feb 2022, 13:12:14 UTC - in response to Message 46142.  
Last modified: 2 Feb 2022, 13:14:41 UTC

... but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
Hi Ivan,

2-, 4- or 8-hour jobs... how long a job runs depends entirely on the CPU speed and the number of cores in use on your computer.
Maybe you could say how many records/events one job contains in the "2", "4" and "8" hour batches, so anyone can estimate how long a job will run on their own hardware.

Before the larger jobs started there were 10,000 events per job. I suppose those are your "2-hour" jobs.
ID: 46148
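[Crystal Pellet's estimation idea can be sketched directly: given the events per job and a host's measured event rate, wall time scales linearly. A rough bash sketch follows; the 10,000-events figure comes from the post, while the events-per-hour rate is a made-up example for illustration.]

```shell
#!/bin/bash
# Rough wall-time estimate: events per job divided by this host's event rate.
# Usage: estimate_hours <events_per_job> <events_per_hour>
estimate_hours() {
    local events=$1 rate=$2
    # integer division, rounded up to whole hours
    echo $(( (events + rate - 1) / rate ))
}

estimate_hours 10000 5000   # a host doing 5000 events/h needs ~2 h per 10,000-event job
```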
rbpeake

Joined: 17 Sep 04
Posts: 99
Credit: 30,619,757
RAC: 3,820
Message 46149 - Posted: 2 Feb 2022, 14:22:20 UTC

The goal is to have these longer-running jobs become the standard so that work is processed more efficiently (start-up time amortized over more processing per job)?
Thank you.
Regards,
Bob P.
ID: 46149
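[The amortization rbpeake asks about can be put in numbers: a fixed start-up cost spread over a longer job leaves a larger fraction of wall time for useful compute. A small bash sketch follows; the 10-minute start-up cost and the job lengths are assumptions for illustration only, not measured values.]

```shell
#!/bin/bash
# Percentage of wall time spent computing, given a fixed start-up cost.
# Usage: efficiency_pct <startup_minutes> <job_minutes>
efficiency_pct() {
    local startup=$1 job=$2
    echo $(( 100 * job / (startup + job) ))
}

efficiency_pct 10 120   # two-hour job
efficiency_pct 10 480   # eight-hour job
```

[Under these assumed numbers the two-hour job spends about 92% of its wall time computing while the eight-hour job reaches about 97%, which matches the rationale in the reply below.]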
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46150 - Posted: 2 Feb 2022, 14:36:13 UTC - in response to Message 46148.  

... but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
Hi Ivan,

2-, 4- or 8-hour jobs... how long a job runs depends entirely on the CPU speed and the number of cores in use on your computer.
Maybe you could say how many records/events one job contains in the "2", "4" and "8" hour batches, so anyone can estimate how long a job will run on their own hardware.

Before the larger jobs started there were 10,000 events per job. I suppose those are your "2-hour" jobs.

Yes, that's the average time as given by the job graphs. I'm aware that older/slower machines will take longer.
ID: 46150
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 46151 - Posted: 2 Feb 2022, 14:39:54 UTC - in response to Message 46149.  

The goal is to have these longer-running jobs become the standard so that work is processed more efficiently (start-up time amortized over more processing per job)?
Thank you.

Yes, partly that, and also because 8-hour jobs are the standard for CMS Production which we are trying to move towards. We want to see how feasible that is, given the mix of machinery and network access in the volunteer community.
ID: 46151
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,113
RAC: 136,137
Message 46158 - Posted: 3 Feb 2022, 9:14:47 UTC

It appears that the tasks currently in the queue are running jobs with 20,000 records (before: 10,000), and that the tasks shut down after one job (before: more than one).
Is this intentional?

On my slowest box (i7-3770K) this results in task walltimes around 11 h.
ID: 46158