Message boards : CMS Application : Larger jobs in the pipeline
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 28,391
The current jobs seem to have a problem. Although glidein returns "0" (which causes BOINC to mark them valid), most tasks "finish" in less than 20 min. The Grafana pages show a steady decrease since 1:48 UTC: https://monit-grafana.cern.ch/d/o3dI49GMz/cms-job-monitoring-12m?orgId=11&from=now-12h&to=now-12m&var-group_by=CMS_JobType&var-Tier=All&var-CMS_WMTool=All&var-CMS_SubmissionTool=All&var-CMS_CampaignType=All&var-Site=T3_CH_Volunteer&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-adhoc=data.RecordTime%7C%3E%7Cnow-12h&viewPanel=15
The BOINC server is still sending out tasks.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
Hmm, that wasn't supposed to happen... I've resubmitted a workflow with the earlier parameters. The "new" batch has 2,000 jobs created and pending, but condor isn't sending any jobs on request. I'll be at work in an hour or so; if the situation hasn't improved by then, I'll kill the workflow with the larger jobs.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 28,391
... calculated disk size of the larger jobs exceeded the "requested" size. The disk image used for CMS has an upper limit of 20 GB. Recent tasks start around 2.7 GB and grow to nearly 4 GB. What size do you expect the vdi to reach by the end of a new (12 h) task? If the additional data is taken from CVMFS and is not yet included in the existing vdi (CVMFS cache), a fresh vdi should be prepared. In addition, the app template on the BOINC server might need to be set to a higher <rsc_disk_bound> value.
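For volunteers who want to see how close the image actually gets to that limit, one quick check is the size of the VM disk image inside the running BOINC slot. This is only a sketch under assumptions: it uses the default Linux data directory and the usual vm_image.vdi file name, both of which may differ on your installation.

```bash
#!/bin/bash
# Sketch: report the size of the VM disk image in each active BOINC slot.
# BOINC_DIR and the vm_image.vdi name are assumptions (default Linux client);
# adjust them for your own setup, e.g. a ~/BOINC install or a Windows path.
BOINC_DIR=/var/lib/boinc-client

for img in "$BOINC_DIR"/slots/*/vm_image.vdi; do
    [ -e "$img" ] || continue            # skip slots without a VM image
    du -h --apparent-size "$img"         # logical size of the dynamically growing vdi
    du -h "$img"                         # space actually allocated on the host disk
done
```

Running this a few times over the life of a longer task shows how quickly the image grows towards the 20 GB ceiling mentioned above.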
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374
The second CMS task has been running since yesterday. If it is one of the long runners, I saw a 2 GByte download at the beginning. No problems so far: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10567798
Curiously, it's the computer used for the WSL2 testing before; hardware acceleration is now accepted.
Joined: 27 Sep 08 Posts: 850 Credit: 692,824,076 RAC: 62,588
How would I even know if they are bigger? I don't see any increase in failures for CMS.
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374
There are users deleting hundreds of CMS tasks because of faulty LHC preferences, instead of deselecting CMS. I don't know whether the big number is the tasks sent to us or the CMS tasks actually being worked on.
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374
Today I have these lines shown in a task; no new work inside for 3 or 4 hours:

ERROR:root:Attempt 1 to stage out failed. Automatically retrying in 300 secs
Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:110
Output:
stdout: Wed Jan 26 18:08:42 CET 2022
Copying 120410657 bytes file:///srv/job/WMTaskSpace/cmsRun1/FEVTDEBUGoutput.root => https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root
gfal-copy exit status: 110
ERROR: gfal-copy exited with 110
Cleaning up failed file:
Wed Jan 26 18:49:14 CET 2022
https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root MISSING
stderr: /srv/startup_environment.sh: line 2: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 28: EUID: readonly variable
/srv/startup_environment.sh: line 137: PPID: readonly variable
/srv/startup_environment.sh: line 145: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 159: UID: readonly variable
/srv/startup_environment.sh: line 190: syntax error near unexpected token `('
/srv/startup_environment.sh: line 190: `export probe_cvmfs_repos () '
Command timed out after 2400 seconds!
/srv/startup_environment.sh: line 2: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 28: EUID: readonly variable
/srv/startup_environment.sh: line 137: PPID: readonly variable
/srv/startup_environment.sh: line 145: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 159: UID: readonly variable
/srv/startup_environment.sh: line 190: syntax error near unexpected token `('
/srv/startup_environment.sh: line 190: `export probe_cvmfs_repos () '
ClassName : None
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
LineNumber : 32
ErrorNr : 0
Command :
#!/bin/bash
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p file:///srv/job/WMTaskSpace/cmsRun1/FEVTDEBUGoutput.root https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root'
EXIT_STATUS=$?
echo "gfal-copy exit status: $EXIT_STATUS"
if [[ $EXIT_STATUS != 0 ]]; then
    echo "ERROR: gfal-copy exited with $EXIT_STATUS"
    echo "Cleaning up failed file:"
    env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root '
fi
exit $EXIT_STATUS
ExitCode : 110
ErrorCode : 60311
ErrorType : GeneralStageOutFailure
Traceback:
<@---------- WMException End ----------@>
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
> How would I even know if they are bigger?
You'd have to do the forensic analysis some of our volunteers carry out, like seeing that jobs take twice as long, or that data transfers are twice as big but half as often. As it is, I've rolled back on it because of the disk problem, and because my health has been a little more delicate than usual this week.
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 374
All the best for your health, Ivan.
Joined: 27 Sep 08 Posts: 850 Credit: 692,824,076 RAC: 62,588
Hope you recover quickly. Thanks for the comments.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
Well, after some delay, our updated WMAgent is running, so I can return to this topic again. After some review (grepping through my mail archive...), I discovered that I faced the exact problem here almost exactly three years ago (05/02/2019). The fix is simple.
I'm running some shorter jobs first, from another project, to get some statistics to feed back to the Monte-Carlo production team, but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038
> ... but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
Hi Ivan,
2, 4 or 8 hour jobs... how long a job runs depends on the CPU speed and the number of cores used on each computer. Maybe you could tell us how many records/events per job the "2", "4" and "8" hour jobs contain, so anyone can estimate how long a job will run on their own hardware. Before the larger jobs were started there were 10000 events in 1 job; I suppose those are your "2" hour jobs.
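Once those events-per-job numbers are known, a rough walltime estimate is simply the events in a job divided by the event rate a given host achieves. The sketch below only illustrates that arithmetic; both default values are placeholders, not project figures, and the per-hour rate has to be measured from your own finished tasks.

```bash
#!/bin/bash
# Rough CMS job walltime estimate: events in the job / events this host
# processes per hour. Both defaults are placeholders -- replace them with
# the events-per-job figure from the project and your own measured rate.
EVENTS_PER_JOB=${1:-10000}      # e.g. the "usual" 2-hour jobs
EVENTS_PER_HOUR=${2:-2000}      # placeholder; measure this on your own hardware

awk -v events="$EVENTS_PER_JOB" -v rate="$EVENTS_PER_HOUR" \
    'BEGIN { printf "Estimated walltime: %.1f hours\n", events / rate }'
```

For example, `./estimate_walltime.sh 20000 2000` would print an estimate of 10.0 hours for a hypothetical 20000-event job on a host that manages 2000 events per hour.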
Joined: 17 Sep 04 Posts: 105 Credit: 32,824,862 RAC: 72
The goal is to have these longer-running jobs become the standard so that jobs are processed more efficiently (start-up time allocated over a larger number of processed jobs)? Thank you. Regards, Bob P.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
> ... but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
> Hi Ivan,
Yes, that's the average time as given by the job graphs. I'm aware that older/slower machines will take longer.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245
> The goal is to have these longer-running jobs become the standard so that jobs are processed more efficiently (start-up time allocated over a larger number of processed jobs)?
Yes, partly that, and also because 8-hour jobs are the standard for CMS Production, which we are trying to move towards. We want to see how feasible that is, given the mix of machinery and network access in the volunteer community.
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 28,391
It appears that the tasks currently in the queue are running jobs with 20000 records (before: 10000) and that the tasks shut down after 1 job (before: more than 1). Is this intentional? On my slowest box (i7-3770K) this results in task walltimes of around 11 h.