Message boards : CMS Application : Larger jobs in the pipeline

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46091 - Posted: 18 Jan 2022, 19:30:35 UTC
Last modified: 19 Jan 2022, 0:34:49 UTC

The next set of jobs I submit will be twice the size of our previous batches. They should start being distributed late tomorrow. Please let me know if you see any significant effects, good or bad, on your throughput, efficiency, bandwidth, etc.
ID: 46091
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 28,391
Message 46093 - Posted: 20 Jan 2022, 6:03:35 UTC - in response to Message 46091.  

The current jobs seem to have a problem.
Although glidein returns "0" (which causes BOINC to mark the tasks as valid), most tasks "finish" in less than 20 min.

The Grafana pages show a steady decrease since 1:48 UTC:
https://monit-grafana.cern.ch/d/o3dI49GMz/cms-job-monitoring-12m?orgId=11&from=now-12h&to=now-12m&var-group_by=CMS_JobType&var-Tier=All&var-CMS_WMTool=All&var-CMS_SubmissionTool=All&var-CMS_CampaignType=All&var-Site=T3_CH_Volunteer&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-adhoc=data.RecordTime%7C%3E%7Cnow-12h&viewPanel=15

The BOINC server is still sending out tasks.
ID: 46093
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46094 - Posted: 20 Jan 2022, 9:46:31 UTC - in response to Message 46093.  

Hmm, that wasn't supposed to happen...
I've resubmitted a workflow with the earlier parameters. The "new" batch has 2,000 jobs created and pending, but condor isn't sending any jobs on request. I'll be at work in an hour or so; if the situation hasn't improved, I'll kill the workflow with the larger jobs.
ID: 46094
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46096 - Posted: 20 Jan 2022, 12:04:48 UTC - in response to Message 46094.  

There's a problem with the WMAgent. Whether that's related to my workflow submission, I don't know. The people responsible have been e-mailed.
ID: 46096
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46097 - Posted: 20 Jan 2022, 15:10:11 UTC - in response to Message 46096.  

Apart from the problem with WMAgent (since corrected), it seems that the calculated disk size of the larger jobs exceeded the "requested" size. Since I've yet to find out where that requested size is set, I've reverted to the usual parameters.
ID: 46097
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 28,391
Message 46099 - Posted: 20 Jan 2022, 15:51:50 UTC - in response to Message 46097.  

... calculated disk size of the larger jobs exceeded the "requested" size.

The disk image used for CMS has an upper limit of 20 GB.
Recent tasks start around 2.7 GB and grow to nearly 4 GB.
What size do you expect the vdi will use at the end of a new task (12 h)?

If the additional data is taken from CVMFS and is not yet included in the existing vdi (CVMFS cache), a fresh vdi should be prepared.


In addition, the app template on the BOINC server might need to be set to a higher <rsc_disk_bound> value.
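
As an illustration of the <rsc_disk_bound> point above, here is a minimal Python sketch that raises the value in a BOINC-style input template. The template text is a trimmed, hypothetical stand-in (the real CMS template is not shown in this thread), and the 8 GB starting value is an assumption. The 20 GB target simply mirrors the vdi upper limit mentioned above; <rsc_disk_bound> is given in bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed input template (assumption, not the real CMS one).
TEMPLATE = """<input_template>
  <workunit>
    <rsc_disk_bound>8000000000</rsc_disk_bound>
  </workunit>
</input_template>"""


def raise_disk_bound(template_xml: str, new_bound_bytes: int) -> str:
    """Return the template with <rsc_disk_bound> raised to new_bound_bytes."""
    root = ET.fromstring(template_xml)
    node = root.find("./workunit/rsc_disk_bound")
    if node is None:
        raise ValueError("template has no <rsc_disk_bound> element")
    # Only ever raise the bound; never lower an existing, larger value.
    if new_bound_bytes > int(float(node.text)):
        node.text = str(new_bound_bytes)
    return ET.tostring(root, encoding="unicode")


# 20 GB, matching the upper limit of the CMS disk image quoted above.
updated = raise_disk_bound(TEMPLATE, 20 * 10**9)
print(updated)
```

Whether the CMS app actually needs a value this high depends on how large the vdi grows during a longer task, which is the open question in this post.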
ID: 46099
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 374
Message 46104 - Posted: 24 Jan 2022, 10:04:30 UTC - in response to Message 46097.  

A second CMS task has been running since yesterday.
If it is a long-runner: I saw a 2 GByte download at the beginning.
No problems so far: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10567798
Curiously, this is the computer I used for WSL2 testing before. Now hardware acceleration is accepted.
ID: 46104
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 850
Credit: 692,824,076
RAC: 62,588
Message 46107 - Posted: 24 Jan 2022, 17:48:28 UTC
Last modified: 24 Jan 2022, 17:48:52 UTC

How would I even know if they are bigger?

I don't see any increase in failures for CMS
ID: 46107
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 374
Message 46108 - Posted: 24 Jan 2022, 18:45:53 UTC - in response to Message 46107.  

There are users deleting hundreds of CMS tasks because of faulty LHC preferences, instead of deselecting CMS.
I don't know whether "bigger" refers to the number of tasks sent to us or to the work done inside each CMS task.
ID: 46108
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 374
Message 46112 - Posted: 26 Jan 2022, 17:55:05 UTC - in response to Message 46108.  

Today these lines are shown in a task; there has been no new work inside it for 3 or 4 hours:
ERROR:root:Attempt 1 to stage out failed.
Automatically retrying in 300 secs
Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:110
Output: stdout: Wed Jan 26 18:08:42 CET 2022
Copying 120410657 bytes file:///srv/job/WMTaskSpace/cmsRun1/FEVTDEBUGoutput.root => https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root
gfal-copy exit status: 110
ERROR: gfal-copy exited with 110
Cleaning up failed file:
Wed Jan 26 18:49:14 CET 2022
https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root MISSING

stderr: /srv/startup_environment.sh: line 2: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 28: EUID: readonly variable
/srv/startup_environment.sh: line 137: PPID: readonly variable
/srv/startup_environment.sh: line 145: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 159: UID: readonly variable
/srv/startup_environment.sh: line 190: syntax error near unexpected token `('
/srv/startup_environment.sh: line 190: `export probe_cvmfs_repos () '
Command timed out after 2400 seconds!
/srv/startup_environment.sh: line 2: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 28: EUID: readonly variable
/srv/startup_environment.sh: line 137: PPID: readonly variable
/srv/startup_environment.sh: line 145: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 159: UID: readonly variable
/srv/startup_environment.sh: line 190: syntax error near unexpected token `('
/srv/startup_environment.sh: line 190: `export probe_cvmfs_repos () '

ClassName : None
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
LineNumber : 32
ErrorNr : 0
Command : #!/bin/bash
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p file:///srv/job/WMTaskSpace/cmsRun1/FEVTDEBUGoutput.root https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root'
EXIT_STATUS=$?
echo "gfal-copy exit status: $EXIT_STATUS"
if [[ $EXIT_STATUS != 0 ]]; then
echo "ERROR: gfal-copy exited with $EXIT_STATUS"
echo "Cleaning up failed file:"
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00019/F510092E-8A5E-EC4E-B95C-2211A43D618F.root '
fi
exit $EXIT_STATUS

ExitCode : 110
ErrorCode : 60311
ErrorType : GeneralStageOutFailure

Traceback:

<@---------- WMException End ----------@>
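
A quick check of the numbers in the log above, as a small Python sketch. It uses only values that appear in the log (the file size, the 2400-second gfal-copy timeout and the two timestamps). The conclusion that the transfer simply ran into the timeout is an inference from those figures, not something the log states directly.

```python
from datetime import datetime

FILE_BYTES = 120_410_657   # "Copying 120410657 bytes" from the log
TIMEOUT_S = 2400           # gfal-copy -t 2400 -T 2400

# Minimum sustained upload rate needed to finish before the timeout.
min_rate = FILE_BYTES / TIMEOUT_S

# Time between the start of the copy and the start of the cleanup step.
start = datetime.strptime("18:08:42", "%H:%M:%S")
cleanup = datetime.strptime("18:49:14", "%H:%M:%S")
elapsed_s = (cleanup - start).total_seconds()

print(f"minimum rate: {min_rate / 1000:.1f} kB/s")
print(f"elapsed before cleanup: {elapsed_s:.0f} s (timeout {TIMEOUT_S} s)")
```

The elapsed time is just over the 2400-second limit, which fits the "Command timed out after 2400 seconds!" line; an upload averaging roughly 50 kB/s or more would have completed in time.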
ID: 46112
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46113 - Posted: 26 Jan 2022, 18:00:08 UTC - in response to Message 46107.  

How would I even know if they are bigger?

I don't see any increase in failures for CMS

You'd have to do the forensic analysis some of our volunteers carry out. Like seeing that jobs take twice as long, or that data transfers are twice as big but half as often. As it is, I've rolled back on it because of the disk problem, and because my health has been a little more delicate than usual this week.
ID: 46113
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 374
Message 46114 - Posted: 26 Jan 2022, 18:03:50 UTC - in response to Message 46113.  

All the best for your health, Ivan.
ID: 46114
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 850
Credit: 692,824,076
RAC: 62,588
Message 46115 - Posted: 26 Jan 2022, 21:45:40 UTC - in response to Message 46113.  

Hope you recover quickly, and thanks for the comments.
ID: 46115
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46127 - Posted: 30 Jan 2022, 20:48:52 UTC

I tried a new workflow at the weekend, but you may not have noticed it (I only ran 100 jobs). Jobs with a 30 MB result output only took about 5 minutes.
Not the CPU/MB ratio we prefer...
ID: 46127
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46142 - Posted: 1 Feb 2022, 19:06:24 UTC
Last modified: 1 Feb 2022, 19:06:54 UTC

Well, after some delay, our updated WMAgent is running, so I can return to this topic again. After some review (grepping through my mail archive...), I discovered that I faced the exact problem here almost exactly three years ago (05/02/2019). The fix is simple.
I'm running some shorter jobs first, from another project, to get some statistics to feed back to the Monte-Carlo production team, but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
ID: 46142
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 46148 - Posted: 2 Feb 2022, 13:12:14 UTC - in response to Message 46142.  
Last modified: 2 Feb 2022, 13:14:41 UTC

... but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
Hi Ivan,

2-, 4- or 8-hour jobs... How long a job will run depends on the CPU speed and the number of cores used on your computer.
Maybe you could tell us how many records/events there are in one of the "2", "4" or "8" hour jobs, so anyone can estimate how long a job will run on their own hardware.

Before the larger jobs started, there were 10000 events in 1 job. I suppose those are your "2" hour jobs.
ID: 46148
rbpeake

Joined: 17 Sep 04
Posts: 105
Credit: 32,824,862
RAC: 72
Message 46149 - Posted: 2 Feb 2022, 14:22:20 UTC

The goal is to have these longer running jobs become the standard so that jobs are processed more efficiently (start-up time allocated over a larger number of processed jobs)?
Thank you.
Regards,
Bob P.
ID: 46149
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46150 - Posted: 2 Feb 2022, 14:36:13 UTC - in response to Message 46148.  

... but after that I'll try again to submit 4-hour jobs. If that goes well, I'll also do a short batch of 8-hour jobs before reverting to the "usual" two-hour jobs. Do please report any problems you encounter.
Hi Ivan,

2-, 4- or 8-hour jobs... How long a job will run depends on the CPU speed and the number of cores used on your computer.
Maybe you could tell us how many records/events there are in one of the "2", "4" or "8" hour jobs, so anyone can estimate how long a job will run on their own hardware.

Before the larger jobs started, there were 10000 events in 1 job. I suppose those are your "2" hour jobs.

Yes, that's the average time as given by the job graphs. I'm aware that older/slower machines will take longer.
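
For anyone wanting to turn that into an estimate for their own machine, here is a rough Python sketch. It assumes run time scales linearly with the number of events, and it takes the figures above as the reference point: 10000 events in about 2 hours on an average machine. The `slowdown` factor is a hypothetical knob for how much slower (or faster) your own host is than that average.

```python
def estimate_hours(events: int,
                   ref_events: int = 10_000,
                   ref_hours: float = 2.0,
                   slowdown: float = 1.0) -> float:
    """Estimate job run time, assuming it scales linearly with event count.

    ref_events/ref_hours: the "average" reference from this thread.
    slowdown: assumed ratio of your host's run time to the average host's.
    """
    return ref_hours * (events / ref_events) * slowdown


for n in (10_000, 20_000, 40_000):
    print(f"{n} events: about {estimate_hours(n):.1f} h on an average host")
```

To calibrate `slowdown`, divide the run time you observed for a 10000-event job by 2 hours and pass the result in.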
ID: 46150
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 46151 - Posted: 2 Feb 2022, 14:39:54 UTC - in response to Message 46149.  

The goal is to have these longer running jobs become the standard so that jobs are processed more efficiently (start-up time allocated over a larger number of processed jobs)?
Thank you.

Yes, partly that, and also because 8-hour jobs are the standard for CMS Production which we are trying to move towards. We want to see how feasible that is, given the mix of machinery and network access in the volunteer community.
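
The efficiency argument can be sketched numerically in Python. This is only an illustration under an assumed model: each job carries a fixed overhead (set-up, stage-out and so on) that does no useful computation. The 10-minute overhead is a made-up placeholder, not a measured CMS figure.

```python
def useful_fraction(job_hours: float, overhead_minutes: float = 10.0) -> float:
    """Fraction of wall time spent on useful work for one job.

    overhead_minutes is a hypothetical fixed per-job cost (assumption).
    """
    job_minutes = job_hours * 60.0
    return job_minutes / (job_minutes + overhead_minutes)


for hours in (2.0, 4.0, 8.0):
    print(f"{hours:.0f}-hour job: {useful_fraction(hours):.1%} useful")
```

With any fixed overhead, the useful fraction rises as the job gets longer, which is the amortisation effect described in the question. The trade-off is that a failed or aborted long job wastes more work.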
ID: 46151
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 28,391
Message 46158 - Posted: 3 Feb 2022, 9:14:47 UTC

It appears that the tasks currently in the queue are running jobs with 20000 records (before: 10000) and that the tasks shut down after 1 job (before: more than 1).
Is this intentional?

On my slowest box (i7-3770K) this results in task walltimes around 11 h.
ID: 46158


©2024 CERN