Message boards : CMS Application : Larger jobs in the pipeline
Joined: 14 Jan 10 Posts: 1429 Credit: 9,539,339 RAC: 5,065

Yesterday I had 3 VMs running. After each had done 3 jobs (20000 events/job), I didn't get new jobs at that moment, so I decided to shut down the VMs gracefully and let the PC sleep overnight.

@Ivan: The way BOINC-CMS is set up now, running larger jobs will waste CPU power on the slower machines. Example: see computezrmle's post before this one. One job needs 11 hours. Normally a second one is started, because the 12 hours are not over yet, but at the 18-hour mark this second running job is killed before it can finish. Had the first job been 40,000 events, that job would have been totally useless.

I recommend setting BOINC-CMS up the way it was done with Theory some years ago: only run one job during the BOINC VM's lifetime.
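A minimal worked sketch of the timing concern above, assuming the 12-hour minimum and 18-hour maximum VM lifetimes mentioned in this thread (the limits and job lengths come from the posts; the function itself is only illustrative):

```python
# Rough sketch of the scheduling concern described above.
# Assumptions (taken from this thread, not from BOINC or CMS code):
#   - the VM still starts a new job while it is younger than 12 hours,
#   - the VM is shut down at 18 hours, killing any job still running.
MIN_LIFETIME_H = 12.0   # new jobs are still fetched before this age
MAX_LIFETIME_H = 18.0   # VM is terminated at this age

def second_job_is_wasted(first_job_h: float, second_job_h: float) -> bool:
    """True if a second job started right after the first one would be
    killed at the 18-hour cut-off, wasting all of its CPU time."""
    if first_job_h >= MIN_LIFETIME_H:
        return False                      # no second job is started at all
    return first_job_h + second_job_h > MAX_LIFETIME_H

# Example from the post: the first 20000-event job takes ~11 h, a second
# identical job is started (the VM is not yet 12 h old) and would need
# until hour ~22, so it is killed at the 18-hour mark.
print(second_job_is_wasted(11.0, 11.0))   # True
```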
Joined: 15 Jun 08 Posts: 2549 Credit: 255,447,187 RAC: 66,454

Are you sure that each VM ran 3 of those large jobs? To me it looks like each VM ran 1 job. 20000 events within 18500 (CPU) seconds seems to be a normal value.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=342364179
https://lhcathome.cern.ch/lhcathome/result.php?resultid=342364448
https://lhcathome.cern.ch/lhcathome/result.php?resultid=342364474
The reason why my i7-3770K looks "slow" (walltime) is mainly a deliberate overload controlled by systemd/cgroup slices, plus many other tasks running concurrently.
Joined: 14 Jan 10 Posts: 1429 Credit: 9,539,339 RAC: 5,065

> Are you sure that each VM ran 3 of those large jobs?

Yeah, I'm sure. I had three consoles open on my laptop to watch the progress of the 3 remote VMs and wanted to know how many events there are in one job, because Ivan was talking about "2"- and "4"-hour jobs. That way I could see that on all three VMs a new cmsRun started twice.

I changed my 'old' i7-2600 into an i9, as you may have discovered ;) Apart from that, of the 20 threads (10 cores) I had only 3 in use, except for the first hours, when I had 4 threads running for COVID: https://covid.si/en/stats/ Client ID: 7d61f743-caf6-ba70-c394-947e33ae4064 (third page). The rest were idle most of the time.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,447,187 RAC: 66,454

Would you mind running another one and comparing the #processed records on console 2 against the runtime shown for cmsRun on console 3? This should give the processing rate in records/minute (after a while, to be more accurate).

> of the 20 threads (10 cores) I had only 3 in use, except for the first hours. ... The rest were idle most of the time.

This would, at least partly, explain extremely high processing rates.
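A minimal sketch of that calculation, assuming the cmsRun runtime on console 3 is shown as HH:MM:SS (the function name is only illustrative):

```python
# records = "#processed records" shown on console 2
# runtime = cmsRun runtime shown on console 3, formatted as "HH:MM:SS"
def records_per_minute(records: int, runtime: str) -> float:
    h, m, s = (int(x) for x in runtime.split(":"))
    return records / (h * 60 + m + s / 60)

# e.g. the 20000 events in 18500 CPU seconds mentioned above:
print(records_per_minute(20000, "5:08:20"))   # ~65 records/minute
```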
Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726

After 6 hours, 20k is done. I'm seeing this info in the MasterLog every 5 minutes:

02/03/22 15:39:59 (pid:16256) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
02/03/22 15:39:59 (pid:16256) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
02/03/22 15:39:59 (pid:16256) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
Joined: 14 Jan 10 Posts: 1429 Credit: 9,539,339 RAC: 5,065

> Would you mind running another one and comparing the #processed records on console 2 against the runtime shown for cmsRun on console 3?

You seem hard to convince ;) From the finished_1.log:

Begin processing the 1st record. Run 1, Event 59740001, LumiSection 119481 on stream 0 at 03-Feb-2022 15:18:56.085 CET
Begin processing the 10001st record. Run 1, Event 59750001, LumiSection 119501 on stream 0 at 03-Feb-2022 16:07:05.436 CET
Begin processing the 20000th record. Run 1, Event 59760000, LumiSection 119520 on stream 0 at 03-Feb-2022 16:54:06.629 CET

CPU time from cmsRun: 93:34:19
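As a cross-check, the same rate can be read off the two quoted timestamps; a small sketch, assuming the timestamp format printed in finished_1.log (the CET suffix is omitted here):

```python
from datetime import datetime

# Processing rate between the 1st and the 20000th record of finished_1.log.
FMT = "%d-%b-%Y %H:%M:%S.%f"
t_first = datetime.strptime("03-Feb-2022 15:18:56.085", FMT)
t_last = datetime.strptime("03-Feb-2022 16:54:06.629", FMT)

elapsed_min = (t_last - t_first).total_seconds() / 60
records = 20000 - 1   # records processed between the two quoted lines
print(f"{records / elapsed_min:.0f} records/minute")   # ~210 records/minute
```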
Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726

process id is 391 status is 0
Now at 7 hours, including the last 15 min. idle. Waiting now for the end or for a second task.
Joined: 14 Jan 10 Posts: 1429 Credit: 9,539,339 RAC: 5,065

@Ivan: The contents of wmagentJob.log only show the running job; the previous job's info is overwritten. Could you create wmagentJob_1.log, wmagentJob_2.log, etc., if you stick to running more than one job in one VM lifetime? Or just append the info of the 2nd, 3rd, etc. job to a single wmagentJob.log.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,447,187 RAC: 66,454

> You seem hard to convince ;)

Just impressed ;-D
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> It appears that the tasks currently in the queue are running jobs with 20000 records (before: 10000) and that the tasks shut down after 1 job (before: more than 1).

It's a result of the various limits, I think. Thanks for the feedback.
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> After 6 hours, 20k is done.

I've been told that the CONFIGURATION PROBLEM messages are not a problem per se and don't affect the jobs. Still, it'd be nice to be rid of them...
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> @Ivan: The contents of wmagentJob.log only show the running job; the previous job's info is overwritten.

I'll ask Laurence; that's his area of responsibility.
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

The graph of average job time over several days is instructive:

a) At around 12:00 on 31/1 we ran out of jobs as we drained the queues for the WMAgent update. Before that, the 10,000-event jobs were showing an average of just over 2 hours. As the faster machines ran out of jobs, the average job time increased as slower machines reported in with longer run times. By the time the update was started, the remaining few very slow machines were averaging over 6 hours/job.

b) On 1/2 I ran some large jobs from a workflow we are investigating, but their run times were too short for our purposes -- 5 minutes to produce 60 MB of output.

c) I then started "four-hour" jobs of 20,000 events, which began running at about 20:00 on 1/2. You can see that the fastest machines reported run times of around two hours, but the average then crept slowly up to four hours as the slower machines started reporting in.

d) When I submitted 40,000-event jobs, they didn't run because of a time-to-run mismatch in condor (12:00 on 4/2). Again, as the queue drained, the average time started increasing as the fastest machines dropped out of the reporting.

e) When I submitted 10,000-event jobs again on 5/2, the average quickly stabilised to the long-term value of two hours.

The graph below shows a histogram of run-times for a batch of 10,000-event jobs. We need to fold these observations into our configurations as we move forward.
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> @Ivan: The contents of wmagentJob.log only show the running job; the previous job's info is overwritten.

I'm told this will be implemented tomorrow.
Joined: 14 Jan 10 Posts: 1429 Credit: 9,539,339 RAC: 5,065

> @Ivan: The contents of wmagentJob.log only show the running job; the previous job's info is overwritten.

Thanks, Ivan (and Laurence).

Are you considering moving to "4-hour" jobs again in the near future, or even to the CMS standard of 40,000 events (an "8-hour job")? If the latter: have a talk with Laurence about whether it's somehow possible to get rid of the VM minimum lifetime of 12 hours and the maximum of 18 hours. In my opinion, when running jobs of 8 hours on average, it would be better to run only one job per BOINC task. Maybe it's even possible to send the CMS job to the client together with the BOINC task, so the virtual machine no longer has to request the job itself, because it would already be in the shared directory.
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> Are you considering moving to "4-hour" jobs again in the near future, or even to the CMS standard of 40,000 events (an "8-hour job")?

Not in the immediate future. It was an exploratory exercise in anticipation of a request from formal production. If/when that happens, we will surely make adjustments such as you suggest to better cope with the varied volunteer resources. As I think you know, we do run benchmarks at the start of each task, but there's no way that I'm aware of to feed that back -- ultimately to WMAgent -- to tailor the number of events per job to the power of the machine at hand.
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> @Ivan: The contents of wmagentJob.log only show the running job; the previous job's info is overwritten.

Has anyone noticed whether this is working? I thought I'd left enough time for Laurence's implementation to trickle down to CVMFS, but my jobs still have only the current log in the web interface. I'm letting the project run overnight to see if the next task still shows the same behavior.
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> @Ivan: The contents of wmagentJob.log only show the running job; the previous job's info is overwritten.

No, it's not working for me. It looks like the log is deleted before the copy command is executed. I guess the logic needs a re-think.
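A hedged sketch of one possible re-think, assuming a wrapper step that runs when a job finishes; the directory, file names, and function below are hypothetical, not the actual CMS wrapper code. The point is only the ordering: preserve the old log before the next job removes or truncates it.

```python
import shutil
from pathlib import Path

# Hypothetical location of the job log inside the VM; the real path used by
# the CMS wrapper is not shown in this thread.
LOG_DIR = Path("/tmp/cms-job")

def preserve_log(job_index: int) -> None:
    """Copy the finished job's log to a numbered file, e.g. wmagentJob_1.log,
    *before* the next job is launched and the original file disappears."""
    src = LOG_DIR / "wmagentJob.log"
    if src.exists():
        shutil.copy2(src, LOG_DIR / f"wmagentJob_{job_index}.log")

# Usage sketch: call preserve_log(n) when job n finishes, then start job n+1.
```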
Joined: 29 Aug 05 Posts: 1065 Credit: 7,934,196 RAC: 14,364

> @Ivan: The contents of wmagentJob.log only show the running job; the previous job's info is overwritten.

This will be debugged when time allows.