Message boards : CMS Application : EXIT_NO_SUB_TASKS

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46222 - Posted: 10 Feb 2022, 22:20:41 UTC - in response to Message 46220.  

Hi Ivan,

I thought this was set by Laurence? The runtime should be 12 h, but there is a timeout at 16 h so they don't run forever if there is an issue?

The BOINC task timeout is set at 18 hours, last time I was aware. This is different; it's actually to do with the "glidein" that asks for jobs from condor. The relevant criterion is that (16*3600) < GlideInTimeToDie(seconds from epoch) - CurrentTime(seconds). I haven't examined it in detail, but I suspect that the glidein lifetime is set to 16 hours, but by the time the matchmaker makes its test, 0.02 hours have elapsed, so the criterion fails. In any event, I've rowed back on longer jobs due to the problems they caused, as well as some dissension from the volunteers here.
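
For illustration, here's how I read that criterion in a quick Python sketch (the names and numbers are my assumptions, not the actual matchmaker code):

    import time

    HEADROOM = 16 * 3600  # seconds the matchmaker requires to remain

    def matches(glidein_time_to_die, now):
        # job matches only if (16*3600) < GlideInTimeToDie - CurrentTime
        return HEADROOM < glidein_time_to_die - now

    start = time.time()
    time_to_die = start + 16 * 3600            # glidein lifetime of exactly 16 h
    print(matches(time_to_die, start))         # False: 16 h remaining is not > 16 h
    print(matches(time_to_die, start + 72))    # False: 0.02 h later the shortfall only grows
    print(matches(start + 18 * 3600, start))   # True: an 18 h lifetime leaves headroom

So with a 16-hour lifetime the test can never pass once any time at all has elapsed.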
ID: 46222
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2029
Credit: 148,973,105
RAC: 121,433
Message 46223 - Posted: 11 Feb 2022, 6:11:02 UTC - in response to Message 46222.  

The BOINC task timeout is set at 18 hours, last time I was aware.

This is set by <job_duration>64800</job_duration> (64800 s) in CMS_2016_03_22.xml, i.e. nearly 6 years ago.
vboxwrapper enforces this limit no matter what the VM is doing at that moment, and regardless of how long and how often the VM was paused before.

Am I right that the Glidein timers don't expect a job to be paused?
I suspect the calculation Ivan mentioned just estimates whether a computer is fast enough to finish the next job within the Condor limit.

If the job setup (say) doubles the job runtime, the grace period at the end of a task may also need to be extended.
=> <job_duration> may need to be set higher.
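
A back-of-the-envelope sketch of that budget (the relationships are my assumptions, just to make the point concrete):

    glidein_lifetime = 16 * 3600     # assumed Condor-side limit, seconds
    job_runtime      = 12 * 3600     # nominal job length
    job_duration     = 64800         # current vboxwrapper limit (18 h)
    grace_period     = job_duration - glidein_lifetime   # 2 h today

    doubled_runtime = 2 * job_runtime            # the "say, doubled" case
    needed_duration = doubled_runtime + grace_period
    print(needed_duration / 3600)                # 26.0 -> well above today's 18 h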



Another point:
A couple of "top computers" (older CPUs!) appear to run complete CMS tasks in well under 1 h (~40 min CPU time).
This seems unrealistic compared to CP's brand-new box, which needs 1.5 h.

The logs don't show any error at the level we can see.
Their tasks get validated because Glidein reports "0" (success).
I suspect they either do not get a job at all or the jobs fail due to an error that is not correctly reported back via Glidein.

I don't want to blame a user for that, because it's unlikely the error is caused client-side.
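
A hypothetical plausibility check along these lines (the CPU-time floor is my guess, not a project number):

    MIN_PLAUSIBLE_CPU = 3600.0   # seconds; a real CMS job seems to need >= ~1.5 h

    def looks_empty(glidein_exit_code, cpu_seconds):
        # "success" with far too little CPU time suggests no job ever ran
        return glidein_exit_code == 0 and cpu_seconds < MIN_PLAUSIBLE_CPU

    print(looks_empty(0, 40 * 60))      # True  -> ~40 min "success" is suspicious
    print(looks_empty(0, 1.5 * 3600))   # False -> consistent with a genuine job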
ID: 46223
maeax

Joined: 2 May 07
Posts: 1590
Credit: 67,703,543
RAC: 240,055
Message 46342 - Posted: 24 Feb 2022, 13:47:11 UTC

At the moment I'm seeing my own PCs get tasks with no job inside.
ID: 46342
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46349 - Posted: 24 Feb 2022, 15:52:57 UTC - in response to Message 46223.  

The BOINC task timeout is set at 18 hours, last time I was aware.

This is set by <job_duration>64800</job_duration> (64800 s) in CMS_2016_03_22.xml, i.e. nearly 6 years ago.
vboxwrapper enforces this limit no matter what the VM is doing at that moment, and regardless of how long and how often the VM was paused before.

Am I right that the Glidein timers don't expect a job to be paused?
I suspect the calculation Ivan mentioned just estimates whether a computer is fast enough to finish the next job within the Condor limit.

If the job setup (say) doubles the job runtime, the grace period at the end of a task may also need to be extended.
=> <job_duration> may need to be set higher.



Another point:
A couple of "top computers" (older CPUs!) appear to run complete CMS tasks in well under 1 h (~40 min CPU time).
This seems unrealistic compared to CP's brand-new box, which needs 1.5 h.

The logs don't show any error at the level we can see.
Their tasks get validated because Glidein reports "0" (success).
I suspect they either do not get a job at all or the jobs fail due to an error that is not correctly reported back via Glidein.

I don't want to blame a user for that, because it's unlikely the error is caused client-side.

I'm definitely not an expert on the glideins, but a task can be paused and restarted within a certain time window and the job picks up again. Glidein notices if the VM has been paused:
2022-02-18 14:46:29 (2721624): Guest Log: 00:19:40.614140 timesync vgsvcTimeSyncWorker: Radical host time change: 2 796 084 000 000ns (HostNow=1 645 195 589 347 000 000 ns HostLast=1 645 192 793 263 000 000 ns)
2022-02-18 14:46:39 (2721624): Guest Log: 00:19:50.627041 timesync vgsvcTimeSyncWorker: Radical guest time change: 2 796 051 296 000ns (GuestNow=1 645 195 599 362 737 000 ns GuestLast=1 645 192 803 311 441 000 ns fSetTimeLastLoop=true )
and the job successfully continues, but if the time change is very large then the glidein does shut down:
2022-02-21 14:42:18 (3406431): Guest Log: 00:51:11.631597 timesync vgsvcTimeSyncWorker: Radical host time change: 257 068 200 000 000ns (HostNow=1 645 454 538 482 000 000 ns HostLast=1 645 197 470 282 000 000 ns)
2022-02-21 14:42:28 (3406431): Guest Log: 00:51:21.631995 timesync vgsvcTimeSyncWorker: Radical guest time change: 257 068 115 130 000ns (GuestNow=1 645 454 548 482 391 000 ns GuestLast=1 645 197 480 367 261 000 ns fSetTimeLastLoop=true )
2022-02-21 14:47:29 (3406431): Guest Log: [INFO] glidein exited with return value 0.
2022-02-21 14:47:29 (3406431): Guest Log: [INFO] Shutting Down.
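
For scale, converting the "Radical host time change" deltas from those two logs to hours (quick sketch):

    for ns in (2_796_084_000_000, 257_068_200_000_000):
        print(ns / 1e9 / 3600, "hours")   # ~0.78 h (survived) vs ~71.4 h (shut down)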

We think a problem is that condor has a 20-minute time-out on the connection to the VM. After that time, it resubmits the job to the pool. If the job is still unclaimed when the VM restarts, things seem to run OK; but if the job has started running on another machine in the interim, that seems to lead to orphaned jobs, or to VMs idling until the BOINC task times out.
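
Sketching that logic with assumed timings (the 20-minute figure from above; everything else illustrative):

    DISCONNECT_LIMIT = 20 * 60   # seconds condor waits before resubmitting

    def outcome(pause_seconds, reclaimed_elsewhere):
        if pause_seconds <= DISCONNECT_LIMIT:
            return "job continues on the same VM"
        if not reclaimed_elsewhere:
            return "job still unclaimed: resumed VM picks it up, runs OK"
        return "orphaned job, or VM idles until the BOINC task times out"

    print(outcome(15 * 60, False))   # short pause: fine
    print(outcome(45 * 60, False))   # long pause, job unclaimed: fine
    print(outcome(45 * 60, True))    # long pause, job taken elsewhere: the bad case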
As for machines reporting suspiciously short run-times, we think those are mainly jobs with failures. When Laurence is back from holiday we will ask him to try to pass the condor exit code back to the BOINC task and have the task fail (or at least terminate immediately) if it indicates an error. This hasn't been a priority until now, and there are arguments against it, but when we get large clusters of failures such as lately, we probably need to ensure that the volunteer doesn't just continue, blithely ignorant that his/her machine is not doing useful work.
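
A sketch of the kind of check we have in mind (hypothetical post-processing, nothing Laurence has implemented; the log format is taken from the excerpts above):

    import re

    GLIDEIN_RC = re.compile(r"glidein exited with return value (\d+)")

    def glidein_return_code(stderr_text):
        m = GLIDEIN_RC.search(stderr_text)
        return int(m.group(1)) if m else None

    rc = glidein_return_code("... Guest Log: [INFO] glidein exited with return value 0.")
    if rc is None:
        print("no glidein result reported")   # e.g. VM idled until the BOINC timeout
    elif rc != 0:
        print("glidein failed; fail the task, or at least terminate it immediately")
    else:
        print("glidein reported success")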
ID: 46349
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46618 - Posted: 12 Apr 2022, 16:02:44 UTC

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.
ID: 46618
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46625 - Posted: 13 Apr 2022, 18:31:52 UTC - in response to Message 46618.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.
ID: 46625
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46626 - Posted: 13 Apr 2022, 19:28:59 UTC - in response to Message 46625.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.
ID: 46626
NOGOOD

Joined: 18 Nov 17
Posts: 118
Credit: 41,026,294
RAC: 7,776
Message 46627 - Posted: 13 Apr 2022, 19:41:40 UTC - in response to Message 46626.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.


We believe in you and look forward to continuing our cooperation :-)
ID: 46627
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46647 - Posted: 17 Apr 2022, 13:33:44 UTC - in response to Message 46627.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.


We believe in you and look forward to continuing our cooperation :-)

Thanks. Unfortunately there is a problem at CERN this (holiday) weekend, and I've not been able to inject a new batch of jobs. Unless something changes soon we will run out of work in about six hours. Sorry 'bout that, but I don't see anything I can do to change it. I'll try to raise a trouble ticket.
ID: 46647
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46648 - Posted: 17 Apr 2022, 13:53:05 UTC - in response to Message 46647.  

Ah, it looks like it might be a certificate expiry problem:
    ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)
    2022-04-17 15:50:23,368:INFO:inject-test-wfs: TC_SLC7.json request FAILED injection!
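
For anyone curious, here's a generic way to probe for this failure mode (standard-library Python; not our injection tooling):

    import socket, ssl

    def cert_ok(host, port=443):
        ctx = ssl.create_default_context()
        try:
            with socket.create_connection((host, port), timeout=10) as sock:
                with ctx.wrap_socket(sock, server_hostname=host):
                    return True
        except ssl.SSLCertVerificationError as err:
            print(host, "->", err.verify_message)   # e.g. 'certificate has expired'
            return False

    print(cert_ok("example.org"))

An expired certificate fails with the same CERTIFICATE_VERIFY_FAILED error as in the traceback above.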


ID: 46648
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46649 - Posted: 17 Apr 2022, 15:17:05 UTC - in response to Message 46647.  

https://cern.service-now.com/service-portal?id=ticket&table=incident&sys_id=b4c8b86787f24150eb3b33390cbb35a5&view=sp
ID: 46649
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46650 - Posted: 19 Apr 2022, 0:41:59 UTC

I've just succeeded in injecting a workflow. Jobs should be available again soon.
ID: 46650
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,906
RAC: 65
Message 46651 - Posted: 19 Apr 2022, 1:23:11 UTC - in response to Message 46650.  

We seem to have jobs running again: https://lhcathome.cern.ch/lhcathome/cms_job.php
ID: 46651