Message boards : CMS Application : EXIT_NO_SUB_TASKS

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46222 - Posted: 10 Feb 2022, 22:20:41 UTC - in response to Message 46220.  

Hi Ivan,

I thought this was set by Laurence? The runtime should be 12 h, but there is a timeout at 16 h so they don't run forever if there is an issue?

The BOINC task timeout is set at 18 hours, last time I was aware. This is different; it's actually to do with the "glidein" that asks for jobs from condor. The relevant criterion is that 16*3600 < GlideInTimeToDie - CurrentTime (both times in seconds since the epoch). I haven't examined it in detail, but I suspect that the glidein lifetime is set to 16 hours, and by the time the matchmaker makes its test, 0.02 hours have already elapsed, so the criterion fails. In any event, I've rowed back on longer jobs due to the problems they caused, as well as some dissension from the volunteers here.
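
In rough Python, the test I believe the matchmaker applies looks something like this (the attribute names are my guesses, not the real ClassAd expressions):

    import time

    REQUIRED_HEADROOM = 16 * 3600   # a job is assumed to need 16 h, in seconds

    def glidein_accepts_job(glide_in_time_to_die):
        # Match a job only if this glidein still has more than 16 h to live.
        remaining = glide_in_time_to_die - time.time()
        return REQUIRED_HEADROOM < remaining

If the glidein's own lifetime is also 16 hours, then by the time the test runs a couple of minutes have already elapsed, "remaining" is just under 16*3600, and the match can never succeed.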
ID: 46222
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2375
Credit: 221,710,130
RAC: 142,952
Message 46223 - Posted: 11 Feb 2022, 6:11:02 UTC - in response to Message 46222.  

The BOINC task timeout is set at 18 hours, last time I was aware.

This is set by <job_duration>64800</job_duration> (64800 s = 18 h) in CMS_2016_03_22.xml, i.e. nearly 6 years ago.
vboxwrapper enforces this limit no matter what the VM is doing at that moment, and regardless of how long or how often the VM was paused before.

Am I right that the Glidein timers don't expect a job to be paused?
I suspect the calculation Ivan mentioned just estimates whether a computer is fast enough to finish the next job within the Condor limit.

If the job setup (say) doubles the job runtime, the grace period at the end of a task may also need to be extended.
=> <job_duration> may need to be set higher.
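
A back-of-the-envelope sketch of that concern (the doubling factor is purely hypothetical):

    JOB_DURATION = 64800                 # 18 h, from CMS_2016_03_22.xml

    nominal_runtime = 12 * 3600          # the intended 12 h job
    setup_factor = 2.0                   # hypothetical: setup doubles the runtime
    grace = JOB_DURATION - setup_factor * nominal_runtime
    print(grace / 3600)                  # -6.0 h: the ceiling is hit 6 h too soon

With those made-up numbers vboxwrapper would kill the VM 6 hours before the job could finish, and the job's work would be lost.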



Another point:
A couple of "top computers" (older CPUs!) appear to complete CMS tasks in much less than 1 h (~40 min CPU time).
This seems unrealistic compared to CP's brand-new box, which needs 1.5 h.

The logs don't show any error at the level we can see.
Their tasks get validated because Glidein reports "0" (success).
I suspect they either do not get a job at all, or the jobs fail due to an error that is not correctly reported back via Glidein.

I don't want to blame a user for that, because it's unlikely the error is caused client-side.
ID: 46223
maeax

Joined: 2 May 07
Posts: 2066
Credit: 155,484,578
RAC: 165,038
Message 46342 - Posted: 24 Feb 2022, 13:47:11 UTC

At the moment I'm seeing my own PCs running tasks with no job inside.
ID: 46342
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46349 - Posted: 24 Feb 2022, 15:52:57 UTC - in response to Message 46223.  

The BOINC task timeout is set at 18 hours, last time I was aware.

This is set by <job_duration>64800</job_duration> (64800 s = 18 h) in CMS_2016_03_22.xml, i.e. nearly 6 years ago.
vboxwrapper enforces this limit no matter what the VM is doing at that moment, and regardless of how long or how often the VM was paused before.

Am I right that the Glidein timers don't expect a job to be paused?
I suspect the calculation Ivan mentioned just estimates whether a computer is fast enough to finish the next job within the Condor limit.

If the job setup (say) doubles the job runtime, the grace period at the end of a task may also need to be extended.
=> <job_duration> may need to be set higher.



Another point:
A couple of "top computers" (older CPUs!) appear to complete CMS tasks in much less than 1 h (~40 min CPU time).
This seems unrealistic compared to CP's brand-new box, which needs 1.5 h.

The logs don't show any error at the level we can see.
Their tasks get validated because Glidein reports "0" (success).
I suspect they either do not get a job at all, or the jobs fail due to an error that is not correctly reported back via Glidein.

I don't want to blame a user for that, because it's unlikely the error is caused client-side.

I'm definitely not an expert on the glideins, but a VM can be paused and restarted within a certain time window and the job picks up again. The glidein notices if the VM has been paused:
2022-02-18 14:46:29 (2721624): Guest Log: 00:19:40.614140 timesync vgsvcTimeSyncWorker: Radical host time change: 2 796 084 000 000ns (HostNow=1 645 195 589 347 000 000 ns HostLast=1 645 192 793 263 000 000 ns)
2022-02-18 14:46:39 (2721624): Guest Log: 00:19:50.627041 timesync vgsvcTimeSyncWorker: Radical guest time change: 2 796 051 296 000ns (GuestNow=1 645 195 599 362 737 000 ns GuestLast=1 645 192 803 311 441 000 ns fSetTimeLastLoop=true )
and the job successfully continues, but if the time change is very large (below, ~257,000 s, nearly three days, against ~2,800 s, about 47 minutes, in the first case) the glidein does shut down:
2022-02-21 14:42:18 (3406431): Guest Log: 00:51:11.631597 timesync vgsvcTimeSyncWorker: Radical host time change: 257 068 200 000 000ns (HostNow=1 645 454 538 482 000 000 ns HostLast=1 645 197 470 282 000 000 ns)
2022-02-21 14:42:28 (3406431): Guest Log: 00:51:21.631995 timesync vgsvcTimeSyncWorker: Radical guest time change: 257 068 115 130 000ns (GuestNow=1 645 454 548 482 391 000 ns GuestLast=1 645 197 480 367 261 000 ns fSetTimeLastLoop=true )
2022-02-21 14:47:29 (3406431): Guest Log: [INFO] glidein exited with return value 0.
2022-02-21 14:47:29 (3406431): Guest Log: [INFO] Shutting Down.

We think a problem is that condor has a 20-minute time-out on the connection to the VM. After that time, it resubmits the job to the pool. If the job is still unclaimed when the VM restarts, things seem to run OK; but if the job has started running on another machine in the interim, that seems to lead to orphaned jobs, or to VMs idling until the BOINC task times out.
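
As a toy model of the race (the 20 minutes is the only real number here; the function and its cases are just my reading of the behaviour):

    CONDOR_LEASE = 20 * 60   # seconds before condor re-submits the job

    def outcome_after_pause(pause_seconds, job_reclaimed_elsewhere):
        if pause_seconds <= CONDOR_LEASE:
            return "job carries on in the same VM"
        if not job_reclaimed_elsewhere:
            return "job still unclaimed: the restarted VM picks it up again"
        return "orphaned job, or a VM idling until the BOINC task times out"
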
As for machines reporting suspiciously short run-times, we think those are mainly jobs with failures. When Laurence is back from holiday we will ask him to try to pass the condor exit code back to the BOINC task and have the task fail (or at least terminate immediately) if it indicates an error. This hasn't been a priority until now, and there are arguments against it, but when we get large clusters of failures such as we've seen lately, we probably shouldn't let the volunteer carry on blithely ignorant that his/her machine is not doing useful work.
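
The sort of thing we have in mind, purely as a sketch (nothing like this hook exists today, and Laurence may well implement it differently):

    import sys

    def finish_boinc_task(glidein_exit_code):
        if glidein_exit_code != 0:
            # Fail the task (or at least end it immediately) so the
            # volunteer's machine stops doing useless work.
            sys.exit(glidein_exit_code)
        sys.exit(0)   # success: let the task validate as usual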
ID: 46349
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46618 - Posted: 12 Apr 2022, 16:02:44 UTC

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.
ID: 46618
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46625 - Posted: 13 Apr 2022, 18:31:52 UTC - in response to Message 46618.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.
ID: 46625
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46626 - Posted: 13 Apr 2022, 19:28:59 UTC - in response to Message 46625.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.
ID: 46626
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,138,199
RAC: 27,116
Message 46627 - Posted: 13 Apr 2022, 19:41:40 UTC - in response to Message 46626.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.


We believe in you and look forward to the opportunity to continue cooperation :-)
ID: 46627
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46647 - Posted: 17 Apr 2022, 13:33:44 UTC - in response to Message 46627.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.


We believe in you and look forward to the opportunity to continue cooperation :-)

Thanks. Unfortunately there is a problem at CERN this (holiday) weekend, and I've not been able to inject a new batch of jobs. Unless something changes soon we will run out of work in about six hours. Sorry 'bout that, but I don't see anything I can do to change it. I'll try to raise a trouble ticket.
ID: 46647
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46648 - Posted: 17 Apr 2022, 13:53:05 UTC - in response to Message 46647.  

Ah, it looks like it might be a certificate expiry problem:
    ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)
    2022-04-17 15:50:23,368:INFO:inject-test-wfs: TC_SLC7.json request FAILED injection!
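
For what it's worth, a quick stdlib-only check of a certificate's remaining lifetime can look like this sketch (illustrative only: it inspects a server certificate, whereas the failure above was our client proxy, and an already-expired certificate makes the handshake itself fail with CERTIFICATE_VERIFY_FAILED):

    import datetime, socket, ssl

    def cert_days_remaining(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = datetime.datetime.strptime(cert['notAfter'],
                                             '%b %d %H:%M:%S %Y %Z')
        return (expires - datetime.datetime.utcnow()).days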


ID: 46648
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46649 - Posted: 17 Apr 2022, 15:17:05 UTC - in response to Message 46647.  

https://cern.service-now.com/service-portal?id=ticket&table=incident&sys_id=b4c8b86787f24150eb3b33390cbb35a5&view=sp
ID: 46649
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46650 - Posted: 19 Apr 2022, 0:41:59 UTC

I've just succeeded in injecting a workflow. Jobs should be available again soon.
ID: 46650
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46651 - Posted: 19 Apr 2022, 1:23:11 UTC - in response to Message 46650.  

We seem to have jobs running again: https://lhcathome.cern.ch/lhcathome/cms_job.php
ID: 46651
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47248 - Posted: 13 Sep 2022, 11:00:07 UTC

Sorry, I had an off day yesterday and we started running out of jobs this morning. More are being created and should be in the pipeline soon.
ID: 47248
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47249 - Posted: 13 Sep 2022, 14:26:49 UTC

I'm afraid we are going to have a little more disruption, to accommodate a WMAgent upgrade. I don't have the full details to hand yet, but we'll try to get it all done before the weekend. Keep an eye on your machines and the job plots, and be ready to set No New Tasks if necessary.
ID: 47249
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47259 - Posted: 15 Sep 2022, 17:52:54 UTC - in response to Message 47249.  

I'm afraid we are going to have a little more disruption, to accommodate a WMAgent upgrade. I don't have the full details to hand yet, but we'll try to get it all done before the weekend. Keep an eye on your machines and the job plots, and be ready to set No New Tasks if necessary.

OK, jobs are starting to become available again. Sorry about the delay, but there are some things I can't control.
ID: 47259
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47262 - Posted: 17 Sep 2022, 12:13:40 UTC

Hmm, spoke too soon. Job creation stopped early this morning -- it may have affected other production machines for a while, but they are up again now. We aren't; our queues are empty.
Messages from WMAgent suggest a proxy certificate has expired, for us and for at least one other Agent on cmsweb-testbed! (Why does that always happen on a weekend?)
ID: 47262
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47266 - Posted: 19 Sep 2022, 11:27:38 UTC - in response to Message 47262.  

We're back. Someone waved the right chicken entrails and muttered the right curse, but they haven't owned up to it yet!
ID: 47266
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1112
Credit: 49,479,463
RAC: 6,461
Message 47268 - Posted: 19 Sep 2022, 22:43:53 UTC - in response to Message 47266.  



Thanks Ivan, I will fire up my chicken cookers later tonight.
ID: 47268
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47283 - Posted: 23 Sep 2022, 15:35:52 UTC - in response to Message 47266.  

We also had an outage due to a new Python library that was leaving file handles open until a limit was reached (I think; it illustrates the problems of using third-party code!). I think we are all fixed again now.
ID: 47283