Message boards : CMS Application : EXIT_NO_SUB_TASKS

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46222 - Posted: 10 Feb 2022, 22:20:41 UTC - in response to Message 46220.  

Hi Ivan,

I thought this was set by Laurence? The runtime should be 12 h, but there is a timeout at 16 h so they don't run forever if there is an issue?

The BOINC task timeout is set at 18 hours, last time I was aware. This is different; it's actually to do with the "glidein" that asks for jobs from condor. The relevant criterion is that 16*3600 < GlideInTimeToDie - CurrentTime (both times in seconds since the epoch). I haven't examined it in detail, but I suspect that the glidein lifetime is set to 16 hours, and by the time the matchmaker makes its test, 0.02 hours have already elapsed, so the criterion fails. In any event, I've rowed back on longer jobs due to the problems they caused, as well as some dissension from the volunteers here.
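
In rough Python, the test I believe the matchmaker applies looks something like this (the attribute names are my guesses, not the real ClassAd expressions):

    import time

    REQUIRED_HEADROOM = 16 * 3600   # a job is assumed to need 16 h, in seconds

    def glidein_accepts_job(glide_in_time_to_die):
        # Match a job only if this glidein still has more than 16 h to live.
        remaining = glide_in_time_to_die - time.time()
        return REQUIRED_HEADROOM < remaining

If the glidein's own lifetime is also 16 hours, then by the time the test runs a couple of minutes have already elapsed, "remaining" is just under 16*3600, and the match can never succeed.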
ID: 46222
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2375
Credit: 221,710,130
RAC: 142,952
Message 46223 - Posted: 11 Feb 2022, 6:11:02 UTC - in response to Message 46222.  

The BOINC task timeout is set at 18 hours, last time I was aware.

This is set by <job_duration>64800</job_duration> (64800 s = 18 h) in CMS_2016_03_22.xml, i.e. nearly 6 years ago.
vboxwrapper enforces this limit no matter what the VM is doing at that moment, and regardless of how long or how often the VM was paused before.

Am I right that the Glidein timers don't expect a job to be paused?
I suspect the calculation Ivan mentioned just estimates whether a computer is fast enough to finish the next job within the Condor limit.

If the job setup (say) doubles the job runtime, the grace period at the end of a task may also need to be extended.
=> <job_duration> may need to be set higher.
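
A back-of-the-envelope sketch of that concern (the doubling factor is purely hypothetical):

    JOB_DURATION = 64800                 # 18 h, from CMS_2016_03_22.xml

    nominal_runtime = 12 * 3600          # the intended 12 h job
    setup_factor = 2.0                   # hypothetical: setup doubles the runtime
    grace = JOB_DURATION - setup_factor * nominal_runtime
    print(grace / 3600)                  # -6.0 h: the ceiling is hit 6 h too soon

With those made-up numbers vboxwrapper would kill the VM 6 hours before the job could finish, and the job's work would be lost.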



Another point:
A couple of "top computers" (older CPUs!) appear to complete CMS tasks in much less than 1 h (~40 min CPU time).
This seems unrealistic compared to CP's brand-new box, which needs 1.5 h.

The logs don't show any error at the level we can see.
Their tasks get validated because Glidein reports "0" (success).
I suspect they either do not get a job at all, or the jobs fail due to an error that is not correctly reported back via Glidein.

I don't want to blame a user for that, because it's unlikely the error is caused client-side.
ID: 46223
maeax

Joined: 2 May 07
Posts: 2066
Credit: 155,484,578
RAC: 165,038
Message 46342 - Posted: 24 Feb 2022, 13:47:11 UTC

At the moment I'm seeing my own PCs running tasks with no job inside.
ID: 46342
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46349 - Posted: 24 Feb 2022, 15:52:57 UTC - in response to Message 46223.  

The BOINC task timeout is set at 18 hours, last time I was aware.

This is set by <job_duration>64800</job_duration> (64800 s = 18 h) in CMS_2016_03_22.xml, i.e. nearly 6 years ago.
vboxwrapper enforces this limit no matter what the VM is doing at that moment, and regardless of how long or how often the VM was paused before.

Am I right that the Glidein timers don't expect a job to be paused?
I suspect the calculation Ivan mentioned just estimates whether a computer is fast enough to finish the next job within the Condor limit.

If the job setup (say) doubles the job runtime, the grace period at the end of a task may also need to be extended.
=> <job_duration> may need to be set higher.



Another point:
A couple of "top computers" (older CPUs!) appear to complete CMS tasks in much less than 1 h (~40 min CPU time).
This seems unrealistic compared to CP's brand-new box, which needs 1.5 h.

The logs don't show any error at the level we can see.
Their tasks get validated because Glidein reports "0" (success).
I suspect they either do not get a job at all, or the jobs fail due to an error that is not correctly reported back via Glidein.

I don't want to blame a user for that, because it's unlikely the error is caused client-side.

I'm definitely not an expert on the glideins, but a VM can be paused and restarted within a certain time window and the job picks up again. The glidein notices if the VM has been paused:
2022-02-18 14:46:29 (2721624): Guest Log: 00:19:40.614140 timesync vgsvcTimeSyncWorker: Radical host time change: 2 796 084 000 000ns (HostNow=1 645 195 589 347 000 000 ns HostLast=1 645 192 793 263 000 000 ns)
2022-02-18 14:46:39 (2721624): Guest Log: 00:19:50.627041 timesync vgsvcTimeSyncWorker: Radical guest time change: 2 796 051 296 000ns (GuestNow=1 645 195 599 362 737 000 ns GuestLast=1 645 192 803 311 441 000 ns fSetTimeLastLoop=true )
and the job successfully continues, but if the time change is very large (below, ~257,000 s, nearly three days, against ~2,800 s, about 47 minutes, in the first case) the glidein does shut down:
2022-02-21 14:42:18 (3406431): Guest Log: 00:51:11.631597 timesync vgsvcTimeSyncWorker: Radical host time change: 257 068 200 000 000ns (HostNow=1 645 454 538 482 000 000 ns HostLast=1 645 197 470 282 000 000 ns)
2022-02-21 14:42:28 (3406431): Guest Log: 00:51:21.631995 timesync vgsvcTimeSyncWorker: Radical guest time change: 257 068 115 130 000ns (GuestNow=1 645 454 548 482 391 000 ns GuestLast=1 645 197 480 367 261 000 ns fSetTimeLastLoop=true )
2022-02-21 14:47:29 (3406431): Guest Log: [INFO] glidein exited with return value 0.
2022-02-21 14:47:29 (3406431): Guest Log: [INFO] Shutting Down.

We think a problem is that condor has a 20-minute time-out on the connection to the VM. After that time, it resubmits the job to the pool. If the job is still unclaimed when the VM restarts, things seem to run OK; but if the job has started running on another machine in the interim, that seems to lead to orphaned jobs, or to VMs idling until the BOINC task times out.
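
As a toy model of the race (the 20 minutes is the only real number here; the function and its cases are just my reading of the behaviour):

    CONDOR_LEASE = 20 * 60   # seconds before condor re-submits the job

    def outcome_after_pause(pause_seconds, job_reclaimed_elsewhere):
        if pause_seconds <= CONDOR_LEASE:
            return "job carries on in the same VM"
        if not job_reclaimed_elsewhere:
            return "job still unclaimed: the restarted VM picks it up again"
        return "orphaned job, or a VM idling until the BOINC task times out"
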
As for machines reporting suspiciously short run-times, we think those are mainly jobs with failures. When Laurence is back from holiday we will ask him to try to pass the condor exit code back to the BOINC task and have the task fail (or at least terminate immediately) if it indicates an error. This hasn't been a priority until now, and there are arguments against it, but when we get large clusters of failures such as we've seen lately, we probably shouldn't let the volunteer carry on blithely ignorant that his/her machine is not doing useful work.
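
The sort of thing we have in mind, purely as a sketch (nothing like this hook exists today, and Laurence may well implement it differently):

    import sys

    def finish_boinc_task(glidein_exit_code):
        if glidein_exit_code != 0:
            # Fail the task (or at least end it immediately) so the
            # volunteer's machine stops doing useless work.
            sys.exit(glidein_exit_code)
        sys.exit(0)   # success: let the task validate as usual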
ID: 46349
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46618 - Posted: 12 Apr 2022, 16:02:44 UTC

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.
ID: 46618
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46625 - Posted: 13 Apr 2022, 18:31:52 UTC - in response to Message 46618.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.
ID: 46625
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46626 - Posted: 13 Apr 2022, 19:28:59 UTC - in response to Message 46625.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.
ID: 46626
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,138,199
RAC: 27,116
Message 46627 - Posted: 13 Apr 2022, 19:41:40 UTC - in response to Message 46626.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.


We believe in you and look forward to the opportunity to continue cooperation :-)
ID: 46627
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46647 - Posted: 17 Apr 2022, 13:33:44 UTC - in response to Message 46627.  

I'm draining the job queue in anticipation of a WMAgent upgrade tomorrow. Jobs will probably start to run out around 1200 GMT, so set No New Tasks late tonight or early in the morning.

The update is done, and I've injected a new workflow. With any luck new tasks/jobs will be available within the next hour or so.

...and... we're off again! Thanks everyone.


We believe in you and look forward to the opportunity to continue cooperation :-)

Thanks. Unfortunately there is a problem at CERN this (holiday) weekend, and I've not been able to inject a new batch of jobs. Unless something changes soon we will run out of work in about six hours. Sorry 'bout that, but I don't see anything I can do to change it. I'll try to raise a trouble ticket.
ID: 46647
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46648 - Posted: 17 Apr 2022, 13:53:05 UTC - in response to Message 46647.  

Ah, it looks like it might be a certificate expiry problem:
    ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)
    2022-04-17 15:50:23,368:INFO:inject-test-wfs: TC_SLC7.json request FAILED injection!
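
For what it's worth, a quick stdlib-only check of a certificate's remaining lifetime can look like this sketch (illustrative only: it inspects a server certificate, whereas the failure above was our client proxy, and an already-expired certificate makes the handshake itself fail with CERTIFICATE_VERIFY_FAILED):

    import datetime, socket, ssl

    def cert_days_remaining(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = datetime.datetime.strptime(cert['notAfter'],
                                             '%b %d %H:%M:%S %Y %Z')
        return (expires - datetime.datetime.utcnow()).days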


ID: 46648
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46649 - Posted: 17 Apr 2022, 15:17:05 UTC - in response to Message 46647.  

https://cern.service-now.com/service-portal?id=ticket&table=incident&sys_id=b4c8b86787f24150eb3b33390cbb35a5&view=sp
ID: 46649
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46650 - Posted: 19 Apr 2022, 0:41:59 UTC

I've just succeeded in injecting a workflow. Jobs should be available again soon.
ID: 46650
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 46651 - Posted: 19 Apr 2022, 1:23:11 UTC - in response to Message 46650.  

We seem to have jobs running again: https://lhcathome.cern.ch/lhcathome/cms_job.php
ID: 46651
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47248 - Posted: 13 Sep 2022, 11:00:07 UTC

Sorry, I had an off day yesterday and we started running out of jobs this morning. More are being created and should be in the pipeline soon.
ID: 47248
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47249 - Posted: 13 Sep 2022, 14:26:49 UTC

I'm afraid we are going to have a little more disruption, to accommodate a WMAgent upgrade. I don't have the full details to hand yet, but we'll try to get it all done before the weekend. Keep an eye on your machines and the job plots, and be ready to set No New Tasks if necessary.
ID: 47249
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47259 - Posted: 15 Sep 2022, 17:52:54 UTC - in response to Message 47249.  

I'm afraid we are going to have a little more disruption, to accommodate a WMAgent upgrade. I don't have the full details to hand yet, but we'll try to get it all done before the weekend. Keep an eye on your machines and the job plots, and be ready to set No New Tasks if necessary.

OK, jobs are starting to become available again. Sorry about the delay, but there are some things I can't control.
ID: 47259
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47262 - Posted: 17 Sep 2022, 12:13:40 UTC

Hmm, spoke too soon. Job creation stopped early this morning -- it may have affected other production machines for a while, but they are up again now. We aren't; our queues are empty.
Messages from WMAgent suggest a proxy certificate has expired, for us and for at least one other Agent on cmsweb-testbed! (Why does that always happen on a weekend?)
ID: 47262
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47266 - Posted: 19 Sep 2022, 11:27:38 UTC - in response to Message 47262.  

We're back. Someone waved the right chicken entrails and muttered the right curse, but they haven't owned up to it yet!
ID: 47266
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1112
Credit: 49,479,463
RAC: 6,461
Message 47268 - Posted: 19 Sep 2022, 22:43:53 UTC - in response to Message 47266.  



Thanks Ivan, I will fire up my chicken cookers later tonight.
ID: 47268
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 47283 - Posted: 23 Sep 2022, 15:35:52 UTC - in response to Message 47266.  

We also had an outage due to a new Python library that was leaving file handles open until a limit was reached (I think; it illustrates the problems of using third-party code!). I think we are all fixed again now.
ID: 47283