Message boards : CMS Application : EXIT_NO_SUB_TASKS

computezrmle
Joined: 15 Jun 08
Posts: 1133
Credit: 55,660,558
RAC: 105,439
Message 39425 - Posted: 23 Jul 2019, 13:21:07 UTC

Situation after the hypervisor reboot.

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5087&postid=39424

Tasks are still failing with EXIT_NO_SUB_TASKS.
Looks like not only my hosts are affected.
ID: 39425
gyllic

Joined: 9 Dec 14
Posts: 201
Credit: 2,500,279
RAC: 651
Message 39426 - Posted: 23 Jul 2019, 22:09:16 UTC - in response to Message 39425.  

Looks like not only my hosts are affected.
Indeed, same here.
ID: 39426
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39427 - Posted: 24 Jul 2019, 12:23:35 UTC - in response to Message 39425.  

Situation after the hypervisor reboot.

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5087&postid=39424

Tasks are still failing with EXIT_NO_SUB_TASKS.
Looks like not only my hosts are affected.

Yes, we are not picking up jobs from condor, even though plenty are pending. On the one hand:
2-3846-14590.2-3846-14590: Run analysis summary of 1 jobs.
    1 (100.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    1 have job requirements that match this slot.
but on the other:
179751.002:  Run analysis summary ignoring user priority.  Of 1 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
I've put out a request for help.
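
For readers comparing the two summaries above, here is a minimal Python sketch (not part of the original post) that tallies the second one. The sample text is copied from the post; the function names are hypothetical, and reading the result as "one machine that fits none of the listed categories" is only an interpretation of the numbers shown, not a diagnosis.

```python
import re

# Sample text copied from the post above (analysis for job 179751.002).
ANALYSIS = """\
179751.002:  Run analysis summary ignoring user priority.  Of 1 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
"""


def parse_summary(text):
    """Return (total_machines, {category: count}) from a summary like the one above."""
    total_match = re.search(r"Of (\d+) machines", text)
    total = int(total_match.group(1)) if total_match else None
    counts = {}
    for line in text.splitlines():
        # Category lines are indented: "<count> <description>".
        m = re.match(r"\s+(\d+) (.+)", line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return total, counts


def unaccounted(total, counts):
    """Machines that appear in the total but in none of the listed categories."""
    return total - sum(counts.values())


total, counts = parse_summary(ANALYSIS)
print(total, counts["are available to run your job"], unaccounted(total, counts))
```

With the text above, the one machine in the total is not counted under any category, which is the mismatch the post points at.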
ID: 39427
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 561
Credit: 349,074,620
RAC: 535,630
Message 39460 - Posted: 29 Jul 2019, 19:55:44 UTC

Any news? I'm still seeing plenty of failures.
ID: 39460
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39479 - Posted: 1 Aug 2019, 16:13:05 UTC - in response to Message 39460.  

OK, Federica managed to find the server which needed to be rebooted and jobs are starting to flow again. Thanks for your patience.
ID: 39479
gyllic

Joined: 9 Dec 14
Posts: 201
Credit: 2,500,279
RAC: 651
Message 39480 - Posted: 1 Aug 2019, 21:06:42 UTC - in response to Message 39479.  

OK, Federica managed to find the server which needed to be rebooted and jobs are starting to flow again. Thanks for your patience.
Looking good so far!
ID: 39480
computezrmle
Joined: 15 Jun 08
Posts: 1133
Credit: 55,660,558
RAC: 105,439
Message 39481 - Posted: 2 Aug 2019, 7:15:13 UTC - in response to Message 39479.  

Yes, works fine since yesterday evening.
Thanks.
ID: 39481
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39488 - Posted: 3 Aug 2019, 9:50:36 UTC
Last modified: 3 Aug 2019, 9:51:40 UTC

Just when you think it's sailing along OK... System Infrastructure want to make some changes to our submission scheme, in preparation for when they take over job submission. So, we need to drain the queue. This will take, I think, several days at the current rate. Keep an eye on your task exit status and be ready to set No New Tasks when you see them failing. I will, of course, post a warning if I can but I only have Internet access at work at the moment.
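
As an illustration of the "watch the exit status, then set No New Tasks" advice, here is a minimal Python sketch (not part of the original post). Assumptions: the project URL is inferred from the forum links in this thread, the threshold of three consecutive failures is arbitrary, and the helper names are hypothetical. The `boinccmd --project <URL> nomorework` call stops new work for the whole project, not just the CMS application.

```python
import subprocess

# Assumption: project URL inferred from the forum links in this thread.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"
EXIT_NO_SUB_TASKS = 207  # 0xCF, the exit status this thread is about


def should_set_nnt(recent_exit_codes, threshold=3):
    """True if the last `threshold` tasks all ended with EXIT_NO_SUB_TASKS."""
    tail = recent_exit_codes[-threshold:]
    return len(tail) == threshold and all(c == EXIT_NO_SUB_TASKS for c in tail)


def nnt_command(project_url=PROJECT_URL):
    """Build the boinccmd call that stops new work for one project."""
    return ["boinccmd", "--project", project_url, "nomorework"]


def set_nnt(dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = nnt_command()
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    codes = [0, 0, 207, 207, 207]  # hypothetical exit codes, oldest first
    if should_set_nnt(codes):
        print("Would run:", " ".join(set_nnt(dry_run=True)))
```

Restricting only CMS, rather than the whole project, is normally done in the project preferences on the website, so treat this as a coarse switch.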
ID: 39488
computezrmle
Joined: 15 Jun 08
Posts: 1133
Credit: 55,660,558
RAC: 105,439
Message 39489 - Posted: 3 Aug 2019, 10:18:29 UTC - in response to Message 39488.  

System Infrastructure want to make some changes ...

Will this affect settings on the volunteers' side, e.g. server names, firewall ports, etc.?
ID: 39489
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39495 - Posted: 5 Aug 2019, 7:54:43 UTC - in response to Message 39489.  

System Infrastructure want to make some changes ...

Will this affect settings on the volunteers' side, e.g. server names, firewall ports, etc.?

No, just how jobs are apportioned to T3_CH_Volunteer, as far as I know. The condor server and collector will remain the same.
ID: 39495
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39499 - Posted: 6 Aug 2019, 8:46:47 UTC - in response to Message 39488.  

Just when you think it's sailing along OK... System Infrastructure want to make some changes to our submission scheme, in preparation for when they take over job submission. So, we need to drain the queue. This will take, I think, several days at the current rate. Keep an eye on your task exit status and be ready to set No New Tasks when you see them failing. I will, of course, post a warning if I can but I only have Internet access at work at the moment.

I cranked up two of my servers to use the full 40 cores... There are 1700 jobs still to finish and we're retiring about 60/hr so I make that 30 hrs to go; WMStats estimates 22 hours.
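
The back-of-envelope estimate above can be reproduced with a short Python sketch (not part of the original post). It assumes a constant retirement rate, which is a simplification; the function name is hypothetical.

```python
def hours_to_drain(jobs_remaining, jobs_per_hour):
    """Naive drain time, assuming the retirement rate stays constant."""
    if jobs_per_hour <= 0:
        raise ValueError("jobs_per_hour must be positive")
    return jobs_remaining / jobs_per_hour


# Figures quoted in the post: 1700 jobs left, about 60 retired per hour.
estimate = hours_to_drain(1700, 60)
print(f"{estimate:.1f} h")
```

This gives a little over 28 hours, which the post rounds to 30. The lower WMStats figure presumably rests on a different rate or method; the thread does not say.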
ID: 39499
computezrmle
Joined: 15 Jun 08
Posts: 1133
Credit: 55,660,558
RAC: 105,439
Message 39505 - Posted: 7 Aug 2019, 7:29:57 UTC

Looks like the subtask queue is empty.
Time to set CMS to NNT and wait for Ivan's go-ahead to reactivate it.
ID: 39505
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39506 - Posted: 7 Aug 2019, 8:16:46 UTC - in response to Message 39505.  

Looks like the subtask queue is empty.
Time to set CMS to NNT and wait for Ivan's go-ahead to reactivate it.

Yes, please do set NNT. I'll alert CERN that the queue is almost empty (down to 99 running); they may pull the plug if some jobs persist.
ID: 39506
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39509 - Posted: 7 Aug 2019, 11:02:29 UTC

There's been a slight change in plans.
"Given that we do not need to redeploy the agent, but only kill jobs in condor and let them get recreated with the JobSubmitter/schedd changes, I think you can go ahead and submit another workflow to [keep] volunteers happy."
So, I'll continue to submit smaller batches and you can resume new tasks.
ID: 39509
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39575 - Posted: 11 Aug 2019, 10:27:02 UTC

It'd help now if you turned on No New Tasks for the next day or so, so that we can run down all the queues ready for an intervention tomorrow night. I'll check again tonight if I can, and top up the jobs if necessary.
ID: 39575
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39582 - Posted: 12 Aug 2019, 15:58:50 UTC

OK, the change has been done and I'm injecting a new (small) workflow. It'll take a little while to see if things have started up again successfully.
ID: 39582
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39583 - Posted: 12 Aug 2019, 16:53:12 UTC

Things look OK, jobs have started again. I'm waiting on word from the US as to whether it was successful (it seems so, but I'm not the expert).
ID: 39583
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39587 - Posted: 12 Aug 2019, 17:42:47 UTC - in response to Message 39583.  

Things look OK, jobs have started again. I'm waiting on word from the US as to whether it was successful (it seems so, but I'm not the expert).

They say that things are looking good. I've submitted more jobs and am off to my internet-deficient temporary digs. Hope things still look rosy tomorrow...
ID: 39587
Luigi R.
Joined: 7 Feb 14
Posts: 58
Credit: 4,179,534
RAC: 27,230
Message 39622 - Posted: 16 Aug 2019, 10:22:13 UTC

I have a CMS task that has presumably been idle for 12 hours.
I was going to abort it, but it finished while I was writing this post.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=240148745
ID: 39622
Erich56

Joined: 18 Dec 15
Posts: 1125
Credit: 21,656,646
RAC: 34,223
Message 39646 - Posted: 19 Aug 2019, 6:10:48 UTC - in response to Message 39622.  

For the past few hours, all CMS tasks have been failing with:

207 (0x000000CF) EXIT_NO_SUB_TASKS
ID: 39646


©2019 CERN