Message boards : CMS Application : EXIT_NO_SUB_TASKS

Erich56
Joined: 18 Dec 15
Posts: 1491
Credit: 37,599,908
RAC: 47,123
Message 40332 - Posted: 30 Oct 2019, 9:02:51 UTC - in response to Message 40331.  

Tasks do nothing and end with error after 15-20 minutes.
which is typical for "no sub-tasks available"
ID: 40332
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 817
Credit: 5,717,880
RAC: 330
Message 40333 - Posted: 30 Oct 2019, 9:37:47 UTC - in response to Message 40332.  

Tasks do nothing and end with error after 15-20 minutes.
which is typical for "no sub-tasks available"

Yes, sorry about that. We seem to have a problem with queued jobs: there are two queues for each batch submitted, "queued" and "pending", both of which I believe have a 2,000-job limit. When I submit a batch, recently of 10,000 jobs each, jobs are created if necessary up to that number, to fill the "queued" queue; contemporaneously, jobs are moved from "queued" to "pending" until it also is full. In recent weeks, apparently since there was an interruption to the DNS service at CERN, there seems to have been a disruption in taking jobs from the "pending" queue and allocating them to worker machines -- only a small fraction get sent.
What happened last night was that the current batch's "queued" queue drained, and job allocation from the "pending" jobs dropped off (there are currently 1200 pending and 13 running; in the previous batch 232 are still pending and 36 running!). At CMS IT's suggestion, I've been playing around with batch priority but that's had no perceptible effect. I'll have to just make sure that I submit new batches before the "queued" queue drains -- there's a new batch on its way so things should pick up again soon.
I'll contact CERN again and suggest they restart the Condor scheduler.
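
To make the queue behaviour above concrete, here is a toy sketch in Python. It is only a model of the description in this post, not the real WMAgent or HTCondor logic: the `Batch` class, the `step` method, and the `dispatch_per_step` parameter are all hypothetical names, and the 2,000-job limit is the figure I believe applies rather than a confirmed setting.

```python
from dataclasses import dataclass

# Assumed per-queue cap for both "queued" and "pending" (believed, not confirmed).
QUEUE_LIMIT = 2000


@dataclass
class Batch:
    """Toy model of one submitted batch; every name here is hypothetical."""
    uncreated: int      # jobs in the batch that have not been created yet
    queued: int = 0
    pending: int = 0
    running: int = 0

    def step(self, dispatch_per_step: int) -> None:
        # 1. Create jobs, if any remain, to top up the "queued" queue.
        created = min(self.uncreated, QUEUE_LIMIT - self.queued)
        self.uncreated -= created
        self.queued += created
        # 2. Move jobs from "queued" to "pending" until "pending" is full.
        moved = min(self.queued, QUEUE_LIMIT - self.pending)
        self.queued -= moved
        self.pending += moved
        # 3. Allocate pending jobs to worker machines (the part that misbehaves).
        sent = min(self.pending, dispatch_per_step)
        self.pending -= sent
        self.running += sent


def steps_until_queued_drains(batch_size: int, dispatch_per_step: int) -> int:
    """Count steps until nothing is left to create and "queued" is empty."""
    if dispatch_per_step <= 0:
        raise ValueError("dispatch_per_step must be positive or the loop never ends")
    batch = Batch(uncreated=batch_size)
    steps = 0
    while batch.uncreated > 0 or batch.queued > 0:
        batch.step(dispatch_per_step)
        steps += 1
    return steps
```

In this model, a lower `dispatch_per_step` only delays the moment the "queued" queue drains. Once it has drained, nothing refills "pending" until a new batch is submitted, which is why the next batch has to go in before that point.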
ID: 40333
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1863
Credit: 127,778,094
RAC: 92,067
Message 40340 - Posted: 30 Oct 2019, 14:16:31 UTC - in response to Message 40333.  

Thanks.
It's running fine again.
ID: 40340
rbpeake
Joined: 17 Sep 04
Posts: 78
Credit: 24,957,558
RAC: 5,576
Message 40345 - Posted: 31 Oct 2019, 15:31:00 UTC - in response to Message 40333.  

Tasks do nothing and end with error after 15-20 minutes.
which is typical for "no sub-tasks available"

Yes, sorry about that. We seem to have a problem with queued jobs: there are two queues for each batch submitted, "queued" and "pending", both of which I believe have a 2,000-job limit. When I submit a batch, recently of 10,000 jobs each, jobs are created if necessary up to that number, to fill the "queued" queue; contemporaneously, jobs are moved from "queued" to "pending" until it also is full. In recent weeks, apparently since there was an interruption to the DNS service at CERN, there seems to have been a disruption in taking jobs from the "pending" queue and allocating them to worker machines -- only a small fraction get sent.
What happened last night was that the current batch's "queued" queue drained, and job allocation from the "pending" jobs dropped off (there are currently 1200 pending and 13 running; in the previous batch 232 are still pending and 36 running!). At CMS IT's suggestion, I've been playing around with batch priority but that's had no perceptible effect. I'll have to just make sure that I submit new batches before the "queued" queue drains -- there's a new batch on its way so things should pick up again soon.
I'll contact CERN again and suggest they restart the Condor scheduler.

At what point will CERN be able to generate CMS jobs themselves, so you would not be required to submit batches?
Regards,
Bob P.
ID: 40345
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 817
Credit: 5,717,880
RAC: 330
Message 40349 - Posted: 1 Nov 2019, 10:23:22 UTC - in response to Message 40345.  

At what point will CERN be able to generate CMS jobs themselves, so you would not be required to submit batches?

That's an ever-moving target... Every time we think we're close, the goal-line gets moved again. Six months ago, I would have said now. Now... :-(
ID: 40349
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1863
Credit: 127,778,094
RAC: 92,067
Message 40377 - Posted: 8 Nov 2019, 8:11:45 UTC

All tasks are failing with EXIT_NO_SUB_TASKS since 6:30 UTC this morning.
Just to make you aware.
ID: 40377
NOGOOD
Joined: 18 Nov 17
Posts: 117
Credit: 38,007,470
RAC: 23,894
Message 40378 - Posted: 8 Nov 2019, 16:10:55 UTC - in response to Message 40377.  

Please let us know when we can run CMS again. Thank you.
ID: 40378
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 1001
Credit: 46,149,660
RAC: 6,849
Message 40379 - Posted: 8 Nov 2019, 19:41:18 UTC - in response to Message 40349.  

At what point will CERN be able to generate CMS jobs themselves, so you would not be required to submit batches?

That's an ever-moving target... Every time we think we're close, the goal-line gets moved again. Six months ago, I would have said now. Now... :-(


As you know, we run version v49.00 over at LHC-dev (average computing 58 GigaFLOPS) and here (average computing 312 GigaFLOPS) on Windows. The only problem I have had in the last few days is that tasks need to get to "HTCondor ping 0" within 15 minutes, or they end up as one of the many different computer errors. Once they do get running, they work fine, even with my ISP's bird bath on a pole (a satellite dish that runs like 1995 dial-up).

As for the goal-line, is it because we use the NFL version with a crossbar and uprights?
If it is, maybe we should switch to that other version where they kick a round ball into a net (or maybe a FT line, just to make it easier).

(ok I better get off here so that bird bath can d/l the new tasks in the near future)
ID: 40379
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1863
Credit: 127,778,094
RAC: 92,067
Message 40380 - Posted: 8 Nov 2019, 20:55:46 UTC - in response to Message 40377.  

All tasks are failing with EXIT_NO_SUB_TASKS since 6:30 UTC this morning.

Works fine again since late morning.
ID: 40380
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 817
Credit: 5,717,880
RAC: 330
Message 40388 - Posted: 10 Nov 2019, 13:37:34 UTC - in response to Message 40380.  

All tasks are failing with EXIT_NO_SUB_TASKS since 6:30 UTC this morning.

Works fine again since late morning.

There was a small hiatus switching between two batches of jobs. People at CERN are investigating but no news yet.
ID: 40388
NOGOOD
Joined: 18 Nov 17
Posts: 117
Credit: 38,007,470
RAC: 23,894
Message 40416 - Posted: 13 Nov 2019, 5:58:11 UTC

All tasks are failing with EXIT_NO_SUB_TASKS again?
ID: 40416
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 40418 - Posted: 13 Nov 2019, 7:17:25 UTC - in response to Message 40416.  

Maybe related to this!?
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5199
ID: 40418
NOGOOD
Joined: 18 Nov 17
Posts: 117
Credit: 38,007,470
RAC: 23,894
Message 40419 - Posted: 13 Nov 2019, 7:41:39 UTC - in response to Message 40418.  

ID: 40419
Erich56
Joined: 18 Dec 15
Posts: 1491
Credit: 37,599,908
RAC: 47,123
Message 40581 - Posted: 22 Nov 2019, 4:17:18 UTC

207 (0x000000CF) EXIT_NO_SUB_TASKS

since about last midnight :-(
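
For anyone reading the status line, 207 decimal and 0xCF hex are the same number. A minimal Python sketch of that formatting follows; the `describe_exit_status` helper and the `KNOWN_EXIT_CODES` table are hypothetical, and only the 207 entry comes from this thread.

```python
# Hypothetical lookup table: only the 207 mapping is taken from this thread.
KNOWN_EXIT_CODES = {207: "EXIT_NO_SUB_TASKS"}


def describe_exit_status(code: int) -> str:
    """Format an exit status as decimal, 8-digit hex, and a name if known."""
    name = KNOWN_EXIT_CODES.get(code, "Unknown error code")
    # Mask to 32 bits so negative codes still print as 8 hex digits.
    return f"{code} (0x{code & 0xFFFFFFFF:08X}) {name}"
```

Any code missing from the table falls back to "Unknown error code", so the table can be extended as other codes are confirmed.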
ID: 40581
Erich56
Joined: 18 Dec 15
Posts: 1491
Credit: 37,599,908
RAC: 47,123
Message 40585 - Posted: 22 Nov 2019, 8:27:52 UTC - in response to Message 40581.  

... since about last midnight :-(
What is also strange: wasn't a mechanism deployed some time ago that would stop sending out tasks once there is a problem with sub-tasks?
It obviously did not work this time :-(
ID: 40585
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 1001
Credit: 46,149,660
RAC: 6,849
Message 40586 - Posted: 22 Nov 2019, 8:56:58 UTC
Last modified: 22 Nov 2019, 8:57:42 UTC

We are having the same problem with these over at the CMS-dev
(well except mine are just giving me Exit status 1 (0x00000001) Unknown error code)

I suspended mine but I see 207 (0x000000CF) EXIT_NO_SUB_TASKS running there

It is Friday, so I hope we get this fixed before the end of the day.
ID: 40586
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 817
Credit: 5,717,880
RAC: 330
Message 40587 - Posted: 22 Nov 2019, 11:00:25 UTC - in response to Message 40586.  
Last modified: 22 Nov 2019, 11:03:19 UTC

We are having the same problem with these over at the CMS-dev
(well except mine are just giving me Exit status 1 (0x00000001) Unknown error code)

I suspended mine but I see 207 (0x000000CF) EXIT_NO_SUB_TASKS running there

It is Friday, so I hope we get this fixed before the end of the day.

There was a problem with the Oracle databases at CERN overnight, which stopped job submission. According to https://cern.service-now.com/service-portal/view-outage.do?n=OTG0053449 (if you can reach it) a workaround has been implemented. One of my machines is running tasks but still not getting jobs. Probably best to set No New Tasks until we can verify everything is working again.
[Added] Our WMAgent is down, with a database-connect error. I'll ask Alan to tickle it. [/Added]
ID: 40587
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 817
Credit: 5,717,880
RAC: 330
Message 40589 - Posted: 22 Nov 2019, 12:52:43 UTC - in response to Message 40587.  

OK, we've tickled the tiger's tail and jobs are available again. Have at it!
ID: 40589
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 1001
Credit: 46,149,660
RAC: 6,849
Message 40597 - Posted: 22 Nov 2019, 17:37:00 UTC
Last modified: 22 Nov 2019, 17:41:15 UTC

I just gave one a try and still no luck, and I don't see any new Valids anywhere, Ivan.
(I will give it another try later)
ID: 40597
Erich56
Joined: 18 Dec 15
Posts: 1491
Credit: 37,599,908
RAC: 47,123
Message 40598 - Posted: 22 Nov 2019, 17:57:39 UTC - in response to Message 40597.  

I've had a few running for about 4 hours now, so far successfully.
ID: 40598