Message boards : Theory Application : NO_SUB_TASKS for Theory

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,536
RAC: 137,496
Message 34416 - Posted: 19 Feb 2018, 13:34:14 UTC

Theory has had an unusually high error rate since yesterday:
EXIT_NO_SUB_TASKS

Will this stabilise soon, or should Theory be set to NNT (No New Tasks)?
ID: 34416
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 34418 - Posted: 19 Feb 2018, 16:30:34 UTC - in response to Message 34416.  

I had the same problem until a few hours ago. They are all OK now.
ID: 34418
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 34897 - Posted: 6 Apr 2018, 20:59:30 UTC - in response to Message 34416.  
Last modified: 6 Apr 2018, 21:02:42 UTC

And again today, since around 13:00 UTC,
or at least that's when my first Theory task errored, after I had finished the last of my Sixtracks.
ID: 34897
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 34898 - Posted: 6 Apr 2018, 22:55:10 UTC

I am now getting "Guest Log: [ERROR] No jobs were available to run." on both Theory and LHCb.
I can only run CMS at the moment, since I don't have ATLAS selected, and they are not sending Sixtrack now.
ID: 34898
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,501,728
RAC: 4,157
Message 34900 - Posted: 7 Apr 2018, 4:44:52 UTC
Last modified: 7 Apr 2018, 4:45:37 UTC

Today was a Theory task disaster.

I stopped counting my *Server Error* tasks at 120, and there would have been more if I had left the 8-core PCs able to get new tasks; as it is, they just lost all the tasks each PC had loaded. I do have my quad-core PCs set to get new tasks all the time, though, so I imagine they got several of those *Server Errors*.

Good thing I had one of my 8-cores running CMS, since it just had Valids all day.

I'm surprised the evil server even gave me new tasks after I checked and saw all of mine were gone and the PCs were just sitting there doing nothing.
Volunteer Mad Scientist For Life
ID: 34900
Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 34901 - Posted: 7 Apr 2018, 7:28:59 UTC - in response to Message 34900.  
Last modified: 7 Apr 2018, 7:30:56 UTC

Seems to be running again since a while. We have had all sorts of CERN problems following the major network collapse two days ago. Apologies to all!
ID: 34901
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,379,507
RAC: 102,089
Message 34904 - Posted: 7 Apr 2018, 13:09:20 UTC - in response to Message 34901.  

Seems to be running again since a while.
no, definitely NOT. All my tasks errored out after about 18 minutes with "207 (0x000000CF) EXIT_NO_SUB_TASKS" :-(((
I am surprised that no one back there has noticed this and tried to rectify the problem.
ID: 34904
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 36894 - Posted: 26 Sep 2018, 16:14:45 UTC

ID: 36894
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,379,507
RAC: 102,089
Message 36895 - Posted: 26 Sep 2018, 16:36:41 UTC - in response to Message 36894.  

same here since this afternoon.

This seems to be the same problem which we had a few days ago. There was said to be some mechanism which stops task production once no jobs are available. Obviously, this does not work either.
ID: 36895
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36897 - Posted: 26 Sep 2018, 17:11:13 UTC - in response to Message 36895.  
Last modified: 26 Sep 2018, 17:12:52 UTC

The starter and running logs for the Theory tasks currently running on my host show they absolutely are getting jobs. I can even see the number of processed events incrementing in the running log. From those observations the only sensible explanation is:

1) the tasks are indeed receiving jobs
2) the jobs are progressing (processing events) normally
3) the NO_SUB_TASKS ERROR is itself an error, in other words it is being generated erroneously
ID: 36897
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,379,507
RAC: 102,089
Message 36898 - Posted: 26 Sep 2018, 18:12:32 UTC - in response to Message 36897.  

1) the tasks are indeed receiving jobs
on all my PCs on which I was running Theory, this was definitely NOT the case.
Excerpt from stderr:
2018-09-26 18:23:33 (8640): VM Completion Message: No jobs were available to run

The complete stderr can be seen here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=207211751
or
https://lhcathome.cern.ch/lhcathome/result.php?resultid=207209823
or
https://lhcathome.cern.ch/lhcathome/result.php?resultid=207212463
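
For anyone who wants to check a saved stderr for this message without reading it by eye, here is a minimal sketch in Python. It assumes the line format matches the single excerpt quoted above (timestamp, a number in parentheses, then "VM Completion Message:"); other stderr files may look different, and the function names are made up for this example.

```python
import re

# Pattern modelled on the one stderr line quoted in this post; the real
# wrapper output may contain other variants, so treat this as an assumption.
COMPLETION_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"VM Completion Message: (?P<msg>.*)$"
)


def completion_messages(stderr_text):
    """Return (timestamp, message) pairs for every VM completion line."""
    found = []
    for line in stderr_text.splitlines():
        match = COMPLETION_RE.match(line.strip())
        if match:
            found.append((match.group("ts"), match.group("msg").strip()))
    return found


def ended_without_jobs(stderr_text):
    """True if any completion message reports that no jobs were available."""
    return any(
        "No jobs were available" in msg
        for _, msg in completion_messages(stderr_text)
    )


sample = (
    "2018-09-26 18:23:33 (8640): VM Completion Message: "
    "No jobs were available to run"
)
print(completion_messages(sample))
print(ended_without_jobs(sample))
```

Note that this only reports what the VM wrote to stderr; it says nothing about whether the running or starter logs inside the VM agree.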
ID: 36898
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36899 - Posted: 26 Sep 2018, 18:24:20 UTC - in response to Message 36898.  

The VM is easily fooled.
Did you check the running logs or the starter logs? They are not so easily fooled.
ID: 36899
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,501,728
RAC: 4,157
Message 36900 - Posted: 26 Sep 2018, 18:53:50 UTC

We do have some of these tasks that get jobs and finish Valid, but we also have twice as many or more that get no jobs and just end up being a major waste of time.

https://lhcathome.cern.ch/lhcathome/hosts_user.php?userid=129087

Many examples just today on the hosts running Theory tasks.
ID: 36900
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 36902 - Posted: 26 Sep 2018, 19:26:01 UTC - in response to Message 36897.  
Last modified: 26 Sep 2018, 19:27:23 UTC

...
3) the NO_SUB_TASKS ERROR is itself an error, in other words it is being generated erroneously
Thanks, bronco, for thinking along with us, but point 3 is very unlikely, so there must be another reason, such as:
- a low number of jobs, with new ones being added too slowly, so sometimes you get one and sometimes not, or
- getting jobs from different servers, one of which may not be functioning well.

Btw: your valid tasks don't reach the 12 hours of elapsed time and have been killed early because there were no new jobs.
On your machine 10541232 all Theory tasks of 23 Sep are valid, but their run times are too short. Even the 'normal' 10-minute wait for Condor is not reached.
Did you kill those jobs yourself to get a valid instead of an invalid task?

Btw2: I now have 2 VMs with running jobs.
ID: 36902
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36903 - Posted: 26 Sep 2018, 20:16:55 UTC - in response to Message 36902.  

I was waiting for someone to mention the short runs and valids. You are very observant :)
The short runs you are seeing are shut down gracefully via one of two mechanisms in my babysitter script: manual or automatic.
My babysitter script has grown into a GUI app. Each running Theory, LHCb and CMS task has an associated toggle button which, when clicked, raises or lowers a flag that causes the script to gracefully shut down the task when the current job completes. I use that button rarely. It's handy for ending tasks relatively quickly so I can install OS updates, do a reboot, etc. I also use it when an LHC application is not running the way it should, as in recent days for example, to run more than the usual number of tasks so I can make more observations of how they start and finish. That's the manual method.

Automatic methods are:
1) Task doesn't get a job within 10 minutes. I was getting rather sick and tired of tasks running for 10 hours and not processing even 1 event. Though there was mention of such a mechanism built into the tasks, it seemed to be very unreliable.

2) Detects looping job (usually a Sherpa)

3) Extrapolation from the time required to process the current number of processed events indicates the job will not complete before the 18 hour task limit causes graceful shutdown.

4) Job starts after the 10 hour task mark. Jobs started before the 10 hour mark are subject to graceful shutdown for 3) above.

I don't end them gracefully just to get a valid and some credits. I couldn't care less about the credits or my valid:invalid ratio.
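
The four automatic rules above could be sketched roughly like this in Python. This is not the actual babysitter script: every name here (`TaskState`, `shutdown_reason`, the constants) is hypothetical, the loop detector is assumed to exist elsewhere, and the extrapolation is a plain linear estimate from the events processed so far.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the rules described in this post.
NO_JOB_GRACE_H = 10 / 60   # rule 1: ten minutes without a job
LATE_START_H = 10.0        # rule 4: job started after the 10 hour task mark
TASK_LIMIT_H = 18.0        # rule 3: task limit that forces graceful shutdown


@dataclass
class TaskState:
    """Hypothetical snapshot of one running task, all times in hours."""
    task_elapsed_h: float            # time since the task started
    job_start_h: Optional[float]     # task time at which the job started, None if no job yet
    events_done: int = 0             # events processed so far by the current job
    events_total: int = 0            # events the current job has to process
    looping: bool = False            # result of an external loop detector (not shown)


def shutdown_reason(s: TaskState) -> Optional[str]:
    """Return why the task should be shut down gracefully, or None to keep it."""
    if s.job_start_h is None:
        # Rule 1: no job arrived within the grace period.
        if s.task_elapsed_h >= NO_JOB_GRACE_H:
            return "no job within 10 minutes"
        return None
    if s.looping:
        # Rule 2: the job is looping.
        return "looping job"
    if s.job_start_h > LATE_START_H:
        # Rule 4: the job started after the 10 hour mark.
        return "job started after the 10 hour mark"
    if s.events_done > 0 and s.events_total > 0:
        # Rule 3: linear extrapolation of the job's finishing time.
        job_elapsed_h = s.task_elapsed_h - s.job_start_h
        projected_end_h = s.job_start_h + job_elapsed_h * s.events_total / s.events_done
        if projected_end_h > TASK_LIMIT_H:
            return "job will not finish before the 18 hour limit"
    return None


print(shutdown_reason(TaskState(task_elapsed_h=0.25, job_start_h=None)))
print(shutdown_reason(TaskState(6.0, 2.0, events_done=1000, events_total=10000)))
print(shutdown_reason(TaskState(6.0, 2.0, events_done=5000, events_total=10000)))
```

In the second call the job has done 1000 of 10000 events in 4 hours, so the estimate lands at hour 42 and the rule fires; in the third it has done half, the estimate is hour 10, and the task is left alone.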
ID: 36903
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36917 - Posted: 28 Sep 2018, 12:18:19 UTC

They're back.
ID: 36917
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,379,507
RAC: 102,089
Message 36920 - Posted: 29 Sep 2018, 4:19:42 UTC - in response to Message 36917.  

from what the list of tasks shows me this morning: yesterday evening, there was again quite a number of tasks which did not receive any jobs:

207 (0x000000CF) EXIT_NO_SUB_TASKS
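
For reference, the two numbers in that line are the same exit code written in decimal and in hexadecimal. A quick check in Python (only the conversion is shown; nothing here queries BOINC):

```python
exit_code = 207

# Format the decimal code as an 8-digit, zero-padded hex value.
print(f"{exit_code} (0x{exit_code:08X})")

# Parse the hex form back and compare it with the decimal one.
print(int("0x000000CF", 16) == exit_code)
```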
ID: 36920
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 36921 - Posted: 29 Sep 2018, 7:14:49 UTC - in response to Message 36920.  

from what the list of tasks shows me this morning: yesterday evening, there was again quite a number of tasks which did not receive any jobs:

207 (0x000000CF) EXIT_NO_SUB_TASKS
My morning is different: Four new tasks all getting jobs.
ID: 36921
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 36928 - Posted: 30 Sep 2018, 20:46:24 UTC

Not sure if it's significant, but I'll mention it again anyway:
All NEW tasks that have started recently (since c.19:00 UTC) have been unable to get jobs but those that have been running since before the blockage seem to be getting new jobs.
An already connected VM will get jobs but a NEW VM is unable to make that connection.
(I've basically said the same thing twice there but it made it clearer in my head)
ID: 36928
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,379,507
RAC: 102,089
Message 36929 - Posted: 1 Oct 2018, 3:22:56 UTC - in response to Message 36928.  

As on the day before, yesterday evening all new tasks again got NO jobs (between around 20:30 and 22:30 UTC). There seems to be some kind of pattern now.
ID: 36929