Message boards : Theory Application : NO_SUB_TASKS for Theory
Joined: 15 Jun 08 Posts: 2413 Credit: 226,502,518 RAC: 131,914 |
Theory has had an unusually high error rate since yesterday: EXIT_NO_SUB_TASKS. Will this stabilise at short notice, or should Theory be set to NNT? |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I had the same problem until a few hours ago. They are all OK now. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
And again today since about 13:00 UTC, or at least that's when my first Theory task errored, having finished the last of my Sixtracks. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I am now getting "Guest Log: [ERROR] No jobs were available to run." on both Theory and LHCb. I can only run CMS at the moment, since I don't have ATLAS selected, and they are not sending Sixtrack now. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,752,900 RAC: 8,696 |
Today was a Theory task disaster. I stopped counting my *Server Error* tasks at 120, and there would have been more if I had left the 8-core PCs running and able to get new tasks, so they just lost all the ones each PC had loaded... but I do have my quad-core PCs set to get new tasks all the time, so I imagine they got several of those *Server Errors*. Good thing I had one of my 8-cores running CMS, since it had Valids all day. I was surprised the evil server even gave me new tasks after I checked and saw all of mine were gone and the PCs were just sitting there doing nothing. Volunteer Mad Scientist For Life |
Joined: 1 Sep 04 Posts: 139 Credit: 2,579 RAC: 0 |
Seems to be running again since a while. We have had all sorts of CERN problems following the major network collapse two days ago. Apologies to all! |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
"Seems to be running again since a while." No, definitely NOT. All my tasks errored out after about 18 minutes with "207 (0x000000CF) EXIT_NO_SUB_TASKS" :-((( I am surprised that no one back there has noticed this and tried to rectify the problem. |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
All my recent tasks are ending in EXIT_NO_SUB_TASKS -> https://lhcathome.cern.ch/lhcathome/results.php?hostid=10360630&offset=0&show_names=0&state=6&appid=13 |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
Same here since this afternoon. This seems to be the same problem that we had a few days ago. There was said to be some mechanism that stops task production once no jobs are available. Obviously, that does not work either. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
The starter and running logs for Theory tasks currently running on my host show they absolutely are getting jobs. I can even see the number of processed events incrementing in the running log. From those observations, the only sensible explanation is: 1) the tasks are indeed receiving jobs, 2) the jobs are progressing (processing events) normally, and 3) the NO_SUB_TASKS error is itself an error; in other words, it is being generated erroneously. |
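The kind of check described above can be sketched as a small script that scans a task's running log for an incrementing event counter. This is only an illustrative sketch: the log line format (and any path you would read it from) is an assumption here, not the actual Theory log format.

```python
import re

# Hypothetical helper: decide whether a Theory VM is actually receiving
# and processing a job by looking for event-progress lines in its
# running log. The "event N" line pattern is an assumed format for
# illustration; the real running log may phrase this differently.
EVENT_RE = re.compile(r"\bevent\s+(\d+)\b", re.IGNORECASE)

def events_processed(log_text: str) -> int:
    """Return the highest event number mentioned in the log, or 0."""
    numbers = [int(m.group(1)) for m in EVENT_RE.finditer(log_text)]
    return max(numbers, default=0)

def vm_has_job(log_text: str) -> bool:
    """A VM whose event counter is incrementing clearly has a job."""
    return events_processed(log_text) > 0

# Sample log excerpt (made up for the demonstration):
sample = """\
Starting job...
Processing event 100
Processing event 200
"""
print(vm_has_job(sample), events_processed(sample))  # True 200
```

A wrapper that reports EXIT_NO_SUB_TASKS for a task whose log passes this check would support point 3 above: the error flag, not the job delivery, is what failed.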
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
"1) the tasks are indeed receiving jobs" On all my PCs on which I was running Theory, this was definitely NOT the case. Excerpt from stderr: 2018-09-26 18:23:33 (8640): VM Completion Message: No jobs were available to run. The complete stderr can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207211751 or https://lhcathome.cern.ch/lhcathome/result.php?resultid=207209823 or https://lhcathome.cern.ch/lhcathome/result.php?resultid=207212463 |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
The VM is easily fooled. Did you check the running logs or the starter logs? They are not so easily fooled. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,752,900 RAC: 8,696 |
We do have some of these tasks that get jobs and finish Valid, but we also have twice as many or more with no jobs, ending up as a major waste of time. https://lhcathome.cern.ch/lhcathome/hosts_user.php?userid=129087 Many examples just today on the hosts running Theory tasks. |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
...Thanks bronco for thinking with us, but point 3 is very unlikely, so there should be another reason, such as: a low number of jobs, with new ones being added too slowly, so sometimes you get one and sometimes not; or jobs coming from different servers, one of which could be malfunctioning. Btw: your valid tasks don't reach the 12 hours elapsed time and have been killed early because of no new jobs. On your machine 10541232 all Theory tasks of 23 Sep are valid, but have run too short. Even the 'normal' 10 minutes wait for Condor is not reached. Had you killed those jobs yourself to get a valid instead of an invalid task? Btw2: I've 2 VMs now with running jobs. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I was waiting for someone to mention the short runs and valids; you are very observant :) The short runs you are seeing are shut down gracefully via one of two mechanisms in my babysitter script: manual or automatic. My babysitter script has grown into a GUI app. Each running Theory, LHCb and CMS task has an associated clickable toggle button which, when clicked, raises/lowers a flag that causes the script to gracefully shut down the task when the current job completes. I use that button rarely. It's handy for ending tasks relatively quickly so I can install OS updates, do a reboot, etc. I also use it when an LHC application is not running the way it should, like in recent days, to run more than the usual number of tasks so I can make more observations of how they start and finish. That's the manual method. The automatic methods are: 1) the task doesn't get a job within 10 minutes (I was getting rather sick and tired of tasks running for 10 hours and not processing even 1 event; though there was mention of such a mechanism built into the tasks, it seemed very unreliable); 2) the script detects a looping job (usually a Sherpa); 3) extrapolation from the time required to process the current number of processed events indicates the job will not complete before the 18-hour task limit, which causes graceful shutdown; 4) a job starts after the 10-hour task mark (jobs started before the 10-hour mark are subject to graceful shutdown per 3 above). I don't end them gracefully just to get a valid and some credits. I couldn't care less about the credits or my valid:invalid ratio. |
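The automatic rules above amount to a small decision function that a monitor could evaluate periodically for each task. The sketch below is only an illustration of that logic under stated assumptions; the function, field names, and thresholds are hypothetical, not the poster's actual script, and rule 2 (looping-job detection) is omitted because it requires log inspection.

```python
from dataclasses import dataclass

# Illustrative thresholds taken from the post above.
NO_JOB_GRACE_MIN = 10   # rule 1: no job within 10 minutes
TASK_LIMIT_H = 18       # rule 3: the 18-hour task limit
LATE_START_H = 10       # rule 4: job starting after the 10-hour mark

@dataclass
class TaskState:
    minutes_elapsed: float   # time since the task (VM) started
    job_started_at_h: float  # task age (hours) when the current job began; -1 = no job yet
    events_done: int         # events processed so far by the current job
    events_total: int        # events the current job must process
    job_runtime_h: float     # hours the current job has been running

def should_shutdown(t: TaskState) -> bool:
    """Return True when one of the automatic graceful-shutdown rules fires."""
    # Rule 1: the task never got a job within the grace period.
    if t.job_started_at_h < 0:
        return t.minutes_elapsed >= NO_JOB_GRACE_MIN
    # Rule 4: the job started too late to finish within the task limit.
    if t.job_started_at_h >= LATE_START_H:
        return True
    # Rule 3: extrapolate job completion time from event progress.
    if t.events_done > 0:
        projected_h = t.job_runtime_h * t.events_total / t.events_done
        if t.job_started_at_h + projected_h > TASK_LIMIT_H:
            return True
    # Rule 2 (detecting a looping job, usually Sherpa) would need log
    # inspection and is omitted from this sketch.
    return False

# A task 15 minutes old with no job triggers rule 1.
print(should_shutdown(TaskState(15, -1, 0, 0, 0)))
```

Ending a task when the current job completes, rather than killing the VM outright, is what lets these short runs still validate: the job's results are reported cleanly before the task exits.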
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
They're back. |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
From what the list of tasks shows me this morning: yesterday evening, there was again quite a number of tasks that did not receive any jobs: 207 (0x000000CF) EXIT_NO_SUB_TASKS |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
"from what the list of tasks shows me this morning: yesterday evening, there was again quite a number of tasks which did not receive any jobs" My morning is different: four new tasks, all getting jobs. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
Not sure if it's significant, but I'll mention it again anyway: all NEW tasks that have started recently (since c.19:00 UTC) have been unable to get jobs, but those that have been running since before the blockage seem to be getting new jobs. An already connected VM will get jobs, but a NEW VM is unable to make that connection. (I've basically said the same thing twice there, but it made it clearer in my head) |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
Like the day before, yesterday evening all new tasks got NO jobs (between around 20:30 and 22:30 UTC). Seems to be some kind of pattern now. |
©2024 CERN