Message boards : Theory Application : NO_SUB_TASKS for Theory
Joined: 15 Jun 08 Posts: 2413 Credit: 226,502,518 RAC: 131,914 |
Theory has had an unusually high error rate since yesterday: EXIT_NO_SUB_TASKS. Will this stabilise at short notice, or should Theory be set to NNT? |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I had the same problem until a few hours ago. They are all OK now. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
And again today since about 13:00 UTC, or at least that's when my first Theory task errored, having finished the last of my Sixtracks. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I am now getting "Guest Log: [ERROR] No jobs were available to run." on both Theory and LHCb. I can only run CMS at the moment, since I don't have ATLAS selected, and they are not sending Sixtrack now. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,752,900 RAC: 8,696 |
Today was a Theory task disaster. I stopped counting my *Server Error* tasks at 120, and there would have been more if I had left the 8-core PCs running and able to get new tasks, so they just lost all the ones each PC had loaded... but I do have my quad-core PCs set to get new tasks all the time, so I imagine they got several of those *Server Errors*. Good thing I had one of my 8-cores running CMS, since it had Valids all day. I was surprised the evil server even gave me new tasks after I checked and saw all of mine were gone and the PCs were just sitting there doing nothing. Volunteer Mad Scientist For Life |
Joined: 1 Sep 04 Posts: 139 Credit: 2,579 RAC: 0 |
Seems to be running again since a while. We have had all sorts of CERN problems following the major network collapse two days ago. Apologies to all! |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
"Seems to be running again since a while." No, definitely NOT. All my tasks errored out after about 18 minutes with "207 (0x000000CF) EXIT_NO_SUB_TASKS" :-((( I am surprised that no one back there has noticed this and tried to rectify the problem. |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
All my recent tasks are ending in EXIT_NO_SUB_TASKS -> https://lhcathome.cern.ch/lhcathome/results.php?hostid=10360630&offset=0&show_names=0&state=6&appid=13 |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
Same here since this afternoon. This seems to be the same problem that we had a few days ago. There was said to be some mechanism that stops task production once no jobs are available. Obviously, that does not work either. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
The starter and running logs for Theory tasks currently running on my host show they absolutely are getting jobs. I can even see the number of processed events incrementing in the running log. From those observations, the only sensible explanation is: 1) the tasks are indeed receiving jobs, 2) the jobs are progressing (processing events) normally, and 3) the NO_SUB_TASKS error is itself an error; in other words, it is being generated erroneously. |
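The kind of check described above can be sketched as a small script that scans a task's running log for an incrementing event counter. This is only an illustrative sketch: the log line format (and any path you would read it from) is an assumption here, not the actual Theory log format.

```python
import re

# Hypothetical helper: decide whether a Theory VM is actually receiving
# and processing a job by looking for event-progress lines in its
# running log. The "event N" line pattern is an assumed format for
# illustration; the real running log may phrase this differently.
EVENT_RE = re.compile(r"\bevent\s+(\d+)\b", re.IGNORECASE)

def events_processed(log_text: str) -> int:
    """Return the highest event number mentioned in the log, or 0."""
    numbers = [int(m.group(1)) for m in EVENT_RE.finditer(log_text)]
    return max(numbers, default=0)

def vm_has_job(log_text: str) -> bool:
    """A VM whose event counter is incrementing clearly has a job."""
    return events_processed(log_text) > 0

# Sample log excerpt (made up for the demonstration):
sample = """\
Starting job...
Processing event 100
Processing event 200
"""
print(vm_has_job(sample), events_processed(sample))  # True 200
```

A wrapper that reports EXIT_NO_SUB_TASKS for a task whose log passes this check would support point 3 above: the error flag, not the job delivery, is what failed.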
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
"1) the tasks are indeed receiving jobs" On all my PCs on which I was running Theory, this was definitely NOT the case. Excerpt from stderr: 2018-09-26 18:23:33 (8640): VM Completion Message: No jobs were available to run. The complete stderr can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207211751 or https://lhcathome.cern.ch/lhcathome/result.php?resultid=207209823 or https://lhcathome.cern.ch/lhcathome/result.php?resultid=207212463 |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
The VM is easily fooled. Did you check the running logs or the starter logs? They are not so easily fooled. |
Joined: 24 Oct 04 Posts: 1127 Credit: 49,752,900 RAC: 8,696 |
We do have some of these tasks that get jobs and finish Valid, but we also have twice as many or more with no jobs, ending up as a major waste of time. https://lhcathome.cern.ch/lhcathome/hosts_user.php?userid=129087 Many examples just today on the hosts running Theory tasks. |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
...Thanks bronco for thinking with us, but point 3 is very unlikely, so there should be another reason, such as: a low number of jobs, with new ones being added too slowly, so sometimes you get one and sometimes not; or jobs coming from different servers, one of which could be malfunctioning. Btw: your valid tasks don't reach the 12 hours elapsed time and have been killed early because of no new jobs. On your machine 10541232 all Theory tasks of 23 Sep are valid, but have run too short. Even the 'normal' 10 minutes wait for Condor is not reached. Had you killed those jobs yourself to get a valid instead of an invalid task? Btw2: I've 2 VMs now with running jobs. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I was waiting for someone to mention the short runs and valids; you are very observant :) The short runs you are seeing are shut down gracefully via one of two mechanisms in my babysitter script: manual or automatic. My babysitter script has grown into a GUI app. Each running Theory, LHCb and CMS task has an associated clickable toggle button which, when clicked, raises/lowers a flag that causes the script to gracefully shut down the task when the current job completes. I use that button rarely. It's handy for ending tasks relatively quickly so I can install OS updates, do a reboot, etc. I also use it when an LHC application is not running the way it should, like in recent days, to run more than the usual number of tasks so I can make more observations of how they start and finish. That's the manual method. The automatic methods are: 1) the task doesn't get a job within 10 minutes (I was getting rather sick and tired of tasks running for 10 hours and not processing even 1 event; though there was mention of such a mechanism built into the tasks, it seemed very unreliable); 2) the script detects a looping job (usually a Sherpa); 3) extrapolation from the time required to process the current number of processed events indicates the job will not complete before the 18-hour task limit, which causes graceful shutdown; 4) a job starts after the 10-hour task mark (jobs started before the 10-hour mark are subject to graceful shutdown per 3 above). I don't end them gracefully just to get a valid and some credits. I couldn't care less about the credits or my valid:invalid ratio. |
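The automatic rules above amount to a small decision function that a monitor could evaluate periodically for each task. The sketch below is only an illustration of that logic under stated assumptions; the function, field names, and thresholds are hypothetical, not the poster's actual script, and rule 2 (looping-job detection) is omitted because it requires log inspection.

```python
from dataclasses import dataclass

# Illustrative thresholds taken from the post above.
NO_JOB_GRACE_MIN = 10   # rule 1: no job within 10 minutes
TASK_LIMIT_H = 18       # rule 3: the 18-hour task limit
LATE_START_H = 10       # rule 4: job starting after the 10-hour mark

@dataclass
class TaskState:
    minutes_elapsed: float   # time since the task (VM) started
    job_started_at_h: float  # task age (hours) when the current job began; -1 = no job yet
    events_done: int         # events processed so far by the current job
    events_total: int        # events the current job must process
    job_runtime_h: float     # hours the current job has been running

def should_shutdown(t: TaskState) -> bool:
    """Return True when one of the automatic graceful-shutdown rules fires."""
    # Rule 1: the task never got a job within the grace period.
    if t.job_started_at_h < 0:
        return t.minutes_elapsed >= NO_JOB_GRACE_MIN
    # Rule 4: the job started too late to finish within the task limit.
    if t.job_started_at_h >= LATE_START_H:
        return True
    # Rule 3: extrapolate job completion time from event progress.
    if t.events_done > 0:
        projected_h = t.job_runtime_h * t.events_total / t.events_done
        if t.job_started_at_h + projected_h > TASK_LIMIT_H:
            return True
    # Rule 2 (detecting a looping job, usually Sherpa) would need log
    # inspection and is omitted from this sketch.
    return False

# A task 15 minutes old with no job triggers rule 1.
print(should_shutdown(TaskState(15, -1, 0, 0, 0)))
```

Ending a task when the current job completes, rather than killing the VM outright, is what lets these short runs still validate: the job's results are reported cleanly before the task exits.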
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
They're back. |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
From what the list of tasks shows me this morning: yesterday evening, there was again quite a number of tasks that did not receive any jobs: 207 (0x000000CF) EXIT_NO_SUB_TASKS |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
"from what the list of tasks shows me this morning: yesterday evening, there was again quite a number of tasks which did not receive any jobs" My morning is different: four new tasks, all getting jobs. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
Not sure if it's significant, but I'll mention it again anyway: all NEW tasks that have started recently (since c.19:00 UTC) have been unable to get jobs, but those that have been running since before the blockage seem to be getting new jobs. An already connected VM will get jobs, but a NEW VM is unable to make that connection. (I've basically said the same thing twice there, but it made it clearer in my head) |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,909,318 RAC: 121,840 |
Like the day before, yesterday evening all new tasks got NO jobs (between around 20:30 and 22:30 UTC). Seems to be some kind of pattern now. |
©2024 CERN