Thread '207 (0x000000CF) EXIT_NO_SUB

Author	Message
djoser Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0	Message 37169 - Posted: 2 Nov 2018, 14:33:06 UTC Hi there! For the past few minutes, all LHCb tasks have been failing with error code 207! Regards, djoser. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,886,196 RAC: 81,830	Message 37173 - Posted: 2 Nov 2018, 16:29:15 UTC - in response to Message 37169. For the past 6-7 hours, there have been problems with the LHCb tasks. First, none were available; then new ones could be downloaded, but they error out with 207 (0x000000CF) EXIT_NO_SUB_TASKS. So there are tasks, but no jobs. Some time ago, I read about a newly installed mechanism that would stop the creation of new tasks once there are no jobs. Obviously, this does not work.
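For readers unfamiliar with the notation: the error line shows one exit code twice, first in decimal and then as a zero-padded 32-bit hexadecimal value. A minimal Python sketch of that formatting follows; the helper name `format_exit_code` is made up for illustration and is not part of BOINC.

```python
def format_exit_code(code: int, name: str) -> str:
    """Render an exit code in the 'decimal (0xHEX) NAME' style seen in this thread.

    Hypothetical helper for illustration only, not a BOINC function.
    """
    # :08X pads the hexadecimal value to eight upper-case digits.
    return f"{code} (0x{code:08X}) {name}"


line = format_exit_code(207, "EXIT_NO_SUB_TASKS")
print(line)
```

Decimal 207 is 12 * 16 + 15, which is CF in hexadecimal, so both numbers in the log line refer to the same code.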

djoser Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0	Message 37175 - Posted: 2 Nov 2018, 17:05:45 UTC The statistics on https://lhcathome.cern.ch/lhcathome/lhcb_job.php don't look good either... Could someone from LHC-Team please look into that? I hope they haven't left for the weekend yet.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 37176 - Posted: 2 Nov 2018, 17:16:04 UTC This is actually a positive development. Instead of doing nothing for many hours, then exiting, awarding credits for nothing and leaving volunteers with the false idea that they did something useful, the tasks now error out and give no credits, which might eventually cause volunteers to disable LHCb, as they should have done months ago.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,886,196 RAC: 81,830	Message 37177 - Posted: 2 Nov 2018, 17:20:10 UTC - in response to Message 37175.
djoser wrote: Could someone from LHC-Team please look into that?
Given that the LHCb jobs have been rather erroneous for several weeks now (for example: 14 hours total runtime with 45 minutes CPU time, and so on) and no one at LHC noticed, I'm afraid that your request for someone from the Team to look into the current problem will not be fulfilled :-(

djoser Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0	Message 37268 - Posted: 7 Nov 2018, 17:27:21 UTC - in response to Message 37177.
Erich56 wrote: I'm afraid that your request for someone from the Team to look into the current problem will not be fulfilled :-(
Well, Erich, it's been a full 4 days now and guess what? You are right! It's a pity...

BITLab Argo Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0	Message 37269 - Posted: 7 Nov 2018, 18:15:46 UTC - in response to Message 37268.
Surely, if they run out of work because we've crunched it all, that's a good thing. It would be nice if someone said something, though! What puzzles me is that the "Server status" page has
Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 2.05 (0.29 - 18.12)
which suggests that a few people must be getting tasks, else surely all the values would be around 0.3h (about 20 min) by now?

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,886,196 RAC: 81,830	Message 37270 - Posted: 7 Nov 2018, 19:39:25 UTC - in response to Message 37269.
BITLab Argo wrote: ... which suggests that a few people must be getting tasks ...
Yes, once in a while new tasks are available, but they don't make much sense, because in most cases they error out for lack of jobs - 207 (0x000000CF) EXIT_NO_SUB_TASKS. And if the tasks get jobs, overall CPU usage is usually at most about 5-10 % of the total runtime of the task. Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-(

BITLab Argo Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0	Message 37281 - Posted: 8 Nov 2018, 9:55:28 UTC - in response to Message 37270.
Sorry, I got the lingo wrong. It should have been:
Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 2.05 (0.29 - 18.12)
... which suggests that a few people must be getting jobs/sub-tasks, else all LHCb tasks would fail after 20 minutes (207 (0x000000CF) EXIT_NO_SUB_TASKS)?
And today the page reports
Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 4.96 (0.37 - 18.12)
so the server sees more tasks running longer.
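A rough way to read those two status lines is to ask how many of the last 100 tasks would have to hit the 18.12 h maximum to pull the average up that far. The sketch below makes a deliberately crude assumption that is not from the thread: every task runs either the reported minimum or the reported maximum, with nothing in between. The function name is hypothetical.

```python
def share_of_long_tasks(average: float, shortest: float, longest: float) -> float:
    """Fraction of tasks at `longest` hours needed to produce `average`,
    assuming (crudely) that every task runs either `shortest` or `longest` hours."""
    return (average - shortest) / (longest - shortest)


# (label, average, min, max) as reported on the "Server status" page in this thread
reports = [
    ("7 Nov", 2.05, 0.29, 18.12),
    ("8 Nov", 4.96, 0.37, 18.12),
]

for label, avg, lo, hi in reports:
    frac = share_of_long_tasks(avg, lo, hi)
    print(f"{label}: roughly {round(frac * 100)} of 100 tasks at the {hi} h maximum")
```

Under that assumption the first report corresponds to about 10 long-running tasks in 100 and the second to about 26, which fits the reading that more tasks are running until the 18 h limit rather than more tasks doing useful work.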

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,901,775 RAC: 110,208	Message 37282 - Posted: 8 Nov 2018, 10:34:20 UTC - in response to Message 37281.
It only shows that some tasks run up to 18h, at which point the watchdog terminates the VM. Unfortunately, LHCb VMs do not always notice that they are doing nothing. The job activity page shows that there has not been a single job for more than 24h: https://lhcathome.cern.ch/lhcathome/lhcb_job.php
BTW: Don't expect an explanation from the LHCb project team. I have asked for one a couple of times since mid-2017 and never got a response.

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 431 Credit: 256,106 RAC: 54	Message 37283 - Posted: 8 Nov 2018, 10:38:42 UTC - in response to Message 37282. The LHCb tasks have been stopped for now.

BITLab Argo Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0	Message 37284 - Posted: 8 Nov 2018, 11:09:19 UTC - in response to Message 37270.
Erich56 wrote: And if the tasks get jobs, overall CPU usage is usually at most about 5-10 % of the total runtime of the task. Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-(
I was seeing slightly better, but the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3 which was broadly in line with the reported task efficiency (CPU-time/wallclock-run-time); that seems too little CPU for actual number-crunching, but way too much for a VM idling or transferring data.
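The task efficiency mentioned here is just CPU time divided by wall-clock run time. As a small worked example, the sketch below applies that ratio to the figures Erich56 quoted earlier in the thread (14 hours total runtime with 45 minutes CPU time); the function name is illustrative only.

```python
def task_efficiency(cpu_minutes: float, wallclock_minutes: float) -> float:
    """CPU time divided by wall-clock run time, both in minutes."""
    return cpu_minutes / wallclock_minutes


# Figures quoted earlier in the thread: 45 min CPU time in a 14 h run.
efficiency = task_efficiency(cpu_minutes=45, wallclock_minutes=14 * 60)
print(f"efficiency: {efficiency:.1%}")
```

That comes to a little over 5 %, at the low end of the 5-10 % range reported above and well below the roughly 0.3 seen on this host.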

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,901,775 RAC: 110,208	Message 37285 - Posted: 8 Nov 2018, 11:52:27 UTC - in response to Message 37284.
Erich56 wrote: And if the tasks get jobs, overall CPU usage is usually at most about 5-10 % of the total runtime of the task. Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-(
BITLab Argo wrote: I was seeing slightly better, but the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3 which was broadly in line with the reported task efficiency (CPU-time/wallclock-run-time); that seems too little CPU for actual number-crunching, but way too much for a VM idling or transferring data.
Which loadavg do you refer to? 1 min, 5 min, 15 min? This is different to CPU-time/wallclock. LHCb's scientific app appears in the process list as a python script and should run at 100 % (or close to). Be aware that the 1st calculation phase of this script runs only for roughly 1 min. I often noticed jobs that dropped to 0 % after that phase for 1-1.5h until the jobs were cancelled and the VM requested the next one. In this case the remaining CPU load is caused by other processes like watchdogs or the condor daemon.
Laurence wrote: The LHCb tasks have been stopped for now.
Best idea of the day.

BITLab Argo Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0	Message 37286 - Posted: 8 Nov 2018, 13:16:36 UTC - in response to Message 37285.
BITLab Argo wrote: ...the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3...
computezrmle wrote: Which loadavg do you refer to? 1 min, 5 min, 15 min?
All three - they weren't identical, but all around an intermediate value, neither idle nor running flat out (and I looked quite a few times). This was for single-core tasks that were the only thing running on a two-core machine.
computezrmle wrote: This is different to CPU-time/wallclock.
I know these are measurements of different things, and the latter's poor efficiency values have caused lots of complaints on these boards. I'm puzzled because I would expect the obvious causes (e.g. slow payload downloads, limited availability of sub-tasks) to make the load flip between 0 and ~100% values. 10 minutes idle followed by 5 minutes flat out would give ~0.3 on the 15 min number, but it should be an exception to see it on the 1 min number at the same time. Hence my mentioning it: it's like - for me - the poor efficiency is because the code never runs flat out for some reason (have they requested a 1GHz Virtual CPU on a 3GHz machine? :-) ).
computezrmle wrote: LHCb's scientific app appears in the process list as a python script and should run at 100 % (or close to).
I can't remember seeing that - I thought it was all wrapped up inside the VirtualBox process. Watchdogs, Condor et al. shouldn't be using significant CPU for any length of time.
computezrmle wrote: Be aware that the 1st calculation phase of this script runs only for roughly 1 min. I often noticed jobs that dropped to 0 % after that phase for 1-1.5h until the jobs were cancelled and the VM requested the next one.
My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:
- BOINC client wrapper starts the VM
- something in the VM, I believe now Condor on all experiments, requests from the experiment to be allocated job(s)/sub-task(s)
- the experiment's data payload for each is pulled down within the VM
- the experiment's software - on CVMFS mounted within the VM - is pulled down (for the first job) and run over the payload
- the results file is uploaded
- Condor requests another job/sub-task
until the VM has been running for 10 hrs, after which no new jobs/sub-tasks are requested and the VM stops when the final running one is finished (or is killed after 18hrs by the wrapper).
Thus there is anyway a lot of potential CPU-idle time during the network transfers. Naively, it looks in your example like the sub-task starts but fails to access something, either the payload or the software (CVMFS), until Condor loses patience and gives up on it? While mine seem to start but then crawl rather than run.
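The load-average argument in this post can be checked numerically. Linux load averages are exponentially damped moving averages of the number of runnable processes, updated about every 5 seconds with time constants of 1, 5 and 15 minutes. The sketch below is a floating-point approximation of that scheme (the kernel uses fixed-point arithmetic), and it assumes a host that starts from a load of zero with exactly one process that is either fully busy or fully idle. Those starting conditions are assumptions for illustration, not measurements from the thread.

```python
import math

SAMPLE_SECONDS = 5  # approximate interval between load-average updates
WINDOWS = {"1 min": 60, "5 min": 300, "15 min": 900}  # time constants in seconds


def simulate_loadavg(pattern, start=0.0):
    """Apply exponential damping to a list of (runnable_processes, seconds) phases.

    Returns the three load averages at the end of the last phase.
    """
    loads = {name: start for name in WINDOWS}
    for runnable, seconds in pattern:
        for _ in range(seconds // SAMPLE_SECONDS):
            for name, tau in WINDOWS.items():
                decay = math.exp(-SAMPLE_SECONDS / tau)
                loads[name] = loads[name] * decay + runnable * (1.0 - decay)
    return loads


# 10 minutes idle followed by 5 minutes flat out on one core
result = simulate_loadavg([(0, 600), (1, 300)])
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the 15 min figure ends up near 0.28, while the 1 min figure is already close to 1 at the end of the busy phase. That supports the point made above: an idle/busy flip-flop should not show roughly 0.3 on all three numbers at once.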

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,901,775 RAC: 110,208	Message 37287 - Posted: 8 Nov 2018, 13:55:57 UTC - in response to Message 37286.
BITLab Argo wrote: My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:
- BOINC client wrapper starts the VM
- something in the VM, I believe now Condor on all experiments, requests from the experiment to be allocated job(s)/sub-task(s)
- the experiment's data payload for each is pulled down within the VM
- the experiment's software - on CVMFS mounted within the VM - is pulled down (for the first job) and run over the payload
- the results file is uploaded
- Condor requests another job/sub-task
until the VM has been running for 10 hrs, after which no new jobs/sub-tasks are requested and the VM stops when the final running one is finished (or is killed after 18hrs by the wrapper). Thus there is anyway a lot of potential CPU-idle time during the network transfers. Naively, it looks in your example like the sub-task starts but fails to access something, either the payload or the software (CVMFS), until Condor loses patience and gives up on it? While mine seem to start but then crawl rather than run.
That's a good summary. I have only a few minor comments:
1. The lifetime of a VM is usually 12h+ instead of 10h+.
2. Console 4 logs when a job starts and ends (although it doesn't show if it idles).
3. On my hosts each (good) job runs roughly 80 min and produces roughly 1MB/min of data.
4. VM setup and job setup pull lots of data via the internet. This can be supported by a proxy: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4611
5. Upload times of the job results depend on your upload bandwidth. In my case (10Mbit) it is typically a bit more than 1 min per job result.
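The numbers in points 1, 3 and 5 can be cross-checked with some back-of-the-envelope arithmetic. The sketch below is an idealised model, not the actual scheduling logic: it assumes every job takes exactly 80 minutes with no setup or transfer overhead, that "10Mbit" means 10 Mbit/s of usable upload bandwidth, and that 1 MB is 8 Mbit. All function and constant names are hypothetical.

```python
JOB_MINUTES = 80               # runtime of a "good" job (point 3)
OUTPUT_MB_PER_MINUTE = 1       # result data produced per minute of runtime (point 3)
UPLOAD_MBIT_PER_S = 10         # assumed usable upload bandwidth (point 5)
NO_NEW_JOBS_AFTER_MIN = 12 * 60  # VM stops requesting jobs after about 12 h (point 1)
WRAPPER_KILL_MIN = 18 * 60       # wrapper kills the VM after 18 h


def upload_seconds(job_minutes, mb_per_minute, mbit_per_s):
    """Time to upload one job result, ignoring protocol overhead."""
    size_mbit = job_minutes * mb_per_minute * 8
    return size_mbit / mbit_per_s


def jobs_per_vm(job_minutes, cutoff_min, kill_min, overhead_min=0.0):
    """Count jobs completed by one VM in the idealised workflow.

    A new job is requested only while the VM is younger than `cutoff_min`;
    a job that would run past `kill_min` is not counted.
    """
    elapsed = 0.0
    jobs = 0
    while elapsed < cutoff_min:
        elapsed += job_minutes + overhead_min
        if elapsed > kill_min:
            break
        jobs += 1
    return jobs


print(upload_seconds(JOB_MINUTES, OUTPUT_MB_PER_MINUTE, UPLOAD_MBIT_PER_S), "s per upload")
print(jobs_per_vm(JOB_MINUTES, NO_NEW_JOBS_AFTER_MIN, WRAPPER_KILL_MIN), "jobs per VM")
```

An 80 MB result at 10 Mbit/s takes about 64 seconds, which matches the "a bit more than 1 min" in point 5. In the same idealised model a healthy VM would complete about nine jobs in its lifetime, which gives a sense of how much work is lost when a VM idles until the 18 h limit instead.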