207 (0x000000CF) EXIT_NO_SUB_TASKS
Author | Message |
---|---|
Send message Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
Hi there! For the past few minutes, all LHCb tasks have been failing with error code 207! Regards, djoser. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,501 RAC: 21,202 |
For the past 6-7 hours, there have been problems with the LHCb tasks. First, none were available; then new ones could be downloaded, but they error out with 207 (0x000000CF) EXIT_NO_SUB_TASKS. So there are tasks, but no jobs. Some time ago, I read about a newly installed mechanism that would stop the creation of new tasks once there are no jobs. Obviously, this does not work. |
Send message Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
The statistics on https://lhcathome.cern.ch/lhcathome/lhcb_job.php don't look good either... Could someone from LHC-Team please look into that? I hope they haven't left for the weekend yet. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
This is actually a positive development. Instead of doing nothing for many hours, then exiting, awarding credits for nothing and leaving volunteers with the false idea that they did something useful, the tasks now error out and give no credits, which might eventually cause volunteers to disable LHCb like they should have done months ago. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,501 RAC: 21,202 |
"Could someone from LHC-Team please look into that?" Given the fact that the LHCb jobs have been rather erroneous for several weeks now (for example: 14 hours total runtime with 45 minutes CPU time, and so on) and no one at LHC noticed that, I'm afraid that your request for someone from the Team to look into the current problem will not be fulfilled :-( |
Send message Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
"I'm afraid that your request for someone from the Team to look into the current problem will not be fulfilled :-(" Well, Erich, it's been a full 4 days now and guess what? You are right! It's a pity... Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Send message Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0 |
Surely, if they run out of work because we've crunched it all, that's a good thing. Would be nice if someone said something though! What puzzles me is that the "Server status" page has "Runtime of last 100 tasks in hours: average, min, max", which suggests that a few people must be getting tasks, else surely all the values would be around 0.3h (20min.) by now? |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,941,501 RAC: 21,202 |
"... which suggests that a few people must be getting tasks ..." Yes, once in a while, new tasks are available; but they don't make much sense, because in most cases they error out because of not getting jobs - 207 (0x000000CF) EXIT_NO_SUB_TASKS. And if the tasks get jobs, overall CPU usage is usually at most about 5-10 % of the total runtime of the task. Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-( |
Send message Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0 |
Sorry, I got the lingo wrong. It should have been: "Runtime of last 100 tasks in hours: average, min, max" ... which suggests that a few people must be getting jobs/sub-tasks, else all LHCb tasks would fail after 20 minutes (207 (0x000000CF) EXIT_NO_SUB_TASKS)? And today the page reports "Runtime of last 100 tasks in hours: average, min, max", so the server sees more tasks running longer. |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
It only shows that some tasks run up to 18h. Then the watchdog terminates the VM. Unfortunately LHCb VMs do not always notice that they do nothing. The job activity page shows that there was not a single job for more than 24h: https://lhcathome.cern.ch/lhcathome/lhcb_job.php BTW: Don't expect an explanation from the LHCb project team. I have asked for that a couple of times since mid-2017 and never got a response. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
The LHCb tasks have been stopped for now. |
Send message Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0 |
"And if the tasks get jobs, overall CPU usage usually is max. about 5-10 % out of the total runtime of the task." I was seeing slightly better, but the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3, which was broadly in line with the reported task efficiency (CPU-time/wallclock-run-time); that seems too little CPU for actual number-crunching, but way too much for a VM idling or transferring data. |
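The task efficiency mentioned above is simply CPU time divided by wall-clock run time. As a minimal sketch (the function name `task_efficiency` is made up for illustration and is not part of BOINC or the LHCb software), here is that ratio applied to the figures quoted earlier in the thread, 45 minutes of CPU time in a 14-hour run:

```python
def task_efficiency(cpu_seconds: float, wallclock_seconds: float) -> float:
    """Return CPU time as a fraction of wall-clock run time."""
    if wallclock_seconds <= 0:
        raise ValueError("wall-clock time must be positive")
    return cpu_seconds / wallclock_seconds


# Figures quoted earlier in the thread: 45 min CPU time, 14 h total runtime.
efficiency = task_efficiency(45 * 60, 14 * 3600)
print(f"{efficiency:.1%}")  # prints 5.4%
```

That lands at the low end of the 5-10 % range reported in the thread.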
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
"An if the tasks get jobs, overall CPU usage usually is max. about 5-10 % out of the total runtime of the task." Which loadavg do you refer to? 1 min, 5 min, 15 min? This is different to CPU-time/wallclock. LHCb's scientific app appears in the process list as a python script and should run at 100 % (or close to). Be aware that the 1st calculation phase of this script runs only for roughly 1 min. I often noticed jobs that dropped to 0 % after that phase for 1-1.5h until the jobs were cancelled and the VM requested the next one. In this case the remaining CPU load is caused by other processes like watchdogs or the condor daemon. Laurence wrote: "The LHCb tasks have been stopped for now." Best idea of the day. |
Send message Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0 |
"...the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3..." "Which loadavg do you refer to?" All three - they weren't identical, but all were around an intermediate value, neither idle nor running flat out (and I looked quite a few times). This was for single-core tasks that were the only thing running on a two-core machine. "This is different to CPU-time/wallclock." I know these are measurements of different things, and the latter's poor efficiency values have caused lots of complaints on these boards. I'm puzzled because I would expect the obvious causes (e.g. slow payload downloads, limited availability of sub-tasks) to make the load flip between 0 and ~100% values. 10 minutes idle followed by 5 minutes flat out would give ~.3 on the 15min number, but it should be an exception to see it on the 1 min number at the same time. Hence my mentioning it: it's like - for me - the poor efficiency is because the code never runs flat out for some reason (have they requested a 1GHz Virtual CPU on a 3GHz machine? :-) ). "LHCb's scientific app appears in the process list as a python script and should run at 100 % (or close to)." I can't remember seeing that - I thought it was all wrapped up inside the VirtualBox process. Watchdogs, Condor et al. shouldn't be using significant CPU for any length of time. "Be aware that the 1st calculation phase of this script runs only for roughly 1 min." My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:
until the VM has been running for 10 hrs, after which no new jobs/sub-tasks are requested and the VM stops when the final running one is finished (or is killed after 18hrs by the wrapper). |
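The load-average reasoning in the post above (10 minutes idle followed by 5 minutes flat out) can be checked with a small simulation. This is only a sketch under stated assumptions: it models the load averages the way Linux is generally described as doing it, as exponentially damped averages updated about every 5 seconds, with a single process that is either runnable or not. The function `simulate_loadavg` and its parameters are invented for illustration and are not taken from the LHCb VM.

```python
import math

SAMPLE_SECONDS = 5  # assumed update interval of the load averages


def simulate_loadavg(idle_min: int = 10, busy_min: int = 5, cycles: int = 20):
    """Repeat an idle/busy pattern and return (min, max) of each load
    average (1, 5 and 15 minutes) observed during the final cycle."""
    load = {1: 0.0, 5: 0.0, 15: 0.0}
    # Per-sample decay factor of each exponentially damped average.
    decay = {w: math.exp(-SAMPLE_SECONDS / (w * 60)) for w in load}
    seen = {w: [] for w in load}
    steps_per_cycle = (idle_min + busy_min) * 60 // SAMPLE_SECONDS
    idle_steps = idle_min * 60 // SAMPLE_SECONDS

    for cycle in range(cycles):
        for step in range(steps_per_cycle):
            runnable = 0 if step < idle_steps else 1  # idle first, then busy
            for w in load:
                load[w] = load[w] * decay[w] + runnable * (1 - decay[w])
                if cycle == cycles - 1:  # record only the settled last cycle
                    seen[w].append(load[w])
    return {w: (min(v), max(v)) for w, v in seen.items()}


for window, (low, high) in simulate_loadavg().items():
    print(f"{window:>2} min loadavg: {low:.2f} .. {high:.2f}")
```

In this model the 15-minute figure stays in a band roughly between 0.2 and 0.5, while the 1-minute figure swings between nearly 0 and nearly 1. That matches the poster's point: an on/off pattern would not produce a steady ~0.3 on all three numbers at once.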
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
"My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:" That's a good summary. I have only a few minor comments: 1. The lifetime of a VM is usually 12h+ instead of 10h+. 2. Console 4 logs when a job starts and ends (although it doesn't show if it idles). 3. On my hosts each (good) job runs roughly 80 min and produces roughly 1MB/min of data. 4. VM setup and job setup pull lots of data via the internet. This can be supported by a proxy: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4611 5. Upload times of the job results depend on your upload bandwidth. In my case (10Mbit), typically a bit more than 1 min per job result. |
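The numbers in points 3 and 5 above fit together, as a rough calculation shows. The sketch below assumes that "10Mbit" means 10 Mbit/s, that 1 MB is 8 Mbit, and that the upload uses the whole link with no protocol overhead; the function name `upload_seconds` is made up for illustration.

```python
def upload_seconds(job_minutes: float, mb_per_minute: float,
                   uplink_mbit_per_s: float) -> float:
    """Estimate the upload time of one job result in seconds."""
    result_megabytes = job_minutes * mb_per_minute  # size of the result
    result_megabits = result_megabytes * 8          # 1 byte = 8 bits
    return result_megabits / uplink_mbit_per_s


# 80 min job, ~1 MB/min of output, 10 Mbit/s uplink (figures from the post).
print(upload_seconds(80, 1.0, 10.0))  # prints 64.0
```

About 64 seconds for an 80 MB result is in line with the reported "a bit more than 1 min" per job result.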
©2024 CERN