Message boards : LHCb Application : 207 (0x000000CF) EXIT_NO_SUB_TASKS

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 37169 - Posted: 2 Nov 2018, 14:33:06 UTC

Hi there!

Since a few minutes ago, all LHCb tasks have been failing with error code 207!

Regards, djoser.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 37169
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,337,609
RAC: 102,021
Message 37173 - Posted: 2 Nov 2018, 16:29:15 UTC - in response to Message 37169.  

For the past 6-7 hours there have been problems with the LHCb tasks.
First, none were available; then new ones could be downloaded, but they error out with

207 (0x000000CF) EXIT_NO_SUB_TASKS

So, there are tasks, but not jobs.

Some time ago, I read about a newly installed mechanism that would stop the creation of new tasks once there are no jobs.
Obviously, this does not work.
ID: 37173
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 37175 - Posted: 2 Nov 2018, 17:05:45 UTC

The statistics on
https://lhcathome.cern.ch/lhcathome/lhcb_job.php
don't look good either...

Could someone from the LHC team please look into that?
I hope they haven't left for the weekend yet.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 37175
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37176 - Posted: 2 Nov 2018, 17:16:04 UTC

This is actually a positive development. Instead of doing nothing for many hours, then exiting and awarding credit for nothing, leaving volunteers with the false impression that they did something useful, the tasks now error out and give no credit, which might eventually cause volunteers to disable LHCb, as they should have done months ago.
ID: 37176
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,337,609
RAC: 102,021
Message 37177 - Posted: 2 Nov 2018, 17:20:10 UTC - in response to Message 37175.  

Could someone from the LHC team please look into that?
Given that the LHCb jobs have been rather error-prone for several weeks now (for example, 14 hours of total runtime with only 45 minutes of CPU time, and so on) and no one at LHC has noticed, I'm afraid your request for someone from the team to look into the current problem will not be fulfilled :-(
ID: 37177
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 37268 - Posted: 7 Nov 2018, 17:27:21 UTC - in response to Message 37177.  

I'm afraid your request for someone from the team to look into the current problem will not be fulfilled :-(


Well, Erich, it's been a full four days now and guess what? You are right!
It's a pity...
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 37268
BITLab Argo

Joined: 16 Jul 05
Posts: 24
Credit: 35,251,537
RAC: 0
Message 37269 - Posted: 7 Nov 2018, 18:15:46 UTC - in response to Message 37268.  

Surely, if they run out of work because we've crunched it all, that's a good thing. It would be nice if someone said something, though!

What puzzles me is that the "Server status" page has
Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 2.05 (0.29 - 18.12)
which suggests that a few people must be getting tasks; otherwise, surely all the values would be around 0.3 h (20 min) by now?
ID: 37269
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,337,609
RAC: 102,021
Message 37270 - Posted: 7 Nov 2018, 19:39:25 UTC - in response to Message 37269.  

... which suggests that a few people must be getting tasks ...
Yes, once in a while new tasks are available, but there is not much point to them, because in most cases they error out for lack of jobs - 207 (0x000000CF) EXIT_NO_SUB_TASKS.

And if the tasks do get jobs, the overall CPU usage is usually at most about 5-10 % of the task's total runtime.
Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-(
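
As a quick check, the earlier example of 14 hours of runtime with 45 minutes of CPU time works out to about 5 % efficiency, right at the bottom of that range:

```python
# Quick check of the 14 h runtime / 45 min CPU time example quoted earlier.
cpu_hours = 45 / 60      # 45 minutes of CPU time
wall_hours = 14          # 14 hours of total (wallclock) runtime
print(f"efficiency: {cpu_hours / wall_hours:.1%}")   # -> efficiency: 5.4%
```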
ID: 37270
BITLab Argo

Joined: 16 Jul 05
Posts: 24
Credit: 35,251,537
RAC: 0
Message 37281 - Posted: 8 Nov 2018, 9:55:28 UTC - in response to Message 37270.  

Sorry, I got the lingo wrong. It should have been:

Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 2.05 (0.29 - 18.12)
... which suggests that a few people must be getting jobs/sub-tasks, else all LHCb tasks would fail after 20 minutes (207 (0x000000CF) EXIT_NO_SUB_TASKS)?

And today the page reports
Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 4.96 (0.37 - 18.12)
so the server sees more tasks running longer.
ID: 37281
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,890,945
RAC: 138,176
Message 37282 - Posted: 8 Nov 2018, 10:34:20 UTC - in response to Message 37281.  

It only shows that some tasks run for up to 18 h.
Then the watchdog terminates the VM.

Unfortunately, LHCb VMs do not always notice that they are doing nothing.

The job activity page shows that there has not been a single job for more than 24 h:
https://lhcathome.cern.ch/lhcathome/lhcb_job.php


BTW:
Don't expect an explanation from the LHCb project team.
I have asked for one a couple of times since mid-2017 and never got a response.
ID: 37282
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 37283 - Posted: 8 Nov 2018, 10:38:42 UTC - in response to Message 37282.  

The LHCb tasks have been stopped for now.
ID: 37283
BITLab Argo

Joined: 16 Jul 05
Posts: 24
Credit: 35,251,537
RAC: 0
Message 37284 - Posted: 8 Nov 2018, 11:09:19 UTC - in response to Message 37270.  

And if the tasks do get jobs, the overall CPU usage is usually at most about 5-10 % of the task's total runtime.
Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-(


I was seeing slightly better numbers, but the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3, which was broadly in line with the reported task efficiency (CPU time / wallclock run time); that seems too little CPU for actual number-crunching, but far too much for a VM that is just idling or transferring data.
ID: 37284
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,890,945
RAC: 138,176
Message 37285 - Posted: 8 Nov 2018, 11:52:27 UTC - in response to Message 37284.  

And if the tasks do get jobs, the overall CPU usage is usually at most about 5-10 % of the task's total runtime.
Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-(


I was seeing slightly better numbers, but the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3, which was broadly in line with the reported task efficiency (CPU time / wallclock run time); that seems too little CPU for actual number-crunching, but far too much for a VM that is just idling or transferring data.

Which loadavg do you refer to?
1 min, 5 min, 15 min?

This is different to CPU-time/wallclock.


LHCb's scientific app appears in the process list as a python script and should run at 100 % (or close to).

Be aware that the 1st calculation phase of this script runs only for roughly 1 min.
I often noticed jobs that dropped to 0 % after that phase for 1-1.5h until the jobs were cancelled and the VM requested the next one.
In this case the remaining CPU load is caused by other processes like watchdogs or the condor daemon.
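
For anyone who wants to see this for themselves from a console inside the VM, here is a rough sketch of the idea (it assumes psutil is available there, which it may not be; all names and numbers are illustrative only):

```python
# Hypothetical watcher: print the CPU share of python processes inside the VM.
# Assumes psutil is installed, which the VM image may not provide.
import os
import time
import psutil

def watch(match="python", rounds=10, pause=30):
    for _ in range(rounds):
        for p in psutil.process_iter(["name", "cmdline"]):
            if p.pid == os.getpid():
                continue                                  # skip this watcher
            cmd = " ".join(p.info["cmdline"] or [])
            if match in (p.info["name"] or "") or match in cmd:
                try:
                    cpu = p.cpu_percent(interval=1.0)     # sample over 1 s
                    print(f"{time.strftime('%H:%M:%S')} pid={p.pid} cpu={cpu:.0f}%")
                except psutil.NoSuchProcess:
                    pass
        time.sleep(pause)

# watch()  # a healthy job should show close to 100 %, a stuck one close to 0 %
```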


Laurence wrote:
The LHCb tasks have been stopped for now.

Best idea of the day.
ID: 37285
BITLab Argo

Joined: 16 Jul 05
Posts: 24
Credit: 35,251,537
RAC: 0
Message 37286 - Posted: 8 Nov 2018, 13:16:36 UTC - in response to Message 37285.  

...the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3...
Which loadavg do you refer to?
1 min, 5 min, 15 min?
All three - they weren't identical, but all hovered around an intermediate value, neither idle nor flat out (and I looked quite a few times). This was for single-core tasks that were the only thing running on a two-core machine.

This is different to CPU-time/wallclock.
I know these are measurements of different things, and the latter's poor efficiency values have caused lots of complaints on these boards. I'm puzzled because I would expect the obvious causes (e.g. slow payload downloads, limited availability of sub-tasks) to make the load flip between ~0 and ~100%. Ten minutes idle followed by five minutes flat out would give roughly 0.3 on the 15-min number, but it should be the exception to see it on the 1-min number at the same time.
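
As a rough illustration (a sketch only, assuming the usual exponentially damped load averages with 5-second sampling), that kind of on/off duty cycle behaves like this:

```python
# Rough sketch: exponentially damped load averages for a single-core workload
# that idles for 10 minutes, then runs flat out for 5. The 5-second sampling
# interval and decay factors are assumptions for illustration only.
import math

SAMPLE = 5                                        # seconds between samples
PERIODS = {"1min": 60, "5min": 300, "15min": 900}
DECAY = {k: math.exp(-SAMPLE / p) for k, p in PERIODS.items()}

load = {k: 0.0 for k in PERIODS}
history = []
for t in range(0, 6 * 3600, SAMPLE):              # simulate 6 hours
    runnable = 1 if t % 900 >= 600 else 0         # 10 min idle, 5 min busy
    for k, d in DECAY.items():
        load[k] = load[k] * d + runnable * (1 - d)
    history.append(dict(load))

for k in PERIODS:                                 # look at the final 15-minute cycle
    vals = [h[k] for h in history[-180:]]
    print(f"{k}: min {min(vals):.2f}  max {max(vals):.2f}  "
          f"mean {sum(vals) / len(vals):.2f}")
# Expected: the 15-min value oscillates around the duty cycle (~0.33), while
# the 1-min value swings between roughly 0 and 1 - quite unlike the steady
# ~0.3 loadavg reported for the LHCb VM.
```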

Hence my mentioning it: it looks to me as if the poor efficiency is because the code never runs flat out for some reason (have they requested a 1 GHz virtual CPU on a 3 GHz machine? :-) ).

LHCb's scientific app appears in the process list as a python script and should run at 100 % (or close to).
I can't remember seeing that - I thought it was all wrapped up inside the VirtualBox process. Watchdogs, Condor et al. shouldn't be using significant CPU for any length of time.

Be aware that the 1st calculation phase of this script runs only for roughly 1 min.
I often noticed jobs that dropped to 0 % after that phase for 1-1.5h until the jobs were cancelled and the VM requested the next one.

My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:

  • BOINC client wrapper starts the VM
  • something in the VM (I believe now Condor on all experiments) requests the allocation of job(s)/sub-task(s) from the experiment
  • the experiment's data payload for each is pulled down within the VM
  • the experiment's software - on CVMFS mounted within the VM - is pulled down (for the first job) and run over the payload
  • the results file is uploaded
  • Condor requests another job/sub-task

until the VM has been running for 10 hrs, after which no new jobs/sub-tasks are requested and the VM stops when the final running one is finished (or is killed after 18hrs by the wrapper).
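
In rough code form, that loop might look something like the sketch below; every name and number in it is an illustrative placeholder rather than the real Condor/CVMFS/BOINC interface:

```python
# Illustrative sketch of the VM-side loop described above; names and per-phase
# timings are placeholders, not the actual Condor/CVMFS interfaces.

JOB_REQUEST_WINDOW_H = 10     # no new jobs/sub-tasks requested after 10 h
VM_HARD_LIMIT_H = 18          # the wrapper kills the VM after 18 h regardless

# Assumed per-job phases (hours): network-bound payload/software download,
# CPU-bound crunching, network-bound result upload.
DOWNLOAD_H, CRUNCH_H, UPLOAD_H = 0.10, 1.20, 0.03


def simulate_vm(jobs_available=True):
    wall_h = cpu_h = 0.0
    while wall_h < JOB_REQUEST_WINDOW_H:
        if not jobs_available:
            # No job/sub-task allocated: this is what surfaces on the BOINC
            # side as 207 (0x000000CF) EXIT_NO_SUB_TASKS.
            break
        wall_h += DOWNLOAD_H + CRUNCH_H + UPLOAD_H   # one job/sub-task cycle
        cpu_h += CRUNCH_H                            # only crunching uses CPU
    return min(wall_h, VM_HARD_LIMIT_H), cpu_h


wall, cpu = simulate_vm()
print(f"wallclock {wall:.1f} h, CPU {cpu:.1f} h, efficiency {cpu / wall:.0%}")
# With healthy jobs the CPU-heavy phase dominates (efficiency near 90 %); the
# 5-10 % figures reported above imply the VM spends most of its wallclock time
# outside that phase.
```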

So there is in any case a lot of potential CPU-idle time during the network transfers. Naively, it looks in your example as if the sub-task starts but fails to access something, either the payload or the software (CVMFS), until Condor loses patience and gives up on it? Mine, meanwhile, seem to start but then crawl rather than run.


ID: 37286
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,890,945
RAC: 138,176
Message 37287 - Posted: 8 Nov 2018, 13:55:57 UTC - in response to Message 37286.  

My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:

  • BOINC client wrapper starts the VM
  • something in the VM (I believe now Condor on all experiments) requests the allocation of job(s)/sub-task(s) from the experiment
  • the experiment's data payload for each is pulled down within the VM
  • the experiment's software - on CVMFS mounted within the VM - is pulled down (for the first job) and run over the payload
  • the results file is uploaded
  • Condor requests another job/sub-task

until the VM has been running for 10 hrs, after which no new jobs/sub-tasks are requested and the VM stops when the final running one is finished (or is killed after 18hrs by the wrapper).

So there is in any case a lot of potential CPU-idle time during the network transfers. Naively, it looks in your example as if the sub-task starts but fails to access something, either the payload or the software (CVMFS), until Condor loses patience and gives up on it? Mine, meanwhile, seem to start but then crawl rather than run.


That's a good summary.

I have only a few minor comments:
1. The lifetime of a VM is usually 12h+ instead of 10h+
2. Console 4 logs when a job starts and ends (although it doesn't show if it idles)
3. On my hosts each (good) job runs roughly 80 min and produces roughly 1MB/min data.
4. VM setup and job setup pull lots of data over the internet. This can be helped by a proxy:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4611
5. Upload times of the job results depend on your upload bandwidth.
In my case (10 Mbit) it is typically a bit more than 1 min per job result (see the quick check below).
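
A back-of-the-envelope check, using the roughly 1 MB/min and 80 min per job from point 3 and a 10 Mbit/s uplink:

```python
# Back-of-the-envelope check of points 3 and 5 above (numbers from the post).
data_rate_mb_per_min = 1        # ~1 MB/min produced by a good job
job_runtime_min = 80            # ~80 min per good job
uplink_mbit_per_s = 10          # 10 Mbit/s upload bandwidth

result_mb = data_rate_mb_per_min * job_runtime_min     # ~80 MB per result
uplink_mb_per_s = uplink_mbit_per_s / 8                # 1.25 MB/s
upload_s = result_mb / uplink_mb_per_s
print(f"~{result_mb} MB per result, ~{upload_s:.0f} s to upload")
# -> ~80 MB per result, ~64 s to upload, i.e. a bit more than 1 min/job result.
```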
ID: 37287
