Message boards : LHCb Application : New version v1.05

bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36924 - Posted: 29 Sep 2018, 16:48:15 UTC - in response to Message 36922.  
Last modified: 29 Sep 2018, 16:49:17 UTC

now the work log shows me some output stuttering/text doubling. when the output is malicious the code is malicious...

The stuttering has been noted already. Since it doesn't seem to be a critical (show stopping) error I doubt fixing it is a high priority for them.
It seems the severity of the stuttering increases with the number of cores/threads per task. Even more important, we have advice from the project devs that tells us 8 cores per task is very inefficient. It's a way to jam a lot of threads into limited RAM but we sacrifice efficiency in doing so.

Your 32 GB RAM is more than sufficient for 8 X 1 core tasks. You'll get far more work done and no stuttering, probably more credits too.
The RAM calculation for LHCb is:
2048 MB for the first thread plus 1300 MB for each additional core.
8 single core LHCb ==> 16,384 MB (16 GB)
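
The RAM rule above can be written out as a small sketch. The function and constant names are made up for illustration; only the 2048 MB and 1300 MB figures come from the rule quoted in this post.

```python
# Hypothetical helper names; the two constants are the figures quoted above.
FIRST_CORE_MB = 2048       # RAM for the first thread of a task
ADDITIONAL_CORE_MB = 1300  # RAM for each additional core of the same task


def lhcb_task_ram_mb(cores: int) -> int:
    """RAM in MB for one LHCb task configured with `cores` cores."""
    if cores < 1:
        raise ValueError("a task needs at least one core")
    return FIRST_CORE_MB + ADDITIONAL_CORE_MB * (cores - 1)


def total_ram_mb(tasks: int, cores_per_task: int) -> int:
    """RAM in MB for `tasks` identical LHCb tasks running side by side."""
    return tasks * lhcb_task_ram_mb(cores_per_task)


print(total_ram_mb(8, 1))  # eight single-core tasks
print(total_ram_mb(1, 8))  # one eight-core task
```

Eight single-core tasks come to 16,384 MB, while one 8-core task needs 11,148 MB. That is the trade-off described above: the multi-core setup saves RAM but, per the project devs' advice, costs efficiency.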
ID: 36924

Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,369,810
RAC: 25,478
Message 36939 - Posted: 2 Oct 2018, 16:33:36 UTC - in response to Message 36838.  

I wrote a few days ago:
I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes.
Is this by design, or are these tasks faulty?

computezrmle wrote:
Sad to say, it's not really working.
The jobs inside the VM are missing some data from or a connection to a backend system at CERN.
Thus they stand by until a timeout shuts them down and the VM requests another job.
I set my systems to NNT.

the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-(
Would be great if someone looked into this.
ID: 36939

Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,369,810
RAC: 25,478
Message 36942 - Posted: 3 Oct 2018, 4:45:14 UTC - in response to Message 36939.  

the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-(
Would be great if someone looked into this.
Here is the detailed report of the finished task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=207354845
total time 12 1/2 hours, CPU time 1 1/2 hours.
ID: 36942

Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,369,810
RAC: 25,478
Message 36979 - Posted: 8 Oct 2018, 19:32:39 UTC - in response to Message 36942.  

the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-(
Would be great if someone looked into this.
Here is the detailed report of the finished task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=207354845
total time 12 1/2 hours, CPU time 1 1/2 hours.

Can anyone from LHC tell us whether the phenomenon described above is by design, or whether something is wrong with the LHCb tasks?
Thanks.
ID: 36979

BITLab Argo
Joined: 16 Jul 05
Posts: 24
Credit: 35,251,537
RAC: 0
Message 36983 - Posted: 9 Oct 2018, 12:04:02 UTC - in response to Message 36942.  
Last modified: 9 Oct 2018, 12:07:38 UTC

the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-(
Would be great if someone looked into this.


My (single-core) LHCb jobs also have very poor efficiencies: 10 - 50%.

Random job's log snippet:
2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000'
2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000'
2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000'

It does look like the pilots are getting work from Condor, but I haven't poked around enough to work out if the poor efficiency comes from waiting on subsequent downloads/uploads.
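
For anyone who wants to check their own logs, here is a minimal sketch of how the efficiency can be worked out from the three status lines above. The regular expression and function names are my own; the log format is assumed to match the snippet exactly.

```python
import re

# The three status lines from the log snippet above.
LOG = """\
2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000'
2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000'
2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000'
"""

# Assumed pattern: "Status Report: <key>: '<number>'"
STATUS_RE = re.compile(r"Status Report: (?P<key>[^:]+): '(?P<value>[0-9.]+)'")


def parse_status(log_text: str) -> dict:
    """Collect the numeric values of all status report lines, keyed by name."""
    values = {}
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            values[match.group("key")] = float(match.group("value"))
    return values


def cpu_efficiency(status: dict, cores: int = 1) -> float:
    """CPU time divided by the elapsed time available on `cores` cores."""
    return status["CPU Time"] / (status["Elapsed Time"] * cores)


status = parse_status(LOG)
print(f"CPU efficiency: {cpu_efficiency(status):.1%}")
```

For this snippet that is 954.28 s of CPU time over 6000 s elapsed, so roughly 16%.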
ID: 36983

m
Joined: 6 Sep 08
Posts: 118
Credit: 12,560,503
RAC: 214
Message 36984 - Posted: 9 Oct 2018, 13:22:35 UTC
Last modified: 9 Oct 2018, 13:43:34 UTC

For the last few days, most of my LHCb tasks have failed with "no heartbeat file", some with "missing heartbeat".
Various hosts, Theory tasks are OK. Has the "faster startup" change been lost somehow, or is something else awry? Something has changed.
ID: 36984

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2531
Credit: 253,722,201
RAC: 34,439
Message 36985 - Posted: 9 Oct 2018, 14:06:09 UTC - in response to Message 36984.  

In my experience, the CPU usage of LHC VBox WUs should be planned using a calculation factor of 1.3.
Thus a 2 core CPU like the ones you use to run LHCb would be able to run a 1-core setup.

Your logs show that you run them as a 2-core setup.
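
A rough sketch of that planning rule, under the assumption that the factor means "budget 1.3 host cores for every core given to the VM". The function name is hypothetical, and other BOINC tasks running at the same time are ignored.

```python
import math

# Planning factor from this post; the interpretation (host cores needed per
# configured VM core) is an assumption.
CPU_PLANNING_FACTOR = 1.3


def max_vm_cores(host_cores: int, factor: float = CPU_PLANNING_FACTOR) -> int:
    """Largest VM core count whose planned CPU usage fits on the host."""
    return math.floor(host_cores / factor)


for host_cores in (2, 4, 8):
    print(host_cores, "host cores ->", max_vm_cores(host_cores), "VM cores")
```

With 2 host cores this gives 1 VM core, because a 2-core setup would be planned at 2.6 cores.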

To test if your hosts are able to deal with a 2-core setup you may monitor your host's "load average" using top or htop.
The values should not be much higher than the number of CPU cores. Otherwise it indicates a too busy system.

If your system is too busy you may consider running your VBox tasks as a 1-core setup.
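
If you prefer a script over watching top or htop, the same check could look roughly like this. The 25% tolerance is a made-up threshold, since "not much higher" is not a precise number. `os.getloadavg()` exists only on Unix-like systems.

```python
import os


def is_too_busy(load_average: float, cpu_cores: int, tolerance: float = 0.25) -> bool:
    """True if the load average is clearly above the number of CPU cores.

    `tolerance` is a hypothetical allowance for "not much higher".
    """
    return load_average > cpu_cores * (1 + tolerance)


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    # The 15 minute value smooths out short spikes.
    verdict = "too busy" if is_too_busy(fifteen_min, cores) else "looks OK"
    print(f"15 min load average {fifteen_min:.2f} on {cores} cores: {verdict}")
```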
ID: 36985

m
Joined: 6 Sep 08
Posts: 118
Credit: 12,560,503
RAC: 214
Message 36998 - Posted: 10 Oct 2018, 11:38:43 UTC - in response to Message 36985.  
Last modified: 10 Oct 2018, 11:46:39 UTC

Many thanks for the comments.

... CPU usage of lhc VBox WUs should be planned using a calculation factor of 1.3.
Thus a 2 core CPU like the ones you use to run LHCb would be able to run a 1-core setup.

They will, it's just that they used (still do sometimes) to run 2 core jobs without obvious problems.
To test if your hosts are able to deal with a 2-core setup you may monitor your host's "load average" using top or htop. The values should not be much higher than the number of CPU cores. Otherwise it indicates a too busy system.

I'm sure you're right.
There are four hosts here that normally run LHCb (along with others). I stayed up late last night to check:
1. Running LHCb (2 core) and TNGrid jobs, load average 1.2 - 1.5. The LHCb was running OK, although, as others have commented, the CPU time looks a bit low.
2. Running LHCb (single core) and TNGrid, load average 1.2 - 1.3.
3. Running two TNGrid, load average 2.01 - 2.1.
4. Running TNGrid and Theory (single core), load average 2.01.
If your system is too busy you may consider running your VBox tasks as a 1-core setup.

I don't know if these loads are too high, but they're running single core now so I'll see what happens.
I expect that the loads depend on the exact nature of the job being processed at the time and the hosts are just too close to their limit. It will be a few days before I know.
ID: 36998

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2531
Credit: 253,722,201
RAC: 34,439
Message 36999 - Posted: 10 Oct 2018, 12:52:11 UTC - in response to Message 36998.  

LHCb is ATM not good for benchmarking as there are long idle periods that pull the load average down.
Usually a 2-core LHCb VM provides 2 independent job slots, each of which requests a full core for 80-100 minutes on up-to-date CPUs.
They should be idle only during the upload of intermediate results (50-90 MB each).
Extended idle phases point out a problem - most likely within the CERN infrastructure.
ID: 36999

Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,369,810
RAC: 25,478
Message 37000 - Posted: 10 Oct 2018, 16:08:03 UTC - in response to Message 36999.  

Extended idle phases point out a problem - most likely within the CERN infrastructure.
And I am surprised that it hasn't been fixed yet, given that it has existed for quite a while and several postings here have pointed it out.
ID: 37000

bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37001 - Posted: 10 Oct 2018, 16:36:59 UTC - in response to Message 37000.  

Extended idle phases point out a problem - most likely within the CERN infrastructure.
And I am surprised that it hasn't been fixed yet, given that it has existed for quite a while and several postings here have pointed it out.
They have the compute power and network bandwidth to generate and distribute many thousands of tasks to misconfigured hosts that haven't returned a valid result in months while reliable hosts starve for LHCb jobs. Some combination of laziness and incompetence, I guess.
ID: 37001

Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,369,810
RAC: 25,478
Message 37029 - Posted: 15 Oct 2018, 6:47:46 UTC

I am really surprised that no-one at LHC has noticed yet, despite the postings here, that something has been going wrong with the LHCb tasks for quite a while.
CPU time is still no more than 10-15% of the total runtime of a task.
ID: 37029

Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,369,810
RAC: 25,478
Message 37059 - Posted: 18 Oct 2018, 16:53:43 UTC - in response to Message 37029.  

I just noticed another "particular" LHCb task: it has been running for more than 16 hours now, with a processor time of 41 minutes :-(((

No-one at LHC@home able to look into this?
ID: 37059