Message boards :
LHCb Application :
New version v1.05
Author | Message |
---|---|
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"now the work log shows me some output stuttering/text doubling. when the output is malicious the code is malicious..." The stuttering has been noted already. Since it doesn't seem to be a critical (show-stopping) error, I doubt fixing it is a high priority for them. It seems the severity of the stuttering increases with the number of cores/threads per task. Even more important, the project devs have advised us that 8 cores per task is very inefficient. It's a way to jam a lot of threads into limited RAM, but we sacrifice efficiency in doing so. Your 32 GB RAM is more than sufficient for 8 × 1-core tasks. You'll get far more work done and no stuttering, probably more credits too. The RAM calculation for LHCb is: 2048 MB for the first core plus 1300 MB for each additional core. 8 single-core LHCb tasks ==> 8 × 2048 MB = 16,384 MB (16 GB) |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,369,810 RAC: 25,478 |
I wrote a few days ago: "I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes." computezrmle wrote: "Sad to say, it's not really working." The problem still persists; I started an LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-( It would be great if someone looked into this. |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,369,810 RAC: 25,478 |
"the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-(" Here is the detailed report of the finished task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207354845 Total time 12 1/2 hours, CPU time 1 1/2 hours. |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,369,810 RAC: 25,478 |
"the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-( here the detailed report of the finished task:" Can anyone from LHC tell us whether the phenomenon described above is by design, or whether something is wrong with the LHCb tasks? Thanks. |
Send message Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0 |
"the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-(" My (single-core) LHCb jobs also have very poor efficiencies: 10-50%. A random job's log snippet: 2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000' 2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000' 2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000' It does look like the pilots are getting work from Condor, but I haven't poked around enough to work out whether the poor efficiency comes from waiting on subsequent downloads/uploads. |
Send message Joined: 6 Sep 08 Posts: 118 Credit: 12,560,503 RAC: 214 |
For the last few days, most of my LHCb tasks have failed with "no heartbeat file", some with "missing heartbeat". This happens on various hosts; Theory tasks are OK. Has the "faster startup" change been lost somehow, or is something else awry? Something has changed. |
Send message Joined: 15 Jun 08 Posts: 2531 Credit: 253,722,201 RAC: 34,439 |
In my experience, the CPU usage of LHC VBox WUs should be planned using a calculation factor of 1.3. Thus a 2-core CPU like the ones you use to run LHCb would be able to run a 1-core setup. Your logs show that you run them as a 2-core setup. To test whether your hosts are able to deal with a 2-core setup, you can monitor your host's "load average" using top or htop. The values should not be much higher than the number of CPU cores; otherwise the system is too busy. If your system is too busy, consider running your VBox tasks as a 1-core setup. |
Send message Joined: 6 Sep 08 Posts: 118 Credit: 12,560,503 RAC: 214 |
Many thanks for the comments. "... CPU usage of lhc VBox WUs should be planned using a calculation factor of 1.3." They will; it's just that they used to (and sometimes still do) run 2-core jobs without obvious problems. "To test if your hosts are able to deal with a 2-core setup you may monitor your host's 'load average' using top or htop. The values should not be much higher than the number of CPU cores. Otherwise it indicates a too busy system." I'm sure you're right. There are four hosts here that normally run LHCb (along with others); I stayed up late last night to check: 1. Running LHCb (2-core) and TNGrid jobs, load average 1.2 - 1.5. The LHCb task was running OK, although, as others have commented, the CPU time looks a bit low. 2. Running LHCb (single-core) and TNGrid, load average 1.2 - 1.3. 3. Running two TNGrid, load average 2.01 - 2.1. 4. Running TNGrid and Theory (single-core), load average 2.01. "If your system is too busy you may consider to run your VBox tasks as a 1-core setup." I don't know if these loads are too high, but they're running single-core now, so I'll see what happens. I expect that the loads depend on the exact nature of the job being processed at the time and that the hosts are just too close to their limit. It will be a few days before I know. |
Send message Joined: 15 Jun 08 Posts: 2531 Credit: 253,722,201 RAC: 34,439 |
LHCb is currently not good for benchmarking, as there are long idle periods that pull the load average down. Usually a 2-core LHCb VM provides 2 independent job slots, each of which requests a full core for 80-100 minutes on up-to-date CPUs. They should be idle only during the upload of intermediate results (50-90 MB each). Extended idle phases point to a problem, most likely within the CERN infrastructure. |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,369,810 RAC: 25,478 |
"Extended idle phases point out a problem - most likely within the CERN infrastructure." And I am surprised that it hasn't been fixed yet, although it has existed for quite a while and several postings here have pointed it out. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"Extended idle phases point out a problem - most likely within the CERN infrastructure." "and I am surprised that it hasn't been fixed yet, after it already exists for quite a while, and some postings here have pointed out this problem." They have the compute power and network bandwidth to generate and distribute many thousands of tasks to misconfigured hosts that haven't returned a valid result in months while reliable hosts starve for LHCb jobs. Some combination of laziness and incompetence, I guess. |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,369,810 RAC: 25,478 |
I am really surprised that no one at LHC has noticed yet, despite the postings here, that something has been going wrong with the LHCb tasks for quite a while. Still, CPU time is no more than 10-15% of the total runtime of a task. |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,369,810 RAC: 25,478 |
I just noticed another "particular" LHCb task: it has been running for more than 16 hours now, with a processor time of 41 minutes :-((( Is no one at LHC@home able to look into this? |
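The RAM rule quoted earlier in this thread (2048 MB for the first core of an LHCb task plus 1300 MB for each additional core) is easy to check with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and the two constants are the figures from the post, not values read from the project configuration.

```python
# Figures as quoted in the thread; treat them as assumptions, not project constants.
FIRST_CORE_MB = 2048
EXTRA_CORE_MB = 1300


def lhcb_task_ram_mb(cores: int) -> int:
    """RAM estimate for one LHCb task using the quoted rule."""
    if cores < 1:
        raise ValueError("a task needs at least one core")
    return FIRST_CORE_MB + EXTRA_CORE_MB * (cores - 1)


def total_ram_mb(tasks: int, cores_per_task: int) -> int:
    """RAM estimate for several identical tasks running side by side."""
    return tasks * lhcb_task_ram_mb(cores_per_task)


print(total_ram_mb(8, 1))  # eight single-core tasks: 16384
print(total_ram_mb(1, 8))  # one eight-core task: 11148
```

Under this rule, eight single-core tasks need roughly 16 GB while a single eight-core task needs about 11 GB, which is the trade-off described in the post: the multi-core setup saves RAM, and the single-core setup is said to be more efficient.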
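The "Status Report" lines posted earlier in this thread contain enough information to compute the CPU efficiency people are complaining about. The snippet below is a rough sketch: the regular expression only assumes the line layout visible in that one log excerpt, and the helper names are hypothetical.

```python
import re

# The three lines from the log excerpt posted in the thread.
LOG = """\
2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000'
2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000'
2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000'
"""

# Assumed layout: Status Report: <name>: '<seconds>'
STATUS_RE = re.compile(r"Status Report: ([A-Za-z ]+): '([0-9.]+)'")


def parse_status_report(text: str) -> dict:
    """Collect the numeric fields of the status report, in seconds."""
    return {name: float(value) for name, value in STATUS_RE.findall(text)}


def cpu_efficiency(report: dict) -> float:
    """CPU time divided by elapsed (wall-clock) time."""
    return report["CPU Time"] / report["Elapsed Time"]


report = parse_status_report(LOG)
print(f"CPU efficiency: {cpu_efficiency(report):.1%}")  # CPU efficiency: 15.9%
```

For the posted excerpt this gives about 16%, in line with the 10-50% range reported for single-core jobs.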
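The advice given earlier in this thread (plan VBox tasks with a factor of 1.3 and keep the load average close to the number of CPU cores) can be turned into a small check. This is a sketch under stated assumptions: reading the factor as "host cores divided by 1.3, rounded down" is one interpretation of the post, the 25% slack that stands in for "not much higher" is an arbitrary choice, and all function names are hypothetical. `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def max_vm_cores(host_cores: int, factor: float = 1.3) -> int:
    """Largest VM core count that fits if each VM core costs `factor` host cores.

    Assumption: this is how the factor of 1.3 from the thread is meant.
    """
    return int(host_cores / factor)


def is_too_busy(load_avg: float, host_cores: int, slack: float = 0.25) -> bool:
    """True if the load average is well above the core count.

    `slack` is an arbitrary stand-in for "not much higher".
    """
    return load_avg > host_cores * (1.0 + slack)


def one_minute_load():
    """1-minute load average, or None where the platform does not provide it."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


cores = os.cpu_count() or 1
load = one_minute_load()
print(f"host cores: {cores}, suggested VM cores: {max_vm_cores(cores)}")
if load is None:
    print("load average not available on this platform")
else:
    print(f"1-min load: {load:.2f}, too busy: {is_too_busy(load, cores)}")
```

With these assumptions a 2-core host gets `max_vm_cores(2) == 1`, matching the recommendation to run a 1-core setup there, and the load averages of about 2.0 reported for the 2-core hosts would not yet count as too busy. As noted in the thread, LHCb tasks with long idle phases pull the load average down, so a low reading during such a task says little.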
©2024 CERN