New version v1.05

Author	Message
bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36924 - Posted: 29 Sep 2018, 16:48:15 UTC - in response to Message 36922. Last modified: 29 Sep 2018, 16:49:17 UTC "now the work log shows me some output stuttering/text doubling. when the output is malicious the code is malicious..." The stuttering has been noted already. Since it doesn't seem to be a critical (show-stopping) error, I doubt fixing it is a high priority for them. It seems the severity of the stuttering increases with the number of cores/threads per task. Even more important, we have advice from the project devs that 8 cores per task is very inefficient. It's a way to jam a lot of threads into limited RAM, but we sacrifice efficiency in doing so. Your 32 GB RAM is more than sufficient for 8 x 1-core tasks. You'll get far more work done and no stuttering, probably more credits too. The RAM calculation for LHCb is: 2048 MB for the first thread plus 1300 MB for each additional core. 8 single-core LHCb tasks ==> 8 x 2048 MB = 16,384 MB (16 GB)
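
The RAM rule quoted in the post above can be checked with a few lines of Python. This is only a sketch of the arithmetic as stated in the post (2048 MB for the first thread, 1300 MB per additional core); the function and constant names are made up for illustration and are not part of BOINC or any LHC@home tool.

```python
# Rule quoted in the post above: 2048 MB for the first thread,
# plus 1300 MB for each additional core in the same task.
FIRST_THREAD_MB = 2048
EXTRA_CORE_MB = 1300


def task_ram_mb(cores_per_task: int) -> int:
    """RAM for one LHCb task with the given number of cores (hypothetical helper)."""
    if cores_per_task < 1:
        raise ValueError("a task needs at least one core")
    return FIRST_THREAD_MB + EXTRA_CORE_MB * (cores_per_task - 1)


def total_ram_mb(tasks: int, cores_per_task: int) -> int:
    """RAM for several identical tasks running side by side."""
    return tasks * task_ram_mb(cores_per_task)


if __name__ == "__main__":
    print(total_ram_mb(8, 1))  # eight single-core tasks
    print(total_ram_mb(1, 8))  # one eight-core task
```

Under that rule, eight single-core tasks need more RAM than one eight-core task, which is the trade-off the post describes: the multi-core setup saves memory at the cost of efficiency.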

Erich56 Joined: 18 Dec 15 Posts: 1923 Credit: 149,459,673 RAC: 143,551	Message 36939 - Posted: 2 Oct 2018, 16:33:36 UTC - in response to Message 36838. I wrote a few days ago: "I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes. Is this by design, or are these tasks faulty?" computezrmle wrote: "Sad to say, it's not really working. The jobs inside the VM are missing some data from or a connection to a backend system at CERN. Thus they stand by until a timeout shuts them down and the VM requests another job. I set my systems to NNT." The problem still persists; I started an LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-( Would be great if someone looked into this.

Erich56 Joined: 18 Dec 15 Posts: 1923 Credit: 149,459,673 RAC: 143,551	Message 36942 - Posted: 3 Oct 2018, 4:45:14 UTC - in response to Message 36939. "the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-( Would be great if someone looked into this." Here is the detailed report of the finished task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207354845 Total time 12 1/2 hours, CPU time 1 1/2 hours.

Erich56 Joined: 18 Dec 15 Posts: 1923 Credit: 149,459,673 RAC: 143,551	Message 36979 - Posted: 8 Oct 2018, 19:32:39 UTC - in response to Message 36942. "the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-( Would be great if someone looked into this." "here the detailed report of the finished task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207354845 total time 12 1/2 hours, CPU time 1 1/2 hours." Can anyone from LHC tell us whether the phenomenon described above is by design, or whether something is wrong with the LHCb tasks? Thanks.

BITLab Argo Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0	Message 36983 - Posted: 9 Oct 2018, 12:04:02 UTC - in response to Message 36942. Last modified: 9 Oct 2018, 12:07:38 UTC "the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-( Would be great if someone looked into this." My (single-core) LHCb jobs also have very poor efficiencies: 10 - 50%. A random job's log snippet:
2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000'
2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000'
2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000'
It does look like the pilots are getting work from Condor, but I haven't poked around enough to work out whether the poor efficiency comes from waiting on subsequent downloads/uploads.
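
For anyone who wants to compute the efficiency figure from their own logs, here is a rough sketch. It assumes the "Status Report" lines always look like the three in the snippet above; the regular expression and function names are illustrative only and do not come from any BOINC or vboxwrapper tool.

```python
import re

# Matches lines such as: Status Report: CPU Time: '954.280000'
# (format assumed from the log snippet in the post above)
STATUS_RE = re.compile(r"Status Report: (?P<key>[A-Za-z ]+): '(?P<value>[0-9.]+)'")


def parse_status(log_text: str) -> dict:
    """Collect the latest value of each Status Report field."""
    fields = {}
    for match in STATUS_RE.finditer(log_text):
        fields[match.group("key")] = float(match.group("value"))
    return fields


def cpu_efficiency(log_text: str) -> float:
    """CPU time divided by elapsed time, as a fraction."""
    fields = parse_status(log_text)
    return fields["CPU Time"] / fields["Elapsed Time"]


SNIPPET = """\
2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000'
2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000'
2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000'
"""

if __name__ == "__main__":
    print(f"{cpu_efficiency(SNIPPET):.1%}")
```

For the snippet above this works out to roughly 16% (954.28 s of CPU time over 6000 s elapsed), which sits inside the 10 - 50% range mentioned in the post.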

m Joined: 6 Sep 08 Posts: 119 Credit: 14,641,719 RAC: 17,631	Message 36984 - Posted: 9 Oct 2018, 13:22:35 UTC Last modified: 9 Oct 2018, 13:43:34 UTC For the last few days, most of my LHCb tasks have failed with "no heartbeat file", some with "missing heartbeat". This happens on various hosts; Theory tasks are OK. Has the "faster startup" change been lost somehow, or is something else awry? Something has changed.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2710 Credit: 292,033,908 RAC: 145,348	Message 36985 - Posted: 9 Oct 2018, 14:06:09 UTC - in response to Message 36984. In my experience, the CPU usage of LHC VBox WUs should be planned using a calculation factor of 1.3. Thus a 2-core CPU like the ones you use to run LHCb would be able to run a 1-core setup. Your logs show that you run them as a 2-core setup. To test whether your hosts are able to deal with a 2-core setup, you may monitor your host's "load average" using top or htop. The values should not be much higher than the number of CPU cores; otherwise they indicate a system that is too busy. If your system is too busy, consider running your VBox tasks as a 1-core setup.
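
The planning rule and the load-average check in the post above can be sketched in Python. Two things here are assumptions rather than anything stated in the thread: the factor of 1.3 is read as "each VM core needs about 1.3 host cores", and the 25% tolerance used for "not much higher" is an arbitrary placeholder. The function names are hypothetical.

```python
import math
import os

FACTOR = 1.3  # planning factor quoted in the post above


def max_vbox_cores(host_cores: int, factor: float = FACTOR) -> int:
    """Largest VM core count the host can carry, assuming each VM core
    costs `factor` host cores (an interpretation of the post)."""
    return math.floor(host_cores / factor)


def is_too_busy(load_average: float, host_cores: int, tolerance: float = 0.25) -> bool:
    """True when the load average is well above the core count.
    The tolerance is an assumed threshold, not a value from the thread."""
    return load_average > host_cores * (1 + tolerance)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # os.getloadavg() is available on Unix-like systems only.
    one_min, five_min, fifteen_min = os.getloadavg()
    print(f"host cores: {cores}, suggested VM cores: {max_vbox_cores(cores)}")
    print(f"5-minute load average: {five_min:.2f}, too busy: {is_too_busy(five_min, cores)}")
```

With these assumptions a 2-core host comes out at a 1-core VM setup, matching the advice in the post. The 5-minute value is used because it is less jumpy than the 1-minute figure that top and htop show first.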

m Joined: 6 Sep 08 Posts: 119 Credit: 14,641,719 RAC: 17,631	Message 36998 - Posted: 10 Oct 2018, 11:38:43 UTC - in response to Message 36985. Last modified: 10 Oct 2018, 11:46:39 UTC Many thanks for the comments. "... CPU usage of lhc VBox WUs should be planned using a calculation factor of 1.3. Thus a 2 core CPU like the ones you use to run LHCb would be able to run a 1-core setup." They will, it's just that they used to (and sometimes still do) run 2-core jobs without obvious problems. "To test if your hosts are able to deal with a 2-core setup you may monitor your host's "load average" using top or htop. The values should not be much higher than the number of CPU cores. Otherwise it indicates a too busy system." I'm sure you're right. There are four hosts here that normally run LHCb (along with others). I stayed up late last night to check:
1. Running LHCb (2-core) and TNGrid jobs, load average 1.2 - 1.5. The LHCb was running OK, although, as others have commented, the CPU time looks a bit low.
2. Running LHCb (single-core) and TNGrid, load average 1.2 - 1.3.
3. Running two TNGrid, load average 2.01 - 2.1.
4. Running TNGrid and Theory (single-core), load average 2.01.
"If your system is too busy you may consider to run your VBox tasks as a 1-core setup." I don't know if these loads are too high, but they're running single-core now, so I'll see what happens. I expect that the loads depend on the exact nature of the job being processed at the time and the hosts are just too close to their limit. It will be a few days before I know.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2710 Credit: 292,033,908 RAC: 145,348	Message 36999 - Posted: 10 Oct 2018, 12:52:11 UTC - in response to Message 36998. LHCb is ATM not good for benchmarking, as there are long idle periods that pull the load average down. Usually a 2-core LHCb VM provides 2 independent job slots, each of which requests a full core for 80-100 minutes on up-to-date CPUs. They should be idle only during the upload of intermediate results (50-90 MB each). Extended idle phases point out a problem - most likely within the CERN infrastructure.

Erich56 Joined: 18 Dec 15 Posts: 1923 Credit: 149,459,673 RAC: 143,551	Message 37000 - Posted: 10 Oct 2018, 16:08:03 UTC - in response to Message 36999. "Extended idle phases point out a problem - most likely within the CERN infrastructure." And I am surprised that it hasn't been fixed yet, given that it has existed for quite a while and some postings here have pointed out this problem.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 37001 - Posted: 10 Oct 2018, 16:36:59 UTC - in response to Message 37000. "Extended idle phases point out a problem - most likely within the CERN infrastructure." "and I am surprised that it hasn't been fixed yet, after it already exists for quite a while, and some postings here have pointed out this problem." They have the compute power and network bandwidth to generate and distribute many thousands of tasks to misconfigured hosts that haven't returned a valid result in months, while reliable hosts starve for LHCb jobs. Some combination of laziness and incompetence, I guess.

Erich56 Joined: 18 Dec 15 Posts: 1923 Credit: 149,459,673 RAC: 143,551	Message 37029 - Posted: 15 Oct 2018, 6:47:46 UTC I am really surprised that no one at LHC has noticed yet, despite the postings here, that something has been going wrong with the LHCb tasks for quite a while. Still, CPU time is not more than 10-15% of the total runtime of a task.

Erich56 Joined: 18 Dec 15 Posts: 1923 Credit: 149,459,673 RAC: 143,551	Message 37059 - Posted: 18 Oct 2018, 16:53:43 UTC - in response to Message 37029. I just noticed another "particular" LHCb task: it has been running for more than 16 hours now, with a processor time of 41 minutes :-((( Is no one at LHC@home able to look into this?

LHC@home