Message boards : LHCb Application : Odd runtime recording

Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,168,451
RAC: 16,096
Message 36012 - Posted: 23 Jul 2018, 18:08:47 UTC

One of my tasks recorded odd run times. Here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=202498608
The actual run time was about 13 hours, but Boinc recorded about 33 hours of CPU time and 25 hours of elapsed time. I use BoincTasks as a Boinc Manager replacement, and its History shows an elapsed time of 13:05 hours and a CPU time of 33:01. While the task was running, CPU time increased about 3-4 times as fast as elapsed time. Only one CPU core was active for this task, as it should be, so why the inflated CPU time?

Two other LHCb tasks that were running at the same time recorded normal times of about 12 hours. The credit received is about twice what the other two tasks received (about the same as I was receiving a few weeks back for 12-hour tasks when I started to run LHCb on this host).
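
A quick way to sanity-check numbers like these is to compare CPU time with elapsed time: for a task that keeps a single core busy, the ratio would be expected to stay close to or below 1. The sketch below does that with the two BoincTasks values quoted above. The helper names are made up for illustration and are not part of BOINC or BoincTasks.

```python
def to_seconds(hhmm: str) -> int:
    """Convert an 'HH:MM' duration string to seconds."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def cpu_to_elapsed_ratio(cpu: str, elapsed: str) -> float:
    """Return CPU time divided by elapsed (wall-clock) time."""
    return to_seconds(cpu) / to_seconds(elapsed)


# Values reported in the BoincTasks History for this task.
ratio = cpu_to_elapsed_ratio("33:01", "13:05")
print(f"CPU/elapsed ratio: {ratio:.2f}")

# Assumption: a task limited to one core should stay near or below 1.0,
# so a clearly larger value hints at an accounting problem.
if ratio > 1.0:
    print("CPU time exceeds elapsed time for a single-core task")
```

A ratio well above 1 for a single-core task is what makes the recorded times look wrong rather than merely slow.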
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,867
RAC: 136,830
Message 36015 - Posted: 23 Jul 2018, 19:02:56 UTC - in response to Message 36012.  

There are lots of blank lines in the log and messages like:
2018-07-23 19:19:51 (3808): Powering off VM.
2018-07-23 19:24:52 (3808): VM did not power off when requested.
2018-07-23 19:24:52 (3808): VM was successfully terminated.

This points to a computer that is struggling very hard with its load.
Very long runtimes usually indicate a (temporary) overload or priority problems.
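
The timestamps in those three lines show how long the wrapper waited between the power-off request and the forced termination. A minimal sketch that measures the gap, assuming the line layout seen in this excerpt (a 19-character timestamp followed by a process ID in parentheses); the function name is hypothetical:

```python
from datetime import datetime

LOG = """\
2018-07-23 19:19:51 (3808): Powering off VM.

2018-07-23 19:24:52 (3808): VM did not power off when requested.
2018-07-23 19:24:52 (3808): VM was successfully terminated.
"""


def parse_line(line: str) -> tuple:
    """Split a stderr line into (timestamp, message text)."""
    stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    message = line.split("): ", 1)[1]
    return stamp, message


# Skip the blank lines that also show up in the real logs.
events = [parse_line(line) for line in LOG.splitlines() if line.strip()]

requested = next(t for t, m in events if m == "Powering off VM.")
gave_up = next(t for t, m in events if m.startswith("VM did not power off"))

wait_seconds = (gave_up - requested).total_seconds()
print(f"Waited {wait_seconds:.0f} s before forcing termination")
```

For the excerpt above the gap is about five minutes, which looks more like a fixed timeout than a random delay, although the thread does not confirm that.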

Your CPU has 8 cores and it has to feed 2 GPUs.
Thus it should be able to run up to 5 native tasks or 4 VBox tasks concurrently.
The numbers can change over time depending on the currently running project mix and calculation phases.

You may want to reduce the number of concurrently running tasks to find the best value for your system.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,168,451
RAC: 16,096
Message 36019 - Posted: 23 Jul 2018, 22:08:57 UTC - in response to Message 36015.  

I don't think that the computer is struggling at all: CPU usage is a steady 70-80% and disk reads and writes are at a low level. The actual runtime of the task was a normal 13 hours; Boinc just calculated and recorded it wrong. The task was using only one CPU core, and the CPU time shown by Windows Task Manager was about 12 hours. But I'll reduce the number of concurrent tasks just to see if I can get rid of the empty lines and VM power-off failures.

PS. I looked through several Windows hosts that have been running LHCb tasks and all of them show the empty lines in stderr and VM power off failures. So maybe that is just a Windows application thing?
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 36024 - Posted: 24 Jul 2018, 5:13:54 UTC - in response to Message 36015.  

computezrmle wrote:
There are lots of blank lines in the log and messages like:
2018-07-23 19:19:51 (3808): Powering off VM.
2018-07-23 19:24:52 (3808): VM did not power off when requested.
2018-07-23 19:24:52 (3808): VM was successfully terminated.

The notice "VM did not power off when requested" seems generally to be an odd one.
I see it in all my stderr logs for LHCb, regardless of which of my machines it is.
No idea why this shows up every time.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,867
RAC: 136,830
Message 36025 - Posted: 24 Jul 2018, 6:08:03 UTC

Harri Liljeroos wrote:
... So maybe that is just a Windows application thing ...

The OS inside the VM is the same for all hosts: Linux.
I have also seen blank lines on a Linux host in the past whenever the host was overcommitted.

My guess:
There are some system messages that are filtered, but not completely.


Erich56 wrote:
The notice "VM did not power off when requested" seems generally to be an odd one.
I see it in all my stderr logs for LHCb, regardless of which of my machines it is.
No idea why this shows up every time.

My guess:
This is an early indicator of an upcoming overload.
As I don't run a Windows host I can't test it, but you may want to run a reduced number of tasks for a while to see if the message changes.

Harri Liljeroos wrote:
... the CPU usage is steady 70-80% ...

That should be OK, but "CPU usage" doesn't seem to be the right value to monitor.
On Linux you would be able to use "load average".
For example, my Threadripper system (32 cores) runs at full load with 25-26 concurrent CPU tasks from different projects, although nearly all disk I/O is configured to run directly in RAM.
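
For anyone who wants to look at load average rather than CPU usage, here is a minimal Python sketch for Linux. It relies on `os.getloadavg()` and `os.cpu_count()` from the standard library; the helper `load_per_core` is a made-up name, and treating a per-core value near or above 1.0 as a sign of overcommitment is a common rule of thumb, not something established in this thread.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average normalised by the number of logical cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


# os.getloadavg() exists on Unix-like systems only, not on Windows.
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
    except OSError:
        print("load average not available on this system")
    else:
        cores = os.cpu_count() or 1
        per_core = load_per_core(one_min, cores)
        print(f"1-min load {one_min:.2f} on {cores} cores -> {per_core:.2f} per core")
```

On a Windows host such as Harri's this prints nothing, since the load average call is not available there.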

Another indicator could be the temperature curve of your GPU.
At least on my hosts I notice immediate temperature drops whenever the system is too busy to keep the GPU at full load.
The drops disappear when I reduce the number of running CPU tasks.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,168,451
RAC: 16,096
Message 36051 - Posted: 25 Jul 2018, 21:21:43 UTC

OK, I ran a test in which I reduced the load on the host in question. Originally Boinc was allowed to use 6 of the 8 CPU cores, with 2 cores used to feed the 2 GPUs running Seti and/or Einstein tasks (one task per GPU). The remaining 4 CPU cores were running 3 LHCb tasks and one sixtrack or CPDN task. First I reduced the available CPU cores to 5, which left 2 GPU tasks and 3 CPU tasks running. The stderr of a finished LHCb task did not change: it still showed empty lines, and the VM was not powered off when requested but was terminated successfully after 5 minutes. Then I paused the GPU calculations and the CPDN project and set LHC to not request new work. Finally a single LHCb task was the only load on the machine, here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=202801465.

The result and stderr of that task are like the previous ones, so the load on the computer has nothing to do with the empty lines in the stderr or the VM power-off problems.
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,192,791
RAC: 103,819
Message 36052 - Posted: 26 Jul 2018, 4:17:25 UTC

Hello Harri,
your computer is running very well. I have taken a look at your computer list.
The only download errors are from the time when everyone had them.

As for me, I run mostly Atlas only and use all CPUs without problems.
I have Boinc 7.12.1 with VirtualBox 5.2.16.
The blank lines come from the program, I think. Why should they come from problems?
Yes, you are right!
Let it crunch :-))
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,867
RAC: 136,830
Message 36053 - Posted: 26 Jul 2018, 5:36:31 UTC

Well, Harri's tests show that it's not the number of concurrently running LHC VMs that causes the blank lines or "VM did not power off ...".
On the other hand, there are lots of computers - Linux and Windows - that do not show that behaviour.

To me it still looks like a local problem, but as long as the computer delivers good results it may not be worth spending more time on investigation.
Just follow maeax's suggestion and let it crunch.
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,504,188
RAC: 3,842
Message 36054 - Posted: 26 Jul 2018, 5:45:50 UTC - in response to Message 36053.  

I agree with Axel and Stefan
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,168,451
RAC: 16,096
Message 36056 - Posted: 26 Jul 2018, 8:46:00 UTC - in response to Message 36053.  

computezrmle wrote:
Well, Harri's tests show that it's not the number of concurrently running LHC VMs that causes the blank lines or "VM did not power off ...".
On the other hand, there are lots of computers - Linux and Windows - that do not show that behaviour.

To me it still looks like a local problem, but as long as the computer delivers good results it may not be worth spending more time on investigation.
Just follow maeax's suggestion and let it crunch.


Yep, that's my plan. The heatwave we have been suffering here in Finland and the other Nordic countries requires me to reduce crunching to night time only. But I will always let the LHCb tasks finish before suspending Boinc.
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,504,188
RAC: 3,842
Message 36065 - Posted: 26 Jul 2018, 16:14:05 UTC - in response to Message 36056.  
Last modified: 26 Jul 2018, 16:16:22 UTC



Harri Liljeroos wrote:
Yep, that's my plan. The heatwave we have been suffering here in Finland and the other Nordic countries requires me to reduce crunching to night time only. But I will always let the LHCb tasks finish before suspending Boinc.


It is the hottest time of year here too, Harri.

I only have one GPU card running this year, but I have its fan turned up so it runs cool enough, and a big fan at the window blows across the 8 desktop PCs in this upstairs room (the hottest part of my house).

CPUs only run too hot when you have a bad CPU fan or dust. The only time in all my years that I had a CPU get too hot was when the fan died while I wasn't home, so I lost that one. But that was way back when we had single cores, a 13 GB HD and less than 1 GB of RAM on Windows 98SE.

When it is 90 degrees outside, the air from that window fan doesn't feel very cool when I go up there, but when I check the CPU and GPU temps they are in the proper range. My three newer 8-core PCs never get a hot CPU, so the newer CPUs and basic coolers are better than the earlier ones.

The best thing is that after the Sun goes down they get cold air until about noon the next day.