Thread 'Odd runtime recording'

Author	Message
Harri Liljeroos Joined: 28 Sep 04 Posts: 795 Credit: 64,542,521 RAC: 31,364	Message 36012 - Posted: 23 Jul 2018, 18:08:47 UTC One of my tasks recorded odd run times. Here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=202498608 The actual run time was about 13 hours, but Boinc has recorded about 33 hours for CPU time and 25 hours for elapsed time. I use BoincTasks as a Boinc Manager replacement, and its History shows an elapsed time of 13:05 hours and a CPU time of 33:01. While the task was running, the CPU time increased at about 3-4 times the rate of the elapsed time. Only one CPU core was active for this task, as it should be, so why the increased CPU time? Two other LHCb tasks that were running at the same time recorded normal times of about 12 hours. The received credit is about twice what the other two tasks received (about the same as what I was receiving a few weeks back for 12-hour tasks when I started to run LHCb on this host). ID: 36012
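As a rough illustration of why these numbers look odd, the sketch below computes the ratio of CPU time to elapsed time from the two BoincTasks History values quoted in the post (13:05 elapsed, 33:01 CPU). The helper names `parse_hm` and `cpu_to_elapsed_ratio` are made up for this example and are not part of Boinc or BoincTasks. For a task that uses a single core, one would expect the ratio to stay at or below roughly 1.

```python
from datetime import timedelta


def parse_hm(text: str) -> timedelta:
    """Parse an 'HH:MM' duration string such as '13:05' (hypothetical helper)."""
    hours, minutes = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def cpu_to_elapsed_ratio(cpu: timedelta, elapsed: timedelta) -> float:
    """Return CPU time divided by elapsed (wall-clock) time."""
    return cpu / elapsed


# Values taken from the post: BoincTasks History for the odd task.
elapsed = parse_hm("13:05")
cpu = parse_hm("33:01")

ratio = cpu_to_elapsed_ratio(cpu, elapsed)
print(f"CPU/elapsed ratio: {ratio:.2f}")
```

A ratio well above 1 on a single-core task suggests the time accounting itself is off, which matches the suspicion in the post that the times were recorded wrongly.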

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,274,558 RAC: 24,542	Message 36015 - Posted: 23 Jul 2018, 19:02:56 UTC - in response to Message 36012. There are lots of blank lines in the log and messages like: [pre]2018-07-23 19:19:51 (3808): Powering off VM. 2018-07-23 19:24:52 (3808): VM did not power off when requested. 2018-07-23 19:24:52 (3808): VM was successfully terminated.[/pre] This points to a computer that is struggling very hard with its load. Very long runtimes usually indicate a (temporary) overload or priority problems. Your CPU has 8 cores and it has to feed 2 GPUs. Thus it should be able to run up to 5 native tasks or 4 VBox tasks concurrently. These numbers can change over time, depending on the currently running project mix and calculation phases. You may consider reducing the number of concurrently running tasks to find the best value for your host. ID: 36015
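To make the quoted log lines easier to reason about, here is a minimal sketch that measures the gap between "Powering off VM." and "VM did not power off when requested." in a stderr excerpt. The three log lines are copied from the post. The functions `parse_line` and `power_off_delay` are hypothetical helpers written for this example, and the line layout (timestamp, process ID in parentheses, message) is assumed from those three lines only.

```python
from datetime import datetime

# Log excerpt copied from the post above.
LOG = """\
2018-07-23 19:19:51 (3808): Powering off VM.
2018-07-23 19:24:52 (3808): VM did not power off when requested.
2018-07-23 19:24:52 (3808): VM was successfully terminated.
"""


def parse_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS (pid): message' into (timestamp, message).

    Returns None for lines that do not match this assumed layout.
    """
    stamp, sep, rest = line.partition(" (")
    if not sep:
        return None
    try:
        when = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    _, _, message = rest.partition("): ")
    return when, message.strip()


def power_off_delay(log_text):
    """Return the time between the power-off request and the failure notice."""
    requested = None
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip the blank lines discussed in this thread
        parsed = parse_line(line)
        if parsed is None:
            continue  # ignore anything that is not a timestamped message
        when, message = parsed
        if message == "Powering off VM.":
            requested = when
        elif message == "VM did not power off when requested." and requested is not None:
            return when - requested
    return None


print(power_off_delay(LOG))
```

For the excerpt above the gap comes out at just over five minutes. Whether that reflects a fixed timeout or a slow shutdown cannot be told from these three lines alone.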

Harri Liljeroos Joined: 28 Sep 04 Posts: 795 Credit: 64,542,521 RAC: 31,364	Message 36019 - Posted: 23 Jul 2018, 22:08:57 UTC - in response to Message 36015. I don't think that the computer is struggling at all: the CPU usage is a steady 70-80% and the disk reads and writes are at a low level. The actual runtime of the task was a normal 13 hours; Boinc just calculated and recorded it wrong. That particular task was only using one CPU core, and the CPU time shown by Windows Task Manager was about 12 hours. But I'll reduce the number of concurrent tasks just to see if I can get rid of the empty lines and VM power off failures. PS. I looked through several Windows hosts that have been running LHCb tasks and all of them show the empty lines in stderr and VM power off failures. So maybe that is just a Windows application thing? ID: 36019

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,811,880 RAC: 52,076	Message 36024 - Posted: 24 Jul 2018, 5:13:54 UTC - in response to Message 36015. [quote]There are lots of blank lines in the log and messages like: [pre]2018-07-23 19:19:51 (3808): Powering off VM. 2018-07-23 19:24:52 (3808): VM did not power off when requested. 2018-07-23 19:24:52 (3808): VM was successfully terminated.[/pre][/quote] The notice "VM did not power off when requested" generally seems to be an odd one. I see it in all my stderrs for LHCb, regardless of which of my machines they come from. No idea why this shows up every time. ID: 36024

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,274,558 RAC: 24,542	Message 36025 - Posted: 24 Jul 2018, 6:08:03 UTC Harri Liljeroos wrote: ... So maybe that is just a Windows application thing ... The OS inside the VM is the same for all hosts: Linux. In the past I also saw blank lines on a Linux host whenever the host was overcommitted. My guess: there are some system messages that are filtered, but not completely. Erich56 wrote: the notice "VM did not power off when requested" seems generally to be an odd one. I see it in all my stderr's for LHCb, regardless of on which of my machines. No idea why this shows up everytime. My guess: this is an early indicator of an upcoming overload. As I don't run a Windows host I can't test it, but you may consider running a reduced number of tasks for a while to see if the message changes. Harri Liljeroos wrote: ... the CPU usage is steady 70-80% ... Should be OK, but "CPU usage" doesn't seem to be the right value for monitoring. On Linux you would be able to use "load average". By that measure, my Threadripper system (32 cores) already runs at full load with 25-26 concurrent CPU tasks from different projects, although nearly all disk I/O is configured to run directly in RAM. Another indicator could be the temperature curve of your GPU. At least on my hosts I notice immediate temperature drops whenever the system is too busy to keep the GPU at full load. The drops disappear when I reduce the number of running CPU tasks. ID: 36025
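For readers who want to try the "load average" suggestion, the following is a minimal sketch that reads the load average on a Unix-like host and relates it to the number of logical cores. `os.getloadavg()` and `os.cpu_count()` are standard Python calls. The helpers `load_per_core` and `describe`, and the threshold of 1.0 per core, are assumptions made for this example (a common rule of thumb, not a BOINC recommendation). `os.getloadavg()` is not available on Windows, so the sketch guards for that.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Normalise a load average by the number of logical cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def describe(load_average: float, cores: int) -> str:
    """Classify the host using an assumed threshold of 1.0 per core."""
    if load_per_core(load_average, cores) > 1.0:
        return "overcommitted"
    return "within capacity"


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    if hasattr(os, "getloadavg"):
        try:
            one_min, five_min, fifteen_min = os.getloadavg()
        except OSError:
            print("load average could not be read on this host")
        else:
            # The 5-minute value smooths out short spikes.
            print(f"{five_min:.2f} on {cores} cores: {describe(five_min, cores)}")
    else:
        print("os.getloadavg() is not available on this platform")
```

Load average counts runnable processes as well as running ones, and on Linux also processes waiting on I/O, so it can show contention that a plain "CPU usage" percentage hides.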

Harri Liljeroos Joined: 28 Sep 04 Posts: 795 Credit: 64,542,521 RAC: 31,364	Message 36051 - Posted: 25 Jul 2018, 21:21:43 UTC OK, I made a test where I reduced the load on the host I am talking about. Originally Boinc was allowed to use 6 out of 8 CPU cores, of which 2 cores were used to feed the 2 GPUs running Seti and/or Einstein tasks (one task per GPU). The remaining 4 CPU cores were running 3 LHCb tasks and one sixtrack or CPDN task. First I reduced the available CPU cores to 5, which left 2 GPU tasks and 3 CPU tasks running. The stderr of a finished LHCb task did not change: it still showed empty lines, and the VM was not powered off when requested but was terminated successfully after 5 minutes. Then I paused the GPU calculations and the CPDN project and set LHC to not request new work. In the end a single LHCb task was the only load on the machine, here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=202801465. The result and stderr of that task are like the previous ones, so the load on the computer has nothing to do with the empty lines in the stderr or the VM power off problems. ID: 36051

maeax Joined: 2 May 07 Posts: 2285 Credit: 178,823,324 RAC: 773	Message 36052 - Posted: 26 Jul 2018, 4:17:25 UTC Hello Harri, your computer is running very well. I have taken a look at your computer list. There are only download errors from the time when everyone had them. I run mostly Atlas only and use all CPUs without problems. I have Boinc 7.12.1 with Virtualbox 5.2.16. The blank lines come from the program, I think. Why should they come from problems? Yes, you are right! Let it crunch :-)) ID: 36052

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,274,558 RAC: 24,542	Message 36053 - Posted: 26 Jul 2018, 5:36:31 UTC Well, Harri's tests show that it's not the number of concurrently running LHC VMs that causes the blank lines or "VM did not power off ...". On the other hand, there are lots of computers, Linux and Windows, that do not show that behaviour. To me it still looks like a local problem, but as long as the computer delivers good results it may not be worth spending more time on investigation. Just follow maeax's suggestion and let it crunch. ID: 36053

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1276 Credit: 94,930,515 RAC: 52,568	Message 36054 - Posted: 26 Jul 2018, 5:45:50 UTC - in response to Message 36053. I agree with Axel and Stefan. ID: 36054

Harri Liljeroos Joined: 28 Sep 04 Posts: 795 Credit: 64,542,521 RAC: 31,364	Message 36056 - Posted: 26 Jul 2018, 8:46:00 UTC - in response to Message 36053. computezrmle wrote: Well, Harri's tests show that it's not the number of concurrently running LHC VMs that cause blank lines or "VM did not power off ...". On the other side there are lots of computers - linux and windows - that do not show that behaviour. To me it still looks like a local problem but as long as the computer delivers good results it may not be worth to spend more time for investigation. Just follow maeax's suggestion and let it crunch. Yep, that's my plan. The heatwave we have been suffering here in Finland and the other Nordic countries requires me to reduce the crunching to night time only. But I will always let the LHCb tasks finish before suspending Boinc. ID: 36056

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1276 Credit: 94,930,515 RAC: 52,568	Message 36065 - Posted: 26 Jul 2018, 16:14:05 UTC - in response to Message 36056. Last modified: 26 Jul 2018, 16:16:22 UTC Harri Liljeroos wrote: Yep, that's my plan. The heatwave we have been suffering here in Finland and other Nordic countries requires me to reduce the crunching to night time only. But I will let the LHCb tasks always to finish before suspending Boinc. It is my hottest time of year here too, Harri. I only have one GPU card running this year, but I have its fan turned up so it runs cool enough, and a big fan at the window blows across the 8 desktop PCs in this upstairs room (the hottest part of my house). CPUs only run too hot when you have a bad CPU fan or dust. The only time in all my years that I had a CPU get too hot was when the fan died and I wasn't home, so I lost that one, but this was way back when we had single cores, a 13GB HD and less than one GB of RAM on Windows 98SE. When it is 90 degrees outside, that window fan isn't feeling very cool when I go up there, but when I check the CPU and GPU temps they are in the proper range, and my three newer 8-core PCs never get a hot CPU, so the newer CPUs and basic coolers are better than the ones before. The best thing is that after the sun goes down they get cold air until about noon the next day. ID: 36065