Message boards : Number crunching : Computation error: "Hearbeat file missing"

Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35191 - Posted: 8 May 2018, 17:50:28 UTC

Today, I reactivated a Vista notebook for LHC crunching.
VBox is 5.1.34 (together with the extension pack).

Due to limited RAM (3 GB), I chose to run Theory Simulation (which works well on a netbook that also has 3 GB RAM).

However, every Theory task fails after about 10 minutes ("computation error"),
stderr says:
VM Heartbeat file specified, but missing.
VM Heartbeat file specified, but missing file system status. (errno = '2')

the complete data for such a task can be seen here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=189848819

Could one of the experts please tell me what the problem is?
ID: 35191 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,916,955
RAC: 138,132
Message 35192 - Posted: 8 May 2018, 18:12:39 UTC - in response to Message 35191.  

Your log contains the following line:
"00:00:02.314000 SUPR0QueryVTCaps -> VERR_SVM_DISABLED"
Did you enable (if possible) VT-x/AMD-v?
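
For anyone who wants to check a saved log for this line without reading it by eye, here is a minimal Python sketch. It only searches for the `SUPR0QueryVTCaps -> ...` pattern quoted above and returns whatever status follows the arrow. The function name `vt_caps_status` is made up for illustration, and the sketch assumes the log line has the same shape as the one in this post.

```python
import re
from typing import Optional

# Matches lines such as:
#   00:00:02.314000 SUPR0QueryVTCaps -> VERR_SVM_DISABLED
VT_CAPS_RE = re.compile(r"SUPR0QueryVTCaps\s*->\s*(\w+)")

def vt_caps_status(log_text: str) -> Optional[str]:
    """Return the status printed after 'SUPR0QueryVTCaps ->', or None if absent."""
    match = VT_CAPS_RE.search(log_text)
    return match.group(1) if match else None

sample = "00:00:02.314000 SUPR0QueryVTCaps -> VERR_SVM_DISABLED"
status = vt_caps_status(sample)
if status == "VERR_SVM_DISABLED":
    # Interpretation taken from this thread: SVM (AMD-V) is switched off in the BIOS.
    print("AMD-V (SVM) appears to be disabled")
```

If the function returns None, the log simply does not contain that line, which by itself says nothing about whether virtualization is enabled.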
ID: 35192 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35193 - Posted: 8 May 2018, 18:43:49 UTC - in response to Message 35192.  
Last modified: 8 May 2018, 18:49:16 UTC

computezrmle wrote:
Did you enable (if possible) VT-x/AMD-v?
That's what I tried to do in the first place (after having checked the CPU specs as to whether virtualization is possible at all with the AMD Turion X2 Ultra Dual-Core).
However, strangely enough, there was no way to see or access any CPU settings in the BIOS. The only thing the BIOS allowed was to perform a RAM test and a hard disk test. Really strange; I have never seen such a limited BIOS before (maybe there is something wrong with the BIOS).

Then I remembered that when I tried an LHC VM task on another PC some time ago, the task failed after 1 or 2 seconds, since virtualization was NOT switched on.
So my guess was that, since the task here started properly and ran for about 10 minutes, virtualization would be switched on by default (anything else would not make sense, with the BIOS not making it possible to switch virtualization on or off).

However, if you are sure that the log entry "00:00:02.314000 SUPR0QueryVTCaps -> VERR_SVM_DISABLED" means that the virtualization is not switched on, then I have a problem :-(

EDIT: in the Windows Task Manager, VBoxHeadless.exe runs with 50% CPU usage (my BOINC settings are that way) for the first 10 minutes.
How could this happen if there was no virtualization?
What I suspect is that maybe the Windows Firewall is causing problems. What I did was to add all the VBox exe files as exceptions, but this did not help :-(
ID: 35193 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,090,946
RAC: 103,877
Message 35194 - Posted: 8 May 2018, 18:50:44 UTC
Last modified: 8 May 2018, 18:54:58 UTC

Erich,
your computer shows Win7. I have such a CPU in a HomeServer 2011 with Windows Server 2008.
SVM must be enabled, so you must find the BIOS parameter. Otherwise there is no chance for AMD-V (in the BIOS it is called SVM).

Edit: If you can run 32-bit instead of x64, SVM is not necessary for Theory.
ID: 35194 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35195 - Posted: 8 May 2018, 19:10:36 UTC - in response to Message 35194.  
Last modified: 8 May 2018, 19:13:54 UTC

maeax wrote:
Erich, your Computer show Win7.
Maeax, I guess you mixed up my computers. It's the one with ID: 10544654. Definitely Vista (32-bit).

After having played around with the BIOS, I finally found a way to access the CPU settings (each of the sections of the BIOS must be accessed by a different f... key). Virtualization was indeed switched off, so I switched it on.
However, after rebooting the system, it says that it cannot be started, and it is now trying to do some repair. So let's see what will happen. Maybe all I can do with this notebook is to discard it :-(
ID: 35195 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35198 - Posted: 9 May 2018, 5:12:19 UTC - in response to Message 35195.  

Erich56 wrote:
However, after rebooting the system, it says that it can not be started, and now tries to do some repair. So let's see what will happen. Maybe all I can do with this notebook is to discard it :-(
After a while, a system restore started, and finally Vista booted the normal way.
However, VBox was no longer there, so I had to reinstall it.
And the Theory tasks seem to run okay. A "heavy birth", so to speak :-)
ID: 35198 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1114
Credit: 49,501,728
RAC: 4,157
Message 35200 - Posted: 9 May 2018, 6:39:00 UTC - in response to Message 35198.  

Erich, it takes a brave member here to try running a VB task on that old 2-core, 32-bit beast with Vista and only 2.75 GB RAM.
ID: 35200 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,916,955
RAC: 138,132
Message 35201 - Posted: 9 May 2018, 7:07:29 UTC - in response to Message 35200.  

But it's very fast:
A successful Theory task in only 1,391.40 seconds.
And most likely not affected by Spectre and Meltdown.
We need more of them.
;-D
ID: 35201 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,090,946
RAC: 103,877
Message 35202 - Posted: 9 May 2018, 7:21:10 UTC

Searching for my Win 3.11 disks ;-)
ID: 35202 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35204 - Posted: 9 May 2018, 11:01:11 UTC
Last modified: 9 May 2018, 11:02:00 UTC

Hi guys, you are all right, of course.
It's kind of a test to find out whether this old notebook can still serve some purpose :-)

So I thought, before it dies a slow death down in my basement, let's try. And from what it looks like right now: it works :-)

In fact, I also reactivated a netbook that is about 8 years old - see here: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10542973

Until 2 days ago, it crunched Sixtrack; since then, Theory :-)))

P.S. I should also look for an old Win 3.11 machine somewhere in my basement :-)))
ID: 35204 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,916,955
RAC: 138,132
Message 35205 - Posted: 9 May 2018, 11:39:58 UTC - in response to Message 35204.  

I wonder if an old C64 attached via acoustic coupler would work.
I guess they may at least need to extend the deadline for that.
:-)))
ID: 35205 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35215 - Posted: 10 May 2018, 5:21:53 UTC

Last night, a similar problem occurred on another of my machines with a Theory task, after more than 9 hours:

2018-05-10 04:42:26 (7228): VM Heartbeat file specified, but missing heartbeat.

The stderr details can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=190499426

Does anyone have any idea what happened?
BTW, this machine had successfully crunched a few Theory tasks before.
ID: 35215 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,090,946
RAC: 103,877
Message 35216 - Posted: 10 May 2018, 6:06:37 UTC
Last modified: 10 May 2018, 6:07:44 UTC

2018-05-10 04:23:54 (7228): VM state change detected. (old = 'running', new = 'paused')
2018-05-10 04:24:04 (7228): VM state change detected. (old = 'paused', new = 'running')
2018-05-10 04:42:26 (7228): VM Heartbeat file specified, but missing heartbeat.
2018-05-10 04:42:26 (7228): Powering off VM.
2018-05-10 04:42:27 (7228): Successfully stopped VM.
ID: 35216 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35217 - Posted: 10 May 2018, 6:21:30 UTC - in response to Message 35216.  

hm, strange thing.
why would the VM pause, all of a sudden? And 10 seconds later, it was running again, but still, 18 minutes later, no heartbeat - - - ???
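
The two intervals mentioned here can be read straight off the stderr lines quoted in the previous post. The following Python sketch does that arithmetic; it assumes the timestamp format shown in that excerpt (`YYYY-MM-DD HH:MM:SS (pid): message`), and the helper name `parse_events` is invented for this example.

```python
import re
from datetime import datetime

# One stderr line: "<date> <time> (<pid>): <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")

def parse_events(stderr_text):
    """Return (timestamp, message) pairs for every line that matches LINE_RE."""
    events = []
    for line in stderr_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, m.group(2)))
    return events

log = """\
2018-05-10 04:23:54 (7228): VM state change detected. (old = 'running', new = 'paused')
2018-05-10 04:24:04 (7228): VM state change detected. (old = 'paused', new = 'running')
2018-05-10 04:42:26 (7228): VM Heartbeat file specified, but missing heartbeat.
"""

events = parse_events(log)
paused_at = next(t for t, msg in events if "new = 'paused'" in msg)
resumed_at = next(t for t, msg in events if "new = 'running'" in msg)
failed_at = next(t for t, msg in events if "missing heartbeat" in msg)

pause_seconds = (resumed_at - paused_at).total_seconds()
gap_seconds = (failed_at - resumed_at).total_seconds()
print(f"paused for {pause_seconds:.0f} s; "
      f"heartbeat reported missing {gap_seconds:.0f} s after resume")
```

For this excerpt the pause is 10 seconds and the heartbeat failure follows 1102 seconds (18 min 22 s) after the VM resumed.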
ID: 35217 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35222 - Posted: 10 May 2018, 14:39:14 UTC - in response to Message 35217.  
Last modified: 10 May 2018, 14:53:51 UTC

Erich56 wrote:
hm, strange thing.
why would the VM pause, all of a sudden? And 10 seconds later, it was running again, but still, 18 minutes later, no heartbeat - - - ???
The same thing happened again after 8 1/2 hours of running, with another task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=190854556
The VM paused for 10 seconds, and this obviously killed the "heartbeat" (whatever "heartbeat" means).

What I also noticed in the Hypervisor System log (in the lower part of the stderr):
usbLibDevCfgDrGet: DeviceIoControl 1 fail dwErr (31)

Does anyone have any explanation for this strange behaviour?
ID: 35222 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,090,946
RAC: 103,877
Message 35223 - Posted: 10 May 2018, 15:28:04 UTC

Erich,
is anything about a pause shown in the Windows logs or the BOINC logs?
ID: 35223 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35224 - Posted: 10 May 2018, 20:31:11 UTC - in response to Message 35222.  


Erich56 wrote:
The VM paused for 10 seconds, and this obviously killed the "heartbeat" (whatever "heartbeat" means).

What I also noticed in the Hypervisor System log (in the lower part of the stderr):
usbLibDevCfgDrGet: DeviceIoControl 1 fail dwErr (31)

Does anyone have any explanation for this strange behaviour?


I don't have an explanation for the behavior, but I know what the heartbeat is. BOINC needs a quick and easy way to know if the project's app is still running, so the app periodically touches a disk file in .../../slots/shared. The period is ~60 seconds and the file is named heartbeat. Touching a file either creates the file or, if the file already exists, updates its 'last accessed' datetime.

heartbeat is zero-length (i.e. it's empty). You should be able to see heartbeat in the file manager. If not, then either your username doesn't have the required permissions or the file doesn't exist. Watch its last accessed datetime and notice that it increases by 60 secs every 60 secs.

So the VMwrapper (or possibly the VM itself?) touches heartbeat every 60 secs. BOINC periodically looks at heartbeat. At that point the possible scenarios go something like this:

1. If BOINC cannot see heartbeat, then it can reasonably assume either the app/VM never started or the app/VM deleted heartbeat and then died.

2. If BOINC can see heartbeat and its last accessed datetime has incremented since the previous time it looked at heartbeat, then BOINC can be reasonably sure the VM still lives.

3. If heartbeat exists but its last accessed datetime has not incremented, then BOINC could assume the VM lives but is hung, or it could assume the VM is dead but didn't delete heartbeat before it died.

In your case it sounds like BOINC is terminating the task because it has no heartbeat and appears to be dead.
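
The three scenarios above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual BOINC or wrapper code: the function name `check_heartbeat` is hypothetical, the file lives in a temporary directory rather than a real slot directory, and the sketch compares the file's modification time (an assumption on my part; touching a file updates both the access and the modification time). Timestamps are set explicitly with `os.utime` so the demo does not have to wait 60 seconds.

```python
import os
import tempfile
from pathlib import Path

def check_heartbeat(path, last_seen_mtime):
    """Classify the heartbeat file as 'missing', 'alive' or 'stale'.

    Returns (status, mtime_to_remember_for_the_next_check).
    """
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return "missing", last_seen_mtime       # scenario 1: no file at all
    if last_seen_mtime is None or mtime > last_seen_mtime:
        return "alive", mtime                   # scenario 2: timestamp moved forward
    return "stale", mtime                       # scenario 3: file exists, timestamp unchanged

with tempfile.TemporaryDirectory() as tmp:
    heartbeat = Path(tmp) / "heartbeat"

    # Checker looks before the app has created the file.
    status_before, seen = check_heartbeat(heartbeat, None)

    # The app touches the file (creates it, zero-length).
    heartbeat.touch()
    os.utime(heartbeat, (1_600_000_000, 1_600_000_000))  # pin the timestamps
    status_first, seen = check_heartbeat(heartbeat, seen)

    # Checker looks again, but the app has not touched the file since.
    status_same, seen = check_heartbeat(heartbeat, seen)

    # Simulate the next touch 60 seconds later.
    os.utime(heartbeat, (1_600_000_060, 1_600_000_060))
    status_next, seen = check_heartbeat(heartbeat, seen)

print(status_before, status_first, status_same, status_next)
```

A real checker would presumably tolerate several missed intervals before declaring the VM dead; that tolerance is left out here to keep the three cases easy to see.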
ID: 35224 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35225 - Posted: 11 May 2018, 5:34:50 UTC

Guys, thanks for your comments so far.

On the occasion of the monthly Windows Update, I rebooted the notebook yesterday, and after that the next Theory task finished successfully.
So maybe something had hung in the system before.
I'll keep my fingers crossed, anyway.
ID: 35225 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,352,535
RAC: 101,555
Message 35237 - Posted: 12 May 2018, 14:51:18 UTC

Maeax wrote:
Erich, is something paused seen in the Windows-logs or Boinc-logs?
I looked up the Windows event logs and the BOINC logs - nothing there, unfortunately :-(

With the most recent Theory task, the problem came up again - the task failed after some 8 1/2 hours - see here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=191133224

After the next task had run for about two hours, I stopped BOINC and rebooted the system (which is what I did 2 days ago on the occasion of the Windows Update, and afterwards a few tasks went well). So I will see whether this task will now fail or not.

The bad thing is that Theory is the only VB task which I can run on this 3GB memory netbook (out of which only 2.75GB can be used). All other VB tasks require too much RAM; and Sixtrack tasks are not available most of the time :-(

So, at the bottom line, I may need to switch to projects other than LHC :-(
ID: 35237 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1114
Credit: 49,501,728
RAC: 4,157
Message 35243 - Posted: 12 May 2018, 19:31:12 UTC
Last modified: 12 May 2018, 19:57:12 UTC

Erich, as you know, Windows 10 likes to tell your computer to reboot, and it is easy to miss it before it happens. VB doesn't like a reboot while a VB task is still running.

When that happens, some running tasks will fail, some might restart right where they were, and some even start over at 0% but keep the long run time they already had.

The main thing is this: if you are going to reboot while you have tasks running, suspend them first, then check the VB Manager before the reboot and make sure it says the tasks are *saved*, not still running and not still *saving*.

The heartbeat problem usually happens because of a reboot while tasks are running or a loss of the internet connection.
That is one thing that is easier about Sixtrack tasks: you can run them without the internet connected, but not VB tasks, since the running tasks and the server have to communicate.

And at times the server just can't connect to your ports: during "Testing CVMFS connection to lhchomeproxy.cern.ch" in the first few minutes you get "getaddrinfo: Temporary failure in name resolution", so you can't get the server X509 credential from LHC@home.

(btw those errors you had on May 8th were on the server end and not yours)
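
A "Temporary failure in name resolution" message means a host name could not be turned into an address. As a rough way to test that from the host side, here is a small Python sketch using the standard `socket.getaddrinfo` call. The helper name `can_resolve` is made up for this example, and the check runs on the host, not inside the VM, so it can only hint at what the VM saw.

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised for failures such as "Temporary failure in name resolution".
        return False

# A numeric address needs no DNS lookup, so this works even when offline.
loopback_ok = can_resolve("127.0.0.1")
print("loopback resolves:", loopback_ok)
```

To test the name from the message above, call `can_resolve("lhchomeproxy.cern.ch")` on the crunching machine; a False result while other sites resolve fine would point at DNS rather than at BOINC or VirtualBox.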
ID: 35243 · Report as offensive     Reply Quote