Computation error: "Heartbeat file missing"
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
Today I re-activated a Vista notebook for LHC crunching. VBox is 5.1.34 (together with the extension pack). Due to limited RAM (3 GB), I chose to run Theory Simulation (which works well on a netbook that also has 3 GB RAM). However, every Theory task fails after about 10 minutes ("computation error"); stderr says: "VM Heartbeat file specified, but missing. VM Heartbeat file specified, but missing file system status. (errno = '2')" The complete data for such a task can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=189848819 Could one of the experts please tell me what the problem is? |
Send message Joined: 15 Jun 08 Posts: 2435 Credit: 228,525,867 RAC: 123,501 |
Your log contains the following line: "00:00:02.314000 SUPR0QueryVTCaps -> VERR_SVM_DISABLED" Did you enable (if possible) VT-x/AMD-v? |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
"Did you enable (if possible) VT-x/AMD-v?" That's what I tried to do in the first place (after having checked the CPU specs as to whether virtualization is possible at all with the AMD Turion X2 Ultra Dual-Core). However, strangely enough, there was no way to see or access any CPU settings in the BIOS. The only thing the BIOS allowed was to perform a RAM test and a hard disk test. Really strange, I have never seen such a limited BIOS before (maybe there is something wrong with the BIOS). Then I remembered that when I tried an LHC VM task on another PC some time ago, the task failed after 1 or 2 seconds, since virtualization was NOT switched on. So my guess was that, since the task here started properly and ran for about 10 minutes, virtualization would be switched on by default (anything else would not make sense, with the BIOS not making it possible to switch virtualization on or off). However, if you are sure that the log entry "00:00:02.314000 SUPR0QueryVTCaps -> VERR_SVM_DISABLED" means that virtualization is not switched on, then I have a problem :-( EDIT: in the Windows Task Manager, VBoxHeadless.exe runs with 50% CPU usage (that is how my BOINC settings are configured) for the first 10 minutes. How could this happen if there was no virtualization? What I suspect is that maybe the Windows Firewall is causing problems. I added all the VBox exe files as exceptions, but this did not help :-( |
Send message Joined: 2 May 07 Posts: 2126 Credit: 160,016,997 RAC: 35,565 |
Erich, your Computer show Win7. I have such a CPU as a HomeServer2011 with Windows Server 2008. SVM must be enabled, so you must find the BIOS parameter. Otherwise there is no chance for AMD-V (in the BIOS it is called SVM). Edit: If you can run 32-bit instead of x64, SVM is not necessary for Theory. |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
"Erich, your Computer show Win7." Maeax, I guess you mixed up my computers. It's the one with ID: 10544654. Definitely Vista (32-bit). After having played around with the BIOS, I finally found a way to access the CPU settings (each of the sections of the BIOS must be accessed by a different f... key). Virtualization was indeed switched off, so I switched it on. However, after rebooting the system, it says that it can not be started, and now tries to do some repair. So let's see what will happen. Maybe all I can do with this notebook is to discard it :-( |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
"However, after rebooting the system, it says that it can not be started, and now tries to do some repair. So let's see what will happen. Maybe all I can do with this notebook is to discard it :-(" After a while, a system restore started, and finally Vista booted normally. However, VBox was no longer there, so I had to re-install it. Now the Theory tasks seem to run okay. A "heavy birth", so to say :-) |
Send message Joined: 24 Oct 04 Posts: 1130 Credit: 49,863,073 RAC: 8,485 |
Erich, it takes a brave member here to try running a VB task on that old 2-core 32-bit beast with Vista and only 2.75 GB RAM. |
Send message Joined: 15 Jun 08 Posts: 2435 Credit: 228,525,867 RAC: 123,501 |
But it's very fast: A successful Theory task in only 1,391.40 seconds. And most likely not affected by Spectre and Meltdown. We need more of them. ;-D |
Send message Joined: 2 May 07 Posts: 2126 Credit: 160,016,997 RAC: 35,565 |
Searching for my Win 3.11 disks ;-) |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
Hi guys, you are all right, of course. It's kind of a test to find out whether this old notebook can still fulfill some purpose :-) So I thought, before it dies a slow death down in my basement, let's try. And from what it looks like right now: it works :-) In fact, I also re-activated a roughly 8-year-old netbook, see here: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10542973 Until 2 days ago, it crunched Sixtrack; since then, Theory :-))) P.S. I should also look for an old Win 3.11 machine somewhere in my basement :-))) |
Send message Joined: 15 Jun 08 Posts: 2435 Credit: 228,525,867 RAC: 123,501 |
I wonder if an old C64 attached via acoustic coupler would work. I guess they may at least need to extend the deadline for that. :-))) |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
last night, there occurred a similar problem on another of my machines, with a Theory task, after more than 9 hours: 2018-05-10 04:42:26 (7228): VM Heartbeat file specified, but missing heartbeat. The stderr details can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=190499426 Anyone any idea what happened? BTW, this machine had successfully crunched a few Theory tasks before. |
Send message Joined: 2 May 07 Posts: 2126 Credit: 160,016,997 RAC: 35,565 |
2018-05-10 04:23:54 (7228): VM state change detected. (old = 'running', new = 'paused') 2018-05-10 04:24:04 (7228): VM state change detected. (old = 'paused', new = 'running') 2018-05-10 04:42:26 (7228): VM Heartbeat file specified, but missing heartbeat. 2018-05-10 04:42:26 (7228): Powering off VM. 2018-05-10 04:42:27 (7228): Successfully stopped VM. |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
hm, strange thing. why would the VM pause, all of a sudden? And 10 seconds later, it was running again, but still, 18 minutes later, no heartbeat - - - ??? |
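To put numbers on the timeline in the log lines quoted above, here is a minimal sketch that parses the timestamps and computes the length of the pause and the time from resume to the heartbeat failure. The function name `event_times` and the dictionary keys are made up for illustration; it only assumes the `YYYY-MM-DD HH:MM:SS` prefix seen in the stderr excerpt.

```python
from datetime import datetime

# Log lines copied from the stderr excerpt quoted in this thread.
LOG = """\
2018-05-10 04:23:54 (7228): VM state change detected. (old = 'running', new = 'paused')
2018-05-10 04:24:04 (7228): VM state change detected. (old = 'paused', new = 'running')
2018-05-10 04:42:26 (7228): VM Heartbeat file specified, but missing heartbeat.
2018-05-10 04:42:26 (7228): Powering off VM.
2018-05-10 04:42:27 (7228): Successfully stopped VM.
"""


def event_times(log: str) -> dict:
    """Return the timestamps of the pause, the resume and the heartbeat failure.

    Hypothetical helper: the keys 'paused', 'resumed' and 'heartbeat_lost'
    are illustrative names, not anything defined by BOINC.
    """
    events = {}
    for line in log.splitlines():
        if not line.strip():
            continue
        # The first 19 characters hold the 'YYYY-MM-DD HH:MM:SS' timestamp.
        when = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        if "new = 'paused'" in line:
            events["paused"] = when
        elif "new = 'running'" in line:
            events["resumed"] = when
        elif "missing heartbeat" in line:
            events["heartbeat_lost"] = when
    return events


if __name__ == "__main__":
    ev = event_times(LOG)
    print("pause length (s):", (ev["resumed"] - ev["paused"]).total_seconds())
    print("resume to failure (s):", (ev["heartbeat_lost"] - ev["resumed"]).total_seconds())
```

For this excerpt the pause lasts 10 seconds and the failure is reported 1102 seconds (18 min 22 s) after the VM resumed, which matches the "18 minutes later" observation.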
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
"hm, strange thing." The same thing happened again after 8 1/2 hours of running, with another task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=190854556 The VM paused for 10 seconds, and this obviously killed the "heartbeat" (whatever "heartbeat" means). What I also noticed in the Hypervisor System log (in the lower part of the stderr): usbLibDevCfgDrGet: DeviceIoControl 1 fail dwErr (31) Does anyone have an explanation for this strange behaviour? |
Send message Joined: 2 May 07 Posts: 2126 Credit: 160,016,997 RAC: 35,565 |
Erich, is something paused seen in the Windows-logs or Boinc-logs? |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I don't have an explanation for the behavior but I know what the heartbeat is. BOINC needs a quick and easy way to know if the project's app is still running, so the app periodically touches a disk file in .../../slots/shared. The period is ~60 seconds and the file is named heartbeat. Touching a file either creates the file or, if the file already exists, updates its "last accessed" datetime. heartbeat is zero-length (i.e. it's empty). You should be able to see heartbeat in a file manager. If not, then either your username doesn't have the required permissions or the file doesn't exist. Watch its last accessed datetime and notice that it increases by 60 secs every 60 secs. So the VM wrapper (or possibly the VM itself?) touches heartbeat every 60 secs. BOINC periodically looks at heartbeat. At that point the possible scenarios go something like this: 1. If BOINC cannot see heartbeat, then it can reasonably assume either the app/VM never started or the app/VM deleted heartbeat and then died. 2. If BOINC can see heartbeat and its last accessed datetime has incremented since the previous time it looked, then BOINC can be reasonably sure the VM still lives. 3. If heartbeat exists but its last accessed datetime has not incremented, then BOINC could assume the VM lives but is hung, or it could assume the VM is dead but didn't delete heartbeat before it died. In your case it sounds like BOINC is terminating the task because it has no heartbeat and appears to be dead. |
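The three scenarios above can be sketched roughly as follows. This is only an illustration of the idea, not the actual BOINC or vboxwrapper code: the function names `touch_heartbeat` and `classify_heartbeat` and the 120-second threshold are assumptions, and the check uses the file's modification time rather than the "last accessed" time, because access-time updates can be disabled on some filesystems.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def touch_heartbeat(path: Path) -> None:
    """Create the (empty) file if needed and set its timestamps to now."""
    path.touch(exist_ok=True)
    os.utime(path, None)  # None means: set access and modification time to now


def classify_heartbeat(path: Path, max_age_s: float,
                       now: Optional[float] = None) -> str:
    """Return 'missing', 'alive' or 'stale' for a heartbeat file.

    Hypothetical helper: 'max_age_s' is an assumed tolerance, not a
    value taken from BOINC.
    """
    if now is None:
        now = time.time()
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return "missing"  # scenario 1: file cannot be seen
    if now - mtime <= max_age_s:
        return "alive"    # scenario 2: timestamp was updated recently
    return "stale"        # scenario 3: file exists but was not touched


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        hb = Path(tmp) / "heartbeat"
        print(classify_heartbeat(hb, 120))  # before the first touch
        touch_heartbeat(hb)
        print(classify_heartbeat(hb, 120))  # right after a touch
        # Pretend ten minutes passed without another touch.
        print(classify_heartbeat(hb, 120, now=time.time() + 600))
```

A checker like this cannot tell a hung VM from a dead one in the "stale" case, which is the ambiguity described in scenario 3.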
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
Guys, thanks for your comments so far. On occasion of the monthly Windows Update, I rebooted the notebook yesterday, and after that the next Theory task got finished successfully. So maybe something had hung in the system before. I'll keep my fingers crossed, anyway. |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,910,023 RAC: 60,267 |
Maeax wrote: "Erich, is something paused seen in the Windows-logs or Boinc-logs?" I looked through the Windows event logs and the BOINC logs: nothing there, unfortunately :-( With the most recent Theory task, the problem came up again. The task failed after some 8 1/2 hours, see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=191133224 After the next task had run for about two hours, I stopped BOINC and rebooted the system (as I did 2 days ago on the occasion of the Windows Update, after which a few tasks went well). So I will see whether this task now fails or not. The bad thing is that Theory is the only VB task I can run on this 3 GB netbook (of which only 2.75 GB can be used). All other VB tasks require too much RAM, and Sixtrack tasks are not available most of the time :-( So, bottom line, I may need to switch to projects other than LHC :-( |
Send message Joined: 24 Oct 04 Posts: 1130 Credit: 49,863,073 RAC: 8,485 |
Erich, as you know, Windows 10 likes to tell your computer to reboot, and it is easy to miss catching it before it happens. VB doesn't like a reboot while a VB task is still running. Some running tasks will fail when that happens, some might restart right where they were, and some even start over at 0% but keep the long running time they already had. The main thing, if you are going to reboot while you have some task(s) running, is to suspend them and then check the VB Manager before the reboot: the tasks should be shown as *saved*, not still running and not still *saving*. The heartbeat problem usually happens because of a reboot while tasks are running or because of a loss of the internet connection. That is one thing that is easier with Sixtrack, since you can run those tasks without an internet connection; VB tasks can't, since the running tasks and the server have to communicate. And at times the server just can't connect to your ports: you see "Testing CVMFS connection to lhchomeproxy.cern.ch" and, in the first few minutes, "getaddrinfo: Temporary failure in name resolution", so you can't get the server X509 credential from LHC@home. (BTW, those errors you had on May 8th were on the server end and not yours.) |
©2024 CERN