Message boards : LHCb Application : Heartbeat error - VirtualBox Tasks failing

Ron S

Joined: 31 Jul 05
Posts: 4
Credit: 2,957,886
RAC: 0
Message 28861 - Posted: 14 Feb 2017, 2:30:38 UTC

Hello,
After returning to BOINC after half a year away, on a freshly installed Windows 7, I see nearly all of my tasks that use a VM failing. Tasks from other projects such as Atlas or Cosmology never use the CPU either, although they are shown as active.

The error always is:
2017-02-12 02:09:03 (1956): VM Heartbeat file specified, but missing.
2017-02-12 02:09:03 (1956): VM Heartbeat file specified, but missing file system status. (errno = '2')
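
For reference, errno 2 in the second line is the standard "no such file or directory" code (ENOENT), which suggests the heartbeat file simply did not exist when the wrapper looked for it, rather than being unreadable. A minimal Python sketch for decoding such numbers; `describe_errno` is a hypothetical helper name and not part of BOINC or its VirtualBox wrapper:

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and the OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"
```

Calling `describe_errno(2)` returns a string that starts with `ENOENT`, followed by the operating system's own wording for the error.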

I am not very knowledgeable about VMs, but VirtualBox itself does not seem to be broken, as I can run a virtual Windows 7 installation without a problem.

Here are my results over the last few days:
LHCb: 66/70 failed, 4/70 valid
CMS: 7/7 failed
Theory: 9/9 failed

Here are some examples of WUs that failed:
LHCb: http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=119686103
CMS: http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=119797929
Theory: http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=119801022

I checked the logs that I could find and the same errors appear in all of them:

Vbox.log:
ERROR [COM]: aRC=VBOX_E_INVALID_VM_STATE (0x80bb0002) aIID={872da645-4a9b-1727-bee2-5585105b9eed} aComponent={ConsoleWrap} aText={Invalid machine state Paused when checking if the guest entered the ACPI mode)}, preserve=false aResultDetail=0

VboxSVC.log:
ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0

ERROR [COM]: aRC=E_FAIL (0x80004005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={Runtime error opening 'B:\Boinc UserData\slots\7\boinc_12312894217b2bed\boinc_12312894217b2bed.vbox' for reading: -103(Path not found.).
00:21:01.125925 F:\tinderbox\win-5.1\src\VBox\Main\src-server\MachineImpl.cpp[745] (long __cdecl Machine::i_registeredInit(void))}, preserve=false aResultDetail=0

However, I don't really understand what is going on, and the errors seem quite illogical to me. For example, I can open the directory named in this error message without a problem, so the path exists, assuming it is the same location the VM is run from. I can't verify that anymore, because BOINC deletes all the files immediately after the crash: "Runtime error opening 'B:\Boinc UserData\slots\7\boinc_12312894217b2bed\boinc_12312894217b2bed.vbox' for reading: -103(Path not found.)"
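
Because BOINC removes the slot contents as soon as a task fails, one way to find out whether the .vbox file ever exists at that path is to poll the slots directory while a task is starting. A rough Python sketch under that assumption; `snapshot` and `watch` are hypothetical helper names, and the slots location is the one shown in the error message:

```python
import time
from pathlib import Path


def snapshot(slots_dir):
    """Map each .vbox file under slots_dir to its size in bytes right now."""
    result = {}
    for path in Path(slots_dir).rglob("*.vbox"):
        try:
            result[str(path)] = path.stat().st_size
        except OSError:
            # The file vanished between listing and stat; record that.
            result[str(path)] = None
    return result


def watch(slots_dir, interval=1.0, rounds=60, sleep=time.sleep, log=print):
    """Poll slots_dir and log .vbox files as they appear or disappear."""
    seen = set()
    for _ in range(rounds):
        current = set(snapshot(slots_dir))
        for path in sorted(current - seen):
            log(f"appeared: {path}")
        for path in sorted(seen - current):
            log(f"vanished: {path}")
        seen = current
        sleep(interval)
    return seen
```

Running `watch(r"B:\Boinc UserData\slots")` in a console while tasks start would print one line when a .vbox file shows up and another when it is deleted. If the file never appears at all, the "Path not found" message would be literally true at the moment VirtualBox tried to read it.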

These errors happened with boinc_7.6.33_windows_x86_64_vbox. I tried updating to the current version of VirtualBox, which didn't solve the problem at all. Installing the extension pack and reinstalling VirtualBox 5.1 in a new program directory finally broke everything: tasks started in BOINC and stayed active the whole time, but never even appeared in the VirtualBox Manager.

Now I am back to boinc_7.6.33_windows_x86_64_vbox and back to the same errors as mentioned above.

I hope someone can help me because this affects half of my projects.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 598
Credit: 373,584,383
RAC: 42,059
Message 28863 - Posted: 14 Feb 2017, 6:34:22 UTC
Last modified: 14 Feb 2017, 6:35:26 UTC

Make sure you have the latest version of VirtualBox; it needs to be at least 5.1.

The heartbeat errors are often network related. Do you see any network problems?

Try starting with just the Theory project, as its tasks are the easiest to get working.
Ron S

Message 28868 - Posted: 14 Feb 2017, 13:37:40 UTC
Last modified: 14 Feb 2017, 14:07:36 UTC

Thank you for your help.
I will switch to Theory tasks for testing then.

Is there a way to totally uninstall BOINC so that it forgets everything? Each time I reinstall it, it remembers my custom folder location. I would like to fully erase all the settings to see whether something got screwed up there.

Also, how do I close BOINC so that there are no leftover processes? For example, I always have to manually close boinctray and some vbox processes that stay in memory.

As for network problems: yes, I do have some, but never at the times the heartbeat errors occurred. Also, the pings in the log files seem to succeed. If I look at the VMs through BOINC, the last two lines are often "[Debug] HTCondor Ping" followed by the response "[Debug] 0". Is that good or not?
Also, how would I check that in a reliable way?

I don't know whether the rest of what I write here is even related, but it is a big problem for me in general. My internet regularly stops working in an odd way. Sites like Google still work at normal speed, but 90% of sites are so slow that they are unreachable and end in network errors. Opening Google's cached version of a page mostly works, however. If I manage to open Speedtest.net, it shows a speed between 1/10 and 1/100 of my normal speed.

I tried everything I could think of, but nothing helps. It happens with every device on the modem, freshly installed or not, running alone or with others, over wifi or cable, on smartphone, laptop or PC.
My ISP came, exchanged the modem/router and checked the network traffic; all was fine. They suspect an old cable with too much noise on it. However, while the signal on one channel is often at the low end of acceptable or slightly below, each time the problem occurs it is well within good levels.

The only other thing I found on the net is a possible DNS error, which would allow only already cached websites to be opened. However, this is way beyond my understanding of networks, and I have no idea what to do or what to check. I don't even know whether I should check the DNS in the modem/router or on the computer, as all attached devices are affected.
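
One way to separate a DNS problem from a slow line is to time the name lookup and the TCP connection separately while the slowdown is happening. The following is a minimal Python sketch using only the standard `socket` module; the function names (`timed`, `resolve`, `connect`, `check`) are hypothetical, and port 443 is an assumption that the tested site serves HTTPS:

```python
import socket
import time


def timed(fn, *args):
    """Run fn(*args) and return (seconds_elapsed, result, error)."""
    start = time.perf_counter()
    try:
        result, error = fn(*args), None
    except OSError as exc:  # includes DNS failures and timeouts
        result, error = None, exc
    return time.perf_counter() - start, result, error


def resolve(host):
    """Resolve a host name to a sorted list of unique IP address strings."""
    infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def connect(ip, port=443, timeout=5.0):
    """Open and immediately close a TCP connection to ip:port."""
    with socket.create_connection((ip, port), timeout=timeout):
        return True


def check(host, resolver=resolve, connector=connect):
    """Time the DNS lookup and the TCP connect separately for one host."""
    dns_s, ips, dns_err = timed(resolver, host)
    report = {"host": host, "dns_seconds": dns_s, "dns_error": dns_err,
              "ips": ips or [], "connect_seconds": None, "connect_error": None}
    if ips:
        conn_s, _, conn_err = timed(connector, ips[0])
        report["connect_seconds"] = conn_s
        report["connect_error"] = conn_err
    return report
```

Calling `check()` on a site that works and on one that does not, during a slowdown, gives a rough indication. If the lookup is slow or fails while the connect is quick, the resolver is the likely suspect, and since every device is affected that would point at the DNS servers configured in the modem/router rather than on one computer. If both steps are slow, the line itself is more likely.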
Ron S

Message 28874 - Posted: 14 Feb 2017, 18:14:50 UTC

Update:
Cliff notes

I have boinc_7.6.33_windows_x86_64 and VirtualBox Version 5.1.14 r112924 (Qt5.6.2) installed.

I can now run as many Theory tasks as I want, as long as I don't start them all at once. I also got 2 LHCb tasks running. I will test more LHCb tasks after the 9 Theory and 2 LHCb tasks have finished successfully.

If I can't find the bottleneck that prevents me from starting multiple VMs at once, is there any way to queue them? Is there a way in BOINC or VirtualBox to let VMs/tasks start only at 30-second intervals from each other?
Or, if this is not possible, is there a way to limit the maximum number of tasks LHC@home can run at once? As in really setting the number of tasks, not some vague CPU usage percentage.
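
No setting for delaying VM starts is named in this thread. One possible workaround is to suspend the waiting tasks and then resume them one at a time with `boinccmd --task <project_url> <task_name> resume`. A minimal Python sketch under these assumptions: `boinccmd` is on the PATH and allowed to talk to the local client, and the project URL and task names are copied from `boinccmd --get_tasks`. The helper names `resume_command` and `resume_staggered` are hypothetical:

```python
import subprocess
import time


def resume_command(project_url, task_name, boinccmd="boinccmd"):
    """Build the boinccmd call that resumes one suspended task."""
    return [boinccmd, "--task", project_url, task_name, "resume"]


def resume_staggered(project_url, task_names, delay=30.0,
                     run=subprocess.run, sleep=time.sleep):
    """Resume tasks one by one, waiting `delay` seconds between starts."""
    commands = []
    for index, name in enumerate(task_names):
        if index > 0:
            sleep(delay)  # give the previous VM time to boot
        cmd = resume_command(project_url, name)
        run(cmd, check=True)  # raises if boinccmd reports an error
        commands.append(cmd)
    return commands
```

Resuming a task only makes it eligible to run; the client still decides when to start it, so this spreads out the starts only if free CPU slots are available.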

---------------------
The more detailed info

If I start multiple tasks at once, then most of them never use any CPU after the initializing period. Even pausing and unpausing doesn't get them to work. However if I reset the VM of the task then it will start up.

So I am fairly certain that there is some bottleneck somewhere which prevents the VMs from running if I start multiple ones at once.

My LHCb tasks didn't crash with a Windows error, but didn't use the CPU and the VM said this:

This made me doubt my HDD, as I had put in an old one just for BOINC, thinking I would throw it away when it fails. It makes noticeable sounds when all VMs try to start at once. So I put my BOINC directory on an SSD.

As I did two steps in one (changing HDD to SSD and starting tasks one after another), I can't say for sure if the SSD helped or not. However as tasks still fail when I start them all at once, it is probably another bottleneck.
On the other hand, I get no more errors in the logs and the VM starts like this:


So I am back to my question from my last post, how can I check my network connection in a reliable way? I will exclude the HDD as possible reason after the running tasks finish ok.
Toby Broom
Volunteer moderator

Message 28877 - Posted: 14 Feb 2017, 18:53:43 UTC

Ron S wrote:
Is there a way to totally uninstall Boinc so that it forgets everything? Each time I reinstall it, it remembers my custom folder location. I would like to fully erase all the settings to see if something screwed up there.


Not sure, I have never tried a total uninstall.


Ron S wrote:
Also, how to close BOINC so that there are no leftover processes? For example I always have to manually close boinctray and some vbox processes that stay in memory I guess.


The VBoxSVC process stays running all the time; you can kill it if you need to. I do what you do: quit BOINC, then watch in the VirtualBox GUI for all the VMs to save.


Ron S wrote:
As for network problems, yes I do have some, however never when the Heartbeat errors occurred. Also the pings in the logfiles seem to succeed. If I look at the VMs through BOINC often the last two lines are "[Debug] HTCondor Ping" and then the response is "[Debug] 0". Is that good or not?
Also how would I check that in a reliable way?


0 is normally a good result, so I assume the same here. I imagine it takes roughly 600 seconds from the problem occurring for it to be recognized.


I get some errors too, but the rate is less than 15%, which seems acceptable to the project. If you have more than one WU running, not much time is wasted.

Ron S wrote:
If I can't find the bottleneck that prevents me from starting multiple VMs at once, is there any way to queue them? Is there a way in BOINC or VirtualBox to let VMs/tasks only start in 30second intervals of each other?
Or if this is not possible, is there a way to limit the maximum tasks LHC@home can run at once? As in really setting the number of tasks, not some vague CPU usage percentage.


I think they naturally spread themselves out over time.

You can limit the number of tasks in the project settings, or locally via an app_config.xml file.
Ron S

Message 28893 - Posted: 15 Feb 2017, 20:39:06 UTC

I finally identified the culprit.
It was the HDD, a 200GB WD2000JD SATA1 drive from 2003. BOINC probably just needs a newer/faster HDD, as the drive itself shows no errors.

I also limited the maximum number of simultaneous tasks per project through app_config.xml, as explained here: http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimitsadvanced
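
For readers who want to do the same, here is a minimal Python sketch that generates such a file. The element names `project_max_concurrent`, `app`, `name` and `max_concurrent` are the ones the BOINC client reads from app_config.xml. The function name `build_app_config`, the limits and the app name "Theory" are illustrative assumptions; the real short app names are listed in the client's client_state.xml.

```python
import xml.etree.ElementTree as ET


def build_app_config(project_max=None, per_app=None):
    """Return app_config.xml text that limits concurrently running tasks."""
    root = ET.Element("app_config")
    if project_max is not None:
        # Upper bound on running tasks for the whole project.
        ET.SubElement(root, "project_max_concurrent").text = str(project_max)
    for app_name, limit in (per_app or {}).items():
        # Optional per-application limit inside the project.
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(limit)
    ET.indent(root)  # one element per line; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")
```

The output of, for example, `build_app_config(project_max=3, per_app={"Theory": 2})` would be saved as app_config.xml in that project's folder inside the BOINC data directory. The client picks it up on restart or after re-reading the config files from the Manager.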

Now I can run:
- all LHC@home tasks, multi- and single-core (I didn't get any CMS tasks yet)
- all LHC@home_dev tasks except the multicore CMS tasks
- no Cosmology@home multicore tasks
- all projects that don't use VirtualBox

There is still some problem with multicore tasks, but as some projects run and others don't, I assume the problem is not on my side. Anyway, I don't have the time to troubleshoot for days, so I just disabled them.

Still, I feel like the applications waste/use too many resources.
Even on my SSD, starting 6 tasks at once results in 1 not starting correctly. I use multiple projects that use VirtualBox, so this is much more likely to happen. On the old HDD, even 1 task was already too much to start up correctly.

Are there any minimum requirements for HDDs used with VirtualBox? I would really prefer putting BOINC on an HDD instead of on one of the SSDs. I need the SSDs for work and don't want them half occupied by BOINC all the time.