Message boards : Number crunching : Missing heartbeat file errors

Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28380 - Posted: 7 Jan 2017, 19:21:26 UTC - in response to Message 28355.  
Last modified: 7 Jan 2017, 20:20:50 UTC

I compared the DNS settings in the .vbox file you linked to in the message I am replying to and the .vbox files in the failing BOINC tasks, and noticed that the build .vbox file is set to not use a DNS proxy, but the BOINC tasks do use a DNS proxy. Could this be causing problems?

EDIT: I just came up with a wild guess at what could be going wrong. After reading https://www.virtualbox.org/manual/ch09.html#nat_host_resolver_proxy, I thought that Windows 10's networking stack could be intercepting all DNS replies and dumping them into Windows's DNS cache. If that is the case, the DNS reply never gets back to the VM. The way to solve that is to use the host's DNS resolver as a DNS proxy, as described in that link. I know this is a wild guess, but it could not hurt to try. Using the host's DNS resolver as a DNS proxy would also cut down on DNS traffic when multiple virtual machines run at the same time, because the host's DNS cache would answer repeated requests for the same domain names; those requests are redundant from the point of view of the real machine but not of each VM. If this change succeeds, I would recommend it to other VirtualBox projects like ATLAS@home, both to improve software firewall compatibility (some firewalls might expect DNS queries from the host OS's DNS resolver and nowhere else) and to cut the redundant DNS traffic.
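The comparison described above could be automated along these lines. This is only a sketch under assumptions: the XML namespace, the `<NAT>`/`<DNS>` element layout, and the attribute names `use-proxy` and `use-host-resolver` are how .vbox files are believed to store these options, and the two sample documents are hypothetical, heavily trimmed stand-ins for the real build and task files.

```python
import xml.etree.ElementTree as ET

# Assumption: .vbox files use this namespace and keep NAT DNS options
# as attributes of a <DNS> element nested inside <NAT>.
NS = {"vb": "http://www.virtualbox.org/"}


def nat_dns_settings(vbox_xml: str) -> dict:
    """Return the attributes of the first NAT <DNS> element, or {} if absent."""
    root = ET.fromstring(vbox_xml)
    dns = root.find(".//vb:NAT/vb:DNS", NS)
    return dict(dns.attrib) if dns is not None else {}


def diff_settings(a: dict, b: dict) -> dict:
    """Map every key whose value differs to its (a, b) pair."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical samples: only the use-proxy value differs between them.
TEMPLATE = (
    '<VirtualBox xmlns="http://www.virtualbox.org/"><Machine><Hardware><Network>'
    '<Adapter slot="0" enabled="true"><NAT>'
    '<DNS use-proxy="{proxy}" use-host-resolver="false"/>'
    "</NAT></Adapter></Network></Hardware></Machine></VirtualBox>"
)
BUILD_VBOX = TEMPLATE.format(proxy="false")
TASK_VBOX = TEMPLATE.format(proxy="true")

if __name__ == "__main__":
    print(diff_settings(nat_dns_settings(BUILD_VBOX), nat_dns_settings(TASK_VBOX)))
```

To use it on real files, read each .vbox file into a string and pass it to `nat_dns_settings`; an empty result would suggest the layout assumed here does not match that VirtualBox version.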
ID: 28380
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1263
Credit: 8,420,582
RAC: 5,321
Message 28381 - Posted: 7 Jan 2017, 20:05:01 UTC

I'm guessing that Theory doesn't have jobs at the moment.
Also a task within BOINC didn't get jobs -> https://lhcathome.cern.ch/lhcathome/result.php?resultid=110944582

Difficult to test without jobs.
ID: 28381
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 28382 - Posted: 7 Jan 2017, 20:14:49 UTC - in response to Message 28381.  
Last modified: 7 Jan 2017, 20:15:09 UTC

I'm guessing that Theory doesn't have jobs at the moment.
Also a task within BOINC didn't get jobs -> https://lhcathome.cern.ch/lhcathome/result.php?resultid=110944582

Difficult to test without jobs.


Jobs are in the queue but it looks like the external firewall again. Should be open in a few hours once the configuration has been refreshed.
ID: 28382
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 28383 - Posted: 7 Jan 2017, 20:37:20 UTC - in response to Message 28380.  

I just came up with a possible wild guess


It does not sound so wild. Going back to the original error message, I managed to reproduce it just by starting the VM with the network disabled on my machine. Two things don't make sense to me at the moment. The first is why this file is not found, as it should be cached; the second is why, if it is network related, the cached image fails but CernVM with the ISO works. It is difficult for me to investigate further until I am back in the office on Monday, as I don't have a Windows 10 machine at home. However, I am pretty sure that this error is due to a networking issue between the VM and the host machine. Setting cable="false" was just another way to disable the network on my machine, so it could just be a red herring.
ID: 28383
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1263
Credit: 8,420,582
RAC: 5,321
Message 28384 - Posted: 7 Jan 2017, 20:39:17 UTC - in response to Message 28382.  

Should be open in a few hours once the configuration has been refreshed.

I think my eyes are closed then.
I'll test tomorrow again with the cable="false" addition to the adapter slot="0"
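For anyone wanting to repeat this test, the edit could be scripted roughly as follows. This is only a sketch: the `cable` attribute and `slot="0"` come from the posts above, while the XML namespace, the trimmed sample document, and the helper name `set_cable` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

VBOX_NS = "http://www.virtualbox.org/"  # assumed namespace of .vbox files

# Hypothetical, heavily trimmed stand-in for a real .vbox file.
SAMPLE = (
    f'<VirtualBox xmlns="{VBOX_NS}"><Machine><Hardware><Network>'
    '<Adapter slot="0" enabled="true" cable="true"/>'
    '<Adapter slot="1" enabled="false" cable="true"/>'
    "</Network></Hardware></Machine></VirtualBox>"
)


def set_cable(vbox_xml: str, slot: str = "0", connected: bool = False) -> str:
    """Return the XML with cable="true"/"false" set on the given adapter slot."""
    # Keep the default namespace on output instead of generated ns0: prefixes.
    ET.register_namespace("", VBOX_NS)
    root = ET.fromstring(vbox_xml)
    for adapter in root.iter(f"{{{VBOX_NS}}}Adapter"):
        if adapter.get("slot") == slot:
            adapter.set("cable", "true" if connected else "false")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(set_cable(SAMPLE))
```

The VM should not be running while its .vbox file is edited, and a manual change like this may be overwritten when the task restarts.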
ID: 28384
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28385 - Posted: 7 Jan 2017, 23:18:53 UTC - in response to Message 28383.  
Last modified: 7 Jan 2017, 23:44:57 UTC

Since the test CernVM virtual machines that I set up with the ISO, which worked on my machine, were not set up to use a DNS proxy, disabling the NAT DNS proxy could be a good short-term fix. However, that could cause problems if the machine's DNS servers change, for example when a DHCP lease expires and the replacement lease lists different DNS servers (say, because the old ones are down for maintenance), or when the user switches internet connections because the other one failed.

There are three ways DNS can be set up in VirtualBox, according to the manual at https://www.virtualbox.org/manual/ch09.html. The first is the default method, where VirtualBox passes the computer's configured DNS server addresses to the virtual machine. The second is the original NAT proxy method, where VirtualBox passes a private DNS server IP address to the virtual machine, intercepts the DNS traffic, and performs twice NAT on it (twice NAT is Cisco's terminology for translating both the source and destination IP addresses in one transaction). The third is the alternate NAT proxy method, in which VirtualBox again passes a private DNS server IP address to the virtual machine to intercept the DNS traffic, but then reconstructs each DNS request through the host OS's DNS API so that it goes through the OS's DNS resolver as a proxy. The last method is the most work for VirtualBox, but it will cut down on network traffic, because the host OS's DNS cache will answer requests from multiple virtual machines looking for the same server. It could also improve firewall compatibility, because some host-based firewalls expect DNS traffic to come only from the OS's resolver and might block other programs doing DNS themselves as attempts to get around the firewall. It is possible that the twice-NAT DNS traffic is being misrouted into the Windows DNS cache instead of to the process doing the twice NAT. It is also possible that VirtualBox has a bug in the twice-NAT DNS proxy method that is only exposed on Windows 10.
EDIT: Fix wrong URL
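The three DNS setups described above map onto two `VBoxManage modifyvm` switches, `--natdnsproxy<N>` and `--natdnshostresolver<N>`, as documented in the VirtualBox manual chapter linked in this post. The sketch below only builds the command lines and does not run them; the mode names, the helper `dns_mode_commands`, and the VM name are hypothetical, and newer VirtualBox releases may spell the options differently.

```python
# Mode name -> (natdnsproxy value, natdnshostresolver value).
# The mode names are labels invented for this sketch.
DNS_MODES = {
    "passthrough": ("off", "off"),    # default: guest gets the host's DNS servers
    "nat-proxy": ("on", "off"),       # VirtualBox proxies DNS traffic itself
    "host-resolver": ("off", "on"),   # requests go through the host OS resolver
}


def dns_mode_commands(vm_name: str, mode: str, adapter: int = 1) -> list[list[str]]:
    """Build the VBoxManage invocations that select one DNS mode for a NAT adapter."""
    if mode not in DNS_MODES:
        raise ValueError(f"unknown DNS mode: {mode!r}")
    proxy, resolver = DNS_MODES[mode]
    base = ["VBoxManage", "modifyvm", vm_name]
    return [
        base + [f"--natdnsproxy{adapter}", proxy],
        base + [f"--natdnshostresolver{adapter}", resolver],
    ]


if __name__ == "__main__":
    # "my-test-vm" is a placeholder VM name.
    for cmd in dns_mode_commands("my-test-vm", "host-resolver"):
        print(" ".join(cmd))
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` while the VM is powered off. Setting both switches explicitly avoids depending on whatever state the VM was left in.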
ID: 28385
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1263
Credit: 8,420,582
RAC: 5,321
Message 28389 - Posted: 8 Jan 2017, 8:36:24 UTC - in response to Message 28384.  
Last modified: 8 Jan 2017, 8:43:50 UTC

Should be open in a few hours once the configuration has been refreshed.

I think my eyes are closed then.
I'll test tomorrow again with the cable="false" addition to the adapter slot="0"

New jobs are flowing.

Added the cable="false" and started the VM.
This time I got a job running, but my manual addition was removed from the vbox-file.

Vbox.log: 00:00:02.994004 CableConnected <integer> = 0x0000000000000001 (1)
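A quick way to check what the VM actually started with is to search the log for this entry rather than trusting the .vbox file. The sketch below is built around the single log line quoted above; the regular expression and the helper name `cable_connected` are assumptions, and other VirtualBox versions may format the entry differently.

```python
import re
from typing import Optional

# Matches entries such as:
#   00:00:02.994004 CableConnected <integer> = 0x0000000000000001 (1)
CABLE_RE = re.compile(r"CableConnected\s+<integer>\s+=\s+0x[0-9a-fA-F]+\s+\((\d+)\)")


def cable_connected(log_text: str) -> Optional[bool]:
    """Return True/False for the first CableConnected entry, or None if there is none."""
    match = CABLE_RE.search(log_text)
    if match is None:
        return None
    return match.group(1) != "0"


if __name__ == "__main__":
    line = "00:00:02.994004 CableConnected <integer> = 0x0000000000000001 (1)"
    print(cable_connected(line))
```

A result of `True` after adding cable="false" would mean the manual change did not reach the running VM, which matches the vbox-file having been rewritten.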

Time for the users with Windows 10 to figure out what's going wrong. IMO it's a network issue.
ID: 28389
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1263
Credit: 8,420,582
RAC: 5,321
Message 28390 - Posted: 8 Jan 2017, 9:54:56 UTC

On the next try with cable="false", the addition stayed in the vbox-file and, more importantly, the VM is running normally and processing events.
ID: 28390
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 28397 - Posted: 8 Jan 2017, 20:37:19 UTC - in response to Message 28390.  

It looks like the issue was caused by the Windows 10 update KB3206632 and is fixed by KB3213522.

I will follow up on why we get the heartbeat error and not the no network error.
ID: 28397
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1112
Credit: 49,476,392
RAC: 6,583
Message 28399 - Posted: 8 Jan 2017, 20:42:06 UTC - in response to Message 28397.  

It looks like the issue was caused by the Windows 10 update KB3206632 and is fixed by KB3213522.

I will follow up on why we get the heartbeat error and not the no network error.



That can not be the reason since MANY Windows 10 crunchers here have been turning in Valid tasks every day before and after ANY Win 10 Updates
Volunteer Mad Scientist For Life
ID: 28399
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 28400 - Posted: 8 Jan 2017, 20:46:26 UTC - in response to Message 28399.  

That can not be the reason since MANY Windows 10 crunchers here have been turning in Valid tasks every day before and after ANY Win 10 Updates


There may be some specific combination of settings that contributes to this, but it is clear from the information from Microsoft that something relating to virtualization was broken in the update.
ID: 28400
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1112
Credit: 49,476,392
RAC: 6,583
Message 28402 - Posted: 8 Jan 2017, 21:00:17 UTC - in response to Message 28400.  
Last modified: 8 Jan 2017, 21:30:18 UTC

https://lhcathome.cern.ch/lhcathome/top_hosts.php

Well with Windows 10 we don't get to skip Updates and have to do these Updates and can only set the Update Settings to wait according to the personal settings for no more than 2 days.

How could this possibly be the reason why only a handful of Win 10 PCs had this problem, when all the ones on the stats page show that it was never a problem here or at vLHC or vLHC-dev or Atlas?

Microsoft never told any Insiders about this mysterious problem that only happened to a couple people trying these VB tasks.

I never had that problem here and the same is seen on the stats page.

Edit: Windows 10 Updates are nothing like the Updates on the previous OS versions......you no longer can just skip the Updates or wait a few weeks or months to do the Updates.......if your pc is running it will force the Update and also the reboot (I did alpha-beta testing for Microsoft and live NW of Redmond WA) the only time you don't have these Updates is installing from a disc and THEN you will be given all the Updates and either reboot yourself OR Microsoft will do the reboot......this is nothing like the previous versions and I have 4 Win 10's since alpha-beta and 2 Win7's and a XP Pro still running (I also did the alpha-beta tests for XP back in 2000 for Microsoft)
ID: 28402
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 28403 - Posted: 8 Jan 2017, 21:40:13 UTC - in response to Message 28402.  


How could this possibly be the reason why only a handful of Win 10 PCs had this problem, when all the ones on the stats page show that it was never a problem here or at vLHC or vLHC-dev or Atlas?

Microsoft never told any Insiders about this mysterious problem that only happened to a couple people trying these VB tasks.


Microsoft did inform everyone that they introduced an issue with update KB3206632. As that page states:


This update contains an issue that affects virtualization-based security (VBS). The issue is fixed in the following update:


So my guess is that, depending on some local configuration, this VBS issue is affecting some users but not others. Our investigation points to this being DNS related. However, if applying the KB3213522 update solves the issue for the original submitters, then there is no need to spend time investigating this further.
ID: 28403
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1112
Credit: 49,476,392
RAC: 6,583
Message 28404 - Posted: 8 Jan 2017, 22:03:16 UTC - in response to Message 28403.  

Microsoft did not say it was a VB task problem.

I showed the evidence in the stats that it was not the problem.
I also said how Windows 10 Updates are NOTHING like the previous Microsoft OS's

AND that I have worked with Microsoft for 18 years.

I don't use Linux and never give tips or info about that but I do know how Windows OS's work and have had mine since it was first tested and......well several members here and myself have been doing these VB tasks with a Windows OS since the very first day and did all the testing over these 6 years.

THIS is what was said at Microsoft about that Update last December 13th

https://support.microsoft.com/en-us/help/4004227/windows-10-update-kb3206632

Windows 10 Updates do not wait for a running computer to d/l or reboot.

And I know for a fact that members running these VB tasks with Win 10 didn't get their tasks Valid by getting that update or in case it is another reason tried......we do this around the world so it also is not because of how far from the server we could be.

My tip is to get a computer with Windows 10 and try it from a new install without the current updates or with all the updates......and you will find this to be a fact. (all your links just go to another thread here saying the same thing)
ID: 28404
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 28405 - Posted: 8 Jan 2017, 23:13:28 UTC - in response to Message 28402.  

Edit: Windows 10 Updates are nothing like the Updates on the previous OS versions......you no longer can just skip the Updates or wait a few weeks or months to do the Updates.......if your pc is running it will force the Update and also the reboot

There is a way around it *if* you are connected by WiFi -- simply tell Windows that you are on a metered connexion, and it won't force updates. I have seen mention of how to also do this on a wired connexion, but it needs a Registry hack.
ID: 28405
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,006,223
RAC: 17,221
Message 28409 - Posted: 9 Jan 2017, 10:58:04 UTC

Today I brought my laptop back to the office and now I am again having the problem mentioned in the title of this thread.

Originally I installed the vbox while I was in the office but had no success running any vbox tasks. Then I took the laptop home and it has worked there for the past couple of weeks without the missing heartbeat problem. Now in the office network all tasks are failing again.

So far two LHCb tasks have failed during the first few minutes and now a Theory task has run about an hour but the CPU usage is < 1% so I believe it will also fail eventually.

This laptop has Win7 x64, so the Win10 update bug doesn't apply here; in this case it is something in the network connection/firewall settings that is causing the error.

Here's a part of the stderr of the currently running Theory task:
2017-01-09 11:51:20 (14360): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2017-01-09 11:51:20 (14360): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2017-01-09 11:51:20 (14360): Guest Log: [DEBUG] 0
2017-01-09 11:51:20 (14360): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2017-01-09 11:51:31 (14360): Guest Log: [DEBUG] nc: connect to lhchomeproxy.cern.ch port 3125 (tcp) timed out: Operation now in progress
2017-01-09 11:51:31 (14360): Guest Log: nc: connect to lhchomeproxy.cern.ch port 3125 (tcp) timed out: Operation now in progress
2017-01-09 11:51:31 (14360): Guest Log: [DEBUG] 1
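# The two checks in this log can be approximated from the host with a short
# Python script. This is only a sketch: the hostnames and ports are the ones
# shown in the log above, the helper name port_open is hypothetical, and a
# test run on the host exercises the office firewall but not the VM's own
# NAT path, so the results may differ from what the guest sees.

```python
import socket


def port_open(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Endpoints taken from the task's stderr output.
CHECKS = [("cern.ch", 80), ("lhchomeproxy.cern.ch", 3125)]


def report(checks=CHECKS) -> None:
    """Print one line per endpoint; call this manually to run the checks."""
    for host, port in checks:
        state = "open" if port_open(host, port) else "blocked or unreachable"
        print(f"{host}:{port} -> {state}")
```

# Calling report() from the office network and again from home would show
# whether port 3125 is the one being filtered.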


I cannot find the "missing heartbeat" words in any of the logs or in the stderr.txt file but I presume that it will be shown in the task info when the result has been sent back to Cern.

I will wait and see if this task also fails and then stop doing any vbox tasks when I am in the office.
ID: 28409
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1263
Credit: 8,420,582
RAC: 5,321
Message 28410 - Posted: 9 Jan 2017, 13:10:48 UTC - in response to Message 28409.  

I will wait and see if this task also fails and then stop doing any vbox tasks when I am in the office.


At your office one or more ports to be used by the applications seem to be closed. Ask your network manager.

http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use
ID: 28410
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 28411 - Posted: 9 Jan 2017, 13:12:09 UTC - in response to Message 28409.  

It looks like your machine cannot connect to the squid proxy (lhchomeproxy.cern.ch) on port 3125. This may be due to the firewall settings for your office. The VM should shut down at this point with an error message rather than being killed by the heartbeat mechanism. I will investigate why it does not.
ID: 28411
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 28412 - Posted: 9 Jan 2017, 13:47:08 UTC - in response to Message 28404.  

From what I can tell, VBS is part of Device Guard so it may only affect those who are using this or configuring it a certain way.
ID: 28412
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,006,223
RAC: 17,221
Message 28415 - Posted: 10 Jan 2017, 7:35:42 UTC - in response to Message 28411.  

Thanks for the info.

The Theory task finished and validated after 18 hours with only about 1 minute of CPU time. I think I will leave the firewall as it is and just let the laptop run SixTrack tasks and other projects. I can only use one CPU core and no GPUs on it because it has poor cooling.
ID: 28415


©2024 CERN