Message boards : Number crunching : Missing heartbeat file errors

Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28215 - Posted: 23 Dec 2016, 2:08:56 UTC
Last modified: 23 Dec 2016, 2:11:12 UTC

Has anyone like a project administrator, developer, or scientist who has a computer that can successfully process work units tried a project reset after letting that computer's LHC@home queue drain completely? If a file is missing in the VM or on CVMFS, I am guessing that work units will begin to fail after such a reset, because the copy of that file held in the machine's cache will be wiped by the reset. Such an administrator, developer, or scientist could then examine the problem and start debugging it. If work units continue to process successfully, there might be something on the endpoints that this project is incompatible with.
ID: 28215
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 28216 - Posted: 23 Dec 2016, 6:59:39 UTC - in response to Message 28215.  
Last modified: 23 Dec 2016, 7:05:55 UTC

Has anyone like a project administrator, developer, or scientist who has a computer that can successfully process work units tried a project reset after letting that computer's LHC@home queue drain completely?
...

Or maybe a Volunteer tester? ;)
I just started up 4 new VMs and then read your message.
I won't wait over 12 hours to do this for you.
I'll let the 4 running jobs finish, abort the tasks, and reset the project.
This test should be even better because, just like you, I'm not located at CERN.
I will test with only Theory selected in my preferences and request 4 tasks at once.
By the way, I'm connected to this project with the SSL URL https://lhcathome.cern.ch/lhcathome/ in case you're not.
TTYL
ID: 28216
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 28217 - Posted: 23 Dec 2016, 8:39:43 UTC

All 4 tasks started fine and are processing events after the project reset.

456 LHC@home 23 Dec 09:24:31 Resetting project
468 LHC@home 23 Dec 09:25:56 work fetch resumed by user
469 LHC@home 23 Dec 09:25:57 update requested by user
470 LHC@home 23 Dec 09:25:58 Master file download succeeded
471 LHC@home 23 Dec 09:26:03 Sending scheduler request: Requested by user.
472 LHC@home 23 Dec 09:26:03 Requesting new tasks for CPU
473 LHC@home 23 Dec 09:26:04 Scheduler request completed: got 4 new tasks
474 LHC@home 23 Dec 09:26:06 Started download of vboxwrapper_26196_windows_x86_64.exe
475 LHC@home 23 Dec 09:26:06 Started download of Theory_2016_10_05.xml
476 LHC@home 23 Dec 09:26:09 Finished download of Theory_2016_10_05.xml
477 LHC@home 23 Dec 09:26:09 Started download of Theory_2016_11_02.vdi
478 LHC@home 23 Dec 09:26:10 work fetch suspended by user
479 LHC@home 23 Dec 09:26:10 Finished download of vboxwrapper_26196_windows_x86_64.exe
480 LHC@home 23 Dec 09:26:10 Started download of vboxwrapper_26196_windows_x86_64.pdb
481 LHC@home 23 Dec 09:26:12 Finished download of vboxwrapper_26196_windows_x86_64.pdb
486 LHC@home 23 Dec 09:26:54 Finished download of Theory_2016_11_02.vdi
498 LHC@home 23 Dec 09:28:01 Starting task Theory_26330_1482478079.592544_0
500 LHC@home 23 Dec 09:29:45 Starting task Theory_28266_1482478681.135884_0
502 LHC@home 23 Dec 09:30:17 Starting task Theory_29238_1482478981.344183_0
504 LHC@home 23 Dec 09:33:50 Starting task Theory_29239_1482478981.375142_0
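
As a rough aid for reading an event log like the one above, the following Python sketch pairs "Started download of" and "Finished download of" lines and reports how long each file took. The line layout is assumed from this excerpt only, and `download_durations` is a hypothetical helper, not part of BOINC. Downloads that run past midnight are not handled.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "<n> <project> <dd Mon HH:MM:SS> <message>"
LINE_RE = re.compile(
    r"^\d+\s+\S+\s+(\d{1,2} \w{3} \d{2}:\d{2}:\d{2})\s+"
    r"(Started|Finished) download of (\S+)$"
)

def download_durations(lines):
    """Return {filename: seconds} for downloads with both a start and a finish line."""
    started, durations = {}, {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, event, name = match.groups()
        # The log carries no year; any fixed year works because only differences matter.
        when = datetime.strptime("2016 " + stamp, "%Y %d %b %H:%M:%S")
        if event == "Started":
            started[name] = when
        elif name in started:
            durations[name] = (when - started[name]).total_seconds()
    return durations

log = [
    "477 LHC@home 23 Dec 09:26:09 Started download of Theory_2016_11_02.vdi",
    "486 LHC@home 23 Dec 09:26:54 Finished download of Theory_2016_11_02.vdi",
]
print(download_durations(log))  # {'Theory_2016_11_02.vdi': 45.0}
```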
ID: 28217
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,445
RAC: 2,437
Message 28218 - Posted: 23 Dec 2016, 9:02:10 UTC - in response to Message 28213.  

Hmm, no, that was a 400 not a 404. Guess I messed up.

Did you hit "return" twice after the last input line?

GET http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch/vc/sbin/bootstrap HTTP/1.0<hit return>
<hit return again>


I can't reproduce the 400.

Yes, I might have. At least one of my GET requests seemed to hang until I hit carriage-return again.
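
The hanging request described here is consistent with how HTTP framing works: the header section of a request ends with an empty line, so the server keeps waiting until it has seen two consecutive line breaks. A minimal Python sketch of that rule follows. `build_request` and `is_complete` are hypothetical helper names; only the URL comes from the thread.

```python
def build_request(url, host):
    """Build a minimal HTTP/1.0 proxy-style GET request as bytes."""
    lines = [
        f"GET {url} HTTP/1.0",
        f"Host: {host}",
        "",  # empty line: marks the end of the headers
        "",  # makes the joined text end with CRLF CRLF
    ]
    return "\r\n".join(lines).encode("ascii")

def is_complete(raw):
    """True once the header section has been terminated by an empty line."""
    return b"\r\n\r\n" in raw or b"\n\n" in raw

request = build_request(
    "http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch/vc/sbin/bootstrap",
    "cvmfs-stratum-one.cern.ch",
)
print(is_complete(request))                   # True
print(is_complete(b"GET / HTTP/1.0\r\n"))     # False: only one line break so far
```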
ID: 28218
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28230 - Posted: 23 Dec 2016, 15:36:39 UTC - in response to Message 28216.  

I was connected to the HTTPS URL.
ID: 28230
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,445
RAC: 2,437
Message 28232 - Posted: 23 Dec 2016, 17:09:32 UTC - in response to Message 28216.  

Has anyone like a project administrator, developer, or scientist who has a computer that can successfully process work units tried a project reset after letting that computer's LHC@home queue drain completely?
...

Or maybe a Volunteer tester? ;)
I just started up 4 new VMs and then read your message.
I won't wait over 12 hours to do this for you.
I'll let the 4 running jobs finish, abort the tasks, and reset the project.
This test should be even better because, just like you, I'm not located at CERN.
I will test with only Theory selected in my preferences and request 4 tasks at once.
By the way, I'm connected to this project with the SSL URL https://lhcathome.cern.ch/lhcathome/ in case you're not.
TTYL

OK, I've just done a project reset on my 12-core (whose official name is so convoluted that I find it easier to remember its IP address...). Let's see if it starts having problems.
ID: 28232
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28233 - Posted: 23 Dec 2016, 18:28:47 UTC

I have another possible hypothesis: after my VDSL was replaced with gigabit fiber, I noticed that applications that tried to geolocate my router's public IP address often failed. I also noticed some posts at https://lhcathome.cern.ch/vLHCathome/forum_thread.php?id=1933 and https://lhcathome.cern.ch/vLHCathome/forum_thread.php?id=1934 which describe directing users to either CERN or FNAL depending on location. Could this be what is causing my failure?
ID: 28233
Harri Liljeroos
Joined: 28 Sep 04
Posts: 728
Credit: 48,829,131
RAC: 20,906
Message 28234 - Posted: 23 Dec 2016, 19:03:50 UTC

I brought the problematic laptop home and fired it up. It connected to my Wi-Fi and, while I was not paying attention, it connected to LHC@home and downloaded a Theory task. It started working on the task, and Windows Firewall popped up a message that it had blocked vbox from connecting to the network. That never happened when I was using the laptop in the office, where all vbox tasks failed. By the time I allowed the connection, vbox had already been crunching the Theory task for almost an hour. I didn't find any problems in the logs and left it crunching.

Now the laptop has been crunching the Theory task for a couple of hours. Let's see if it can finish the task successfully.
ID: 28234
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2528
Credit: 253,722,201
RAC: 51,175
Message 28236 - Posted: 23 Dec 2016, 19:50:21 UTC

Normal WU logs include the following lines:

Guest Log: 00:00:00.041603 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared"
Guest Log: [INFO] Mounting the shared directory
Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor



JV's logs include only the first line:

Guest Log: 00:00:00.032028 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared"
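
A small Python sketch of the comparison above: it reports which of the three expected guest log messages are absent from a task's output. The marker strings are taken from the normal log quoted above; `missing_markers` is a hypothetical helper, and the log text would have to be copied from the task's result page by hand.

```python
EXPECTED_MARKERS = [
    'Shared folder "shared" was mounted to "/media/sf_shared"',
    "[INFO] Mounting the shared directory",
    "[INFO] Shared directory mounted, enabling vboxmonitor",
]

def missing_markers(log_text, markers=EXPECTED_MARKERS):
    """Return the expected messages that do not occur anywhere in log_text."""
    return [marker for marker in markers if marker not in log_text]

failing_log = (
    "Guest Log: 00:00:00.032028 automount VBoxServiceAutoMountWorker: "
    'Shared folder "shared" was mounted to "/media/sf_shared"\n'
)
for marker in missing_markers(failing_log):
    print("missing:", marker)
```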



It looks as if mounting the shared folder does not fully complete, so the heartbeat file can't be written. After a timeout of 10 minutes, a watchdog then shuts down the VM.
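
To illustrate the mechanism described here, a minimal Python sketch of a heartbeat check: the VM counts as dead when the heartbeat file has not been written or updated within the timeout. This is not vboxwrapper's actual code. The function name, its parameters, and the example path are assumptions; only the 10-minute timeout comes from this post.

```python
import os
import time

HEARTBEAT_TIMEOUT = 10 * 60  # seconds; the 10 minutes mentioned in the post

def heartbeat_expired(path, vm_started_at, now=None, timeout=HEARTBEAT_TIMEOUT):
    """True if no heartbeat has been seen for more than `timeout` seconds.

    If the file does not exist yet, the VM start time counts as the
    last sign of life, so a VM that never writes the file expires too.
    """
    now = time.time() if now is None else now
    try:
        last_sign_of_life = os.path.getmtime(path)
    except OSError:
        last_sign_of_life = vm_started_at
    return (now - last_sign_of_life) > timeout

# Hypothetical path; a VM started 11 minutes ago that never wrote its file:
print(heartbeat_expired("shared/heartbeat", vm_started_at=time.time() - 660))
```

With a check like this, a shared folder that is mounted but not writable looks exactly like a crashed VM, which would match the symptom in the thread title.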


To use shared folders it's mandatory to install the VirtualBox Extension Pack that matches the main software version.
Could you check whether there are remnants of an older Extension Pack in your registry?
Could you check the access rights to the BOINC folders (including parent dirs)?
Are all of your BOINC folders located on the same partition or distributed among different partitions?
ID: 28236
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28237 - Posted: 23 Dec 2016, 20:37:34 UTC - in response to Message 28236.  

How would I check the registry for remnants of an old extension pack?
The access rights of the BOINC folders look fine.
The folders are in the same partition.
ID: 28237
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2528
Credit: 253,722,201
RAC: 51,175
Message 28238 - Posted: 23 Dec 2016, 21:29:08 UTC - in response to Message 28237.  

The access rights of the BOINC folders look fine.
The folders are in the same partition.

o.k.

How would I check the registry for remnants of an old extension pack?

Use regedit to check your registry.
Search for keys related to VirtualBox.
If there are version numbers that don't match your currently installed VirtualBox version, something is messed up.

If so:
- uninstall VirtualBox
- recheck the registry
- delete old VirtualBox keys if they are still there
- install VirtualBox again
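
The manual check above amounts to comparing every version string found under VirtualBox-related keys with the installed version. A hedged Python sketch of just that comparison is below. Reading the registry itself is left out, the key names in the example are invented for illustration, and `stale_entries` is a hypothetical helper.

```python
def stale_entries(installed_version, found):
    """Return {key: version} for entries whose version differs from the installed one.

    `found` maps a registry key (or any label) to the version string seen there.
    """
    return {
        key: version
        for key, version in found.items()
        if version.strip() != installed_version.strip()
    }

# Illustrative data only: these key names and version numbers are made up.
found = {
    "example\\VirtualBox": "5.1.12",
    "example\\VirtualBox\\ExtensionPack": "5.0.28",
}
print(stale_entries("5.1.12", found))
```

An empty result would correspond to "nothing messed up" in the steps above; any entry returned is a candidate for the uninstall, clean, and reinstall sequence.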
ID: 28238
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28239 - Posted: 23 Dec 2016, 22:14:16 UTC - in response to Message 28238.  

I found no inconsistent version numbers between my installation of VirtualBox and the VirtualBox Extension Pack.
ID: 28239
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28240 - Posted: 24 Dec 2016, 1:20:28 UTC - in response to Message 28233.  

I have one more possible hypothesis based on the message I am replying to. Could there be something wrong with the replication between the CERN CVMFS repository and the FNAL CVMFS repository?
ID: 28240
Harri Liljeroos
Joined: 28 Sep 04
Posts: 728
Credit: 48,829,131
RAC: 20,906
Message 28241 - Posted: 24 Dec 2016, 10:48:55 UTC

My laptop has now finished and validated its first Theory task and is now working on a CMS task. So in this case the heartbeat file errors were due to the firewall/ISP/network connection. Originally the computer could not start any vbox tasks in the office because of the problem in the thread title, but at home it started working after I accepted the communication in the Windows Firewall pop-up window. That window never showed up in the office.
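
For anyone who wants to compare two networks the same way, here is a minimal Python sketch that checks whether an outbound TCP connection can be opened at all. `can_connect` is a hypothetical helper. Note the limits: it runs on the host, not inside the VM, and Windows Firewall rules are per application, so a success here does not prove that vbox traffic is allowed.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """True if a TCP connection to (host, port) can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name resolution.
        return False
```

A call such as `can_connect("lhchomeproxy.cern.ch", 3125)`, using the proxy host and port from the telnet example later in this thread, could be run once in the office and once at home to see whether the two networks differ.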
ID: 28241
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,445
RAC: 2,437
Message 28242 - Posted: 24 Dec 2016, 11:34:42 UTC - in response to Message 28232.  

OK, I've just done a project reset on my 12-core (whose official name is so convoluted that I find it easier to remember its IP address...). Let's see if it starts having problems.


The machine is running tasks successfully.
ID: 28242
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28286 - Posted: 28 Dec 2016, 18:51:22 UTC - in response to Message 28184.  

An example that gets a file from CERN:

telnet lhchomeproxy.cern.ch 3125
Trying 128.142.168.203...
Connected to lhchomeproxy.cern.ch.
Escape character is '^]'.
GET http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch/.cvmfspublished HTTP/1.0
Host: cvmfs-stratum-one.cern.ch

HTTP/1.1 200 OK
Date: Wed, 21 Dec 2016 16:43:59 GMT
Accept-Ranges: bytes
Content-Length: 515
Content-Type: application/x-cvmfs
Server: Apache/2.4.6 (CentOS) mod_wsgi/3.4 Python/2.7.5
Expires: Wed, 21 Dec 2016 16:46:07 GMT
Cache-Control: max-age=120
X-Cache: MISS from front08.cern.ch
X-Cache-Lookup: HIT from front08.cern.ch:80
Age: 24
X-Cache: HIT from vocms0323.cern.ch/3
Via: 1.1 front08.cern.ch (squid/3.5.20), 1.1 vocms0323.cern.ch/3 (squid/frontier-squid-3.5.22-2.1)
Connection: close

Followed by the contents of .cvmfspublished
Connection closed by foreign host.


Now /cvmfs/grid.cern.ch/vc/sbin/bootstrap
telnet lhchomeproxy.cern.ch 3125
Trying 128.142.168.203...
Connected to lhchomeproxy.cern.ch.
Escape character is '^]'.
GET http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch/vc/sbin/bootstrap HTTP/1.0
Host: cvmfs-stratum-one.cern.ch

HTTP/1.1 404 Not Found
Date: Wed, 21 Dec 2016 16:47:47 GMT
Server: Apache/2.4.6 (CentOS) mod_wsgi/3.4 Python/2.7.5
Content-Length: 234
Content-Type: text/html; charset=iso-8859-1
X-Cache: MISS from front15.cern.ch
X-Cache-Lookup: MISS from front15.cern.ch:80
X-Cache: MISS from vocms0323.cern.ch/3
Via: 1.1 front15.cern.ch (squid/3.5.20), 1.1 vocms0323.cern.ch/3 (squid/frontier-squid-3.5.22-2.1)
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /cvmfs/grid.cern.ch/vc/sbin/bootstrap was not found on this server.</p>
</body></html>
Connection closed by foreign host.

Either an incomplete/wrong URL or the file is not where it should be.

I have repeated these experiments from my machine and have replicated the results (the first experiment gets a file while the second experiment gets an HTTP 404 Not Found error). Could someone please check the CVMFS to fix the HTTP 404 error I am getting for the file my work units are trying to retrieve?
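
The two telnet sessions differ in the first line of the response. A small Python sketch for pulling the status code out of a raw response like those above; `parse_status_line` is a hypothetical helper, and the sample status lines and content lengths are copied from the sessions in this post.

```python
def parse_status_line(raw_response):
    """Return (http_version, status_code, reason) from a raw HTTP response string."""
    first_line = raw_response.splitlines()[0]
    version, code, reason = first_line.split(" ", 2)
    return version, int(code), reason

print(parse_status_line("HTTP/1.1 200 OK\r\nContent-Length: 515\r\n\r\n"))
# ('HTTP/1.1', 200, 'OK')
print(parse_status_line("HTTP/1.1 404 Not Found\r\nContent-Length: 234\r\n\r\n"))
# ('HTTP/1.1', 404, 'Not Found')
```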
ID: 28286
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 380
Credit: 238,712
RAC: 0
Message 28292 - Posted: 29 Dec 2016, 23:20:12 UTC - in response to Message 28286.  
Last modified: 29 Dec 2016, 23:20:28 UTC

Could someone please check the CVMFS to fix the HTTP 404 error I am getting for the file my work units are trying to retrieve?


AFAIK this is not how CVMFS works. The first file was a CVMFS metadata file, so it was found at that URL. However, the other file was not found because the file system does not map to URLs in that way. As CVMFS supports versioning, the URL will probably be some opaque URL containing a few hashes.
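
To make the point about hashes concrete, a purely illustrative Python sketch of content-addressed storage: the object's location is derived from a hash of its content, not from its path in the mounted file system. The `data/<two hex digits>/<rest of hash>` layout and the use of SHA-1 are assumptions based on how CVMFS is commonly described; compression, hash suffixes, and the catalog lookup that maps a path to its hash are all left out. `object_path` is a hypothetical helper.

```python
import hashlib

def object_path(content, repo="grid.cern.ch"):
    """Illustrative content-addressed path: data/<2 hex chars>/<remaining hex chars>."""
    digest = hashlib.sha1(content).hexdigest()
    return f"/cvmfs/{repo}/data/{digest[:2]}/{digest[2:]}"

# Placeholder content; the real bootstrap script is not reproduced here.
print(object_path(b"#!/bin/sh\necho placeholder\n"))
```

Under such a scheme a plain GET for /cvmfs/grid.cern.ch/vc/sbin/bootstrap returns 404 even when the file exists in the mounted file system, because the server only knows the object by its hash.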
ID: 28292
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2528
Credit: 253,722,201
RAC: 51,175
Message 28293 - Posted: 30 Dec 2016, 9:43:13 UTC

So what is the reason for the missing heartbeat file?

1. Wrong permissions of the shared folder?
- Unrealistic as the shared folder is part of the slot the WU runs in and this structure is set up when the WU starts.

2. Wrong VirtualBox installation?
- Older extension pack after an update, ...

3. CVMFS problems, missing files etc.?
- Then why doesn't it affect more users?

4. Unrecognized firewall issues?
- Harri Liljeroos's tests point in this direction
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4052&postid=28241
- firewall messages may have been switched off in the past

5. Other errors that cause the heartbeat error as side effect?
- probably not caused by a slow internet connection, as Jesse Viviano's connection should be fast enough
ID: 28293
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28300 - Posted: 31 Dec 2016, 3:27:02 UTC
Last modified: 31 Dec 2016, 3:30:06 UTC

I just did a complete wipe and rebuild of my computer, and I am getting the same CVMFS errors as before. I have to conclude that there is a server problem, a VirtualBox version problem, a client software problem, or a problem with AT&T Fiber.
ID: 28300
m

Joined: 6 Sep 08
Posts: 118
Credit: 12,560,503
RAC: 318
Message 28301 - Posted: 31 Dec 2016, 15:13:53 UTC - in response to Message 28293.  
Last modified: 31 Dec 2016, 15:37:34 UTC

The results that happen to be on the project database at the moment for my hosts show:

36 theory OK
3 missing heartbeat file
1 failed HTCondor ping.

6 CMS OK
1 missing heartbeat file.

(There are one or two on the old vLHC 32-bit Theory tasks, too, but I've been changing things around there.)
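
For scale, the counts above can be turned into rates with a few lines of Python. The numbers are the ones listed in this post; the samples are small, so the percentages are only a rough indication.

```python
theory = {"ok": 36, "missing heartbeat file": 3, "failed HTCondor ping": 1}
cms = {"ok": 6, "missing heartbeat file": 1}

def share(counts, outcome):
    """Fraction of tasks in `counts` that ended with `outcome`."""
    return counts[outcome] / sum(counts.values())

for name, counts in (("Theory", theory), ("CMS", cms)):
    rate = share(counts, "missing heartbeat file")
    print(f"{name}: {rate:.1%} missing heartbeat file")
# Theory: 7.5% missing heartbeat file
# CMS: 14.3% missing heartbeat file
```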

From odd comments made in these forums I think that it affects many users.
So what is the reason for the missing heartbeat file?

From admittedly unscientific observations here it seems that:-

1. Whilst not caused by low internet speed (although this must only be true up to a point), it is related to network activity. Maybe DNS is slow or unreliable, timeouts mount up, or routers need to forward DNS requests. If there are multiple hosts, starting them at intervals helps.

2. The way that particular routers handle NAT, for example the time for which incoming connections are accepted. Some people have opened the appropriate ports in their firewall rules, so this may not be a problem for everyone. ISPs using carrier-grade NAT to eke out IPv4 addresses probably don't help.

3. The limited number of simultaneous connections handled by the router. I think mine is limited to 5000 or 6000, after which it starts dropping packets, although I haven't noticed any failures from this.
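
Two of the observations above are easy to try out. The Python sketch below times a single DNS lookup and computes staggered start delays for several hosts. Both helpers are hypothetical, the 120-second interval is an arbitrary example, and a lookup answered from a local cache will look faster than the network really is.

```python
import socket
import time

def time_dns_lookup(hostname):
    """Return the seconds taken to resolve `hostname`, or None if resolution fails."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return None
    return time.monotonic() - start

def staggered_delays(host_count, interval_seconds):
    """Start offsets in seconds so that hosts begin one interval apart."""
    return [index * interval_seconds for index in range(host_count)]

print(staggered_delays(3, 120))  # [0, 120, 240]
```

Running `time_dns_lookup("lhcathome.cern.ch")` a few times during task start-up would give a rough idea of whether name resolution is part of the delay.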

On a broader note, it seems to me that projects using HTCondor have not been set up with regard to the vagaries of common domestic internet connections, at least not UK ADSL. I'm sure that, if the various timeouts etc. could be suitably adjusted, and suspend/resume made a bit more robust, current projects could run as smoothly as the original LHC (SixTrack/T4T), the gold standard.
ID: 28301