Missing heartbeat file errors
Author | Message |
---|---|
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
Has anyone, such as a project administrator, developer, or scientist, who has a computer that can successfully process work units tried a project reset after letting that computer's LHC@home queue drain completely? If a file is missing in the VM or on CVMFS, I am guessing that work units will begin to fail after such a reset, because the copy of the missing file that was in the machine's cache before the reset will be wiped by it. Such an administrator, developer, or scientist could then examine the problem and start debugging it. If work units continue to process successfully, there might be something on the endpoints that this project is incompatible with. |
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
Has anyone like a project administrator, developer, or scientist who has a computer that can successfully process work units tried a project reset after letting that computer's LHC@home queue drain completely? Or maybe a volunteer tester? ;) I had just started 4 new VMs when I read your message. I won't wait over 12 hours to do this for you. I'll let the 4 running jobs finish, then abort the tasks and reset the project. This test should be even better because, just like you, I'm not located at CERN. I will test with only Theory selected in my preferences and request 4 tasks at once. By the way, I'm connected to this project via the SSL URL https://lhcathome.cern.ch/lhcathome/ in case you're not. TTYL |
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
All 4 tasks started fine and are processing events after the project reset.
456 LHC@home 23 Dec 09:24:31 Resetting project
468 LHC@home 23 Dec 09:25:56 work fetch resumed by user
469 LHC@home 23 Dec 09:25:57 update requested by user
470 LHC@home 23 Dec 09:25:58 Master file download succeeded
471 LHC@home 23 Dec 09:26:03 Sending scheduler request: Requested by user.
472 LHC@home 23 Dec 09:26:03 Requesting new tasks for CPU
473 LHC@home 23 Dec 09:26:04 Scheduler request completed: got 4 new tasks
474 LHC@home 23 Dec 09:26:06 Started download of vboxwrapper_26196_windows_x86_64.exe
475 LHC@home 23 Dec 09:26:06 Started download of Theory_2016_10_05.xml
476 LHC@home 23 Dec 09:26:09 Finished download of Theory_2016_10_05.xml
477 LHC@home 23 Dec 09:26:09 Started download of Theory_2016_11_02.vdi
478 LHC@home 23 Dec 09:26:10 work fetch suspended by user
479 LHC@home 23 Dec 09:26:10 Finished download of vboxwrapper_26196_windows_x86_64.exe
480 LHC@home 23 Dec 09:26:10 Started download of vboxwrapper_26196_windows_x86_64.pdb
481 LHC@home 23 Dec 09:26:12 Finished download of vboxwrapper_26196_windows_x86_64.pdb
486 LHC@home 23 Dec 09:26:54 Finished download of Theory_2016_11_02.vdi
498 LHC@home 23 Dec 09:28:01 Starting task Theory_26330_1482478079.592544_0
500 LHC@home 23 Dec 09:29:45 Starting task Theory_28266_1482478681.135884_0
502 LHC@home 23 Dec 09:30:17 Starting task Theory_29238_1482478981.344183_0
504 LHC@home 23 Dec 09:33:50 Starting task Theory_29239_1482478981.375142_0 |
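For anyone who wants to compare their own event log against the one above, here is a minimal Python sketch that parses a few of those lines and measures how long it took from the project reset to the first task start. The line format is inferred only from the excerpt in this post, the helper names are hypothetical, and the year is an assumption (2016, suggested by the file names in the log), since the log timestamps carry no year.

```python
import re
from datetime import datetime

# A few event-log lines copied from the post above:
# sequence number, project name, timestamp, message.
LOG = """\
456 LHC@home 23 Dec 09:24:31 Resetting project
473 LHC@home 23 Dec 09:26:04 Scheduler request completed: got 4 new tasks
486 LHC@home 23 Dec 09:26:54 Finished download of Theory_2016_11_02.vdi
498 LHC@home 23 Dec 09:28:01 Starting task Theory_26330_1482478079.592544_0
504 LHC@home 23 Dec 09:33:50 Starting task Theory_29239_1482478981.375142_0
"""

# Format inferred from the excerpt only; other clients may log differently.
LINE_RE = re.compile(r"^(\d+) (\S+) (\d{1,2} \w{3} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_log(text, year=2016):
    """Return (sequence, project, datetime, message) tuples.

    The year is an assumption: the log itself does not record it.
    """
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log line
        seq, project, stamp, message = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %d %b %H:%M:%S")
        events.append((int(seq), project, when, message))
    return events


def seconds_between(events, first_prefix, second_prefix):
    """Seconds from the first message starting with first_prefix to the
    first message starting with second_prefix."""
    start = next(e[2] for e in events if e[3].startswith(first_prefix))
    end = next(e[2] for e in events if e[3].startswith(second_prefix))
    return (end - start).total_seconds()


events = parse_log(LOG)
print(seconds_between(events, "Resetting project", "Starting task"))  # 210.0
```

In this log the first task started three and a half minutes after the reset, which includes re-downloading the wrapper and the VM image.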
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,445 RAC: 2,437 |
Hmm, no, that was a 400 not a 404. Guess I messed up. Yes, I might have. At least one of my GET requests seemed to hang until I hit carriage-return again. |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I was connected to the HTTPS URL. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,445 RAC: 2,437 |
Has anyone like a project administrator, developer, or scientist who has a computer that can successfully process work units tried a project reset after letting that computer's LHC@home queue drain completely? OK, I've just done a project reset on my 12-core (whose official name is so convoluted that I find it easier to remember its IP address...). Let's see if it starts having problems. |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I had another possible hypothesis: After my VDSL was swapped out with gigabit fiber, I noticed that applications that tried to geolocate my router's public IP address often failed. I also noticed some posts at https://lhcathome.cern.ch/vLHCathome/forum_thread.php?id=1933 and https://lhcathome.cern.ch/vLHCathome/forum_thread.php?id=1934 which try to direct users to either CERN or FNAL depending on location. Could this be what is causing my failure? |
Send message Joined: 28 Sep 04 Posts: 728 Credit: 48,829,131 RAC: 20,906 |
I brought the problematic laptop home and fired it up. It connected to my wifi and, while I was not paying attention to it, it connected to LHC@home and downloaded a Theory task. It had started working on the task, and Windows Firewall had popped up a message saying that it had blocked vbox from connecting to the network. That never happened when I was using the laptop in the office, where all vbox tasks failed. I allowed the connection; by then vbox had already been crunching the Theory task for almost an hour. I didn't find any problem in the logs and left it crunching. The laptop has now been crunching the Theory task for a couple of hours. Let's see if it can finish the task successfully. |
Send message Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 51,175 |
Normal WU logs include the following lines: Guest Log: 00:00:00.041603 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared" JV's logs include only the first line: Guest Log: 00:00:00.032028 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared" This suggests that mounting the shared folder does not fully complete and therefore the heartbeat file can't be written. Then, after a timeout of 10 minutes, a watchdog closes the VM. To use shared folders it's mandatory to install the VirtualBox Extension Pack that matches the main software version. Could you check if there are remains of an older Extension Pack in your registry? Could you check the access rights to the BOINC folders (including parent dirs)? Are all of your BOINC folders located on the same partition or distributed among different partitions? |
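The watchdog behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual vboxwrapper code: the function name, the file name `heartbeat`, and the choice to measure from the VM start time when the file has never been written are all assumptions. Only the 10-minute timeout comes from the post.

```python
import os
import tempfile
import time


def should_shut_down(heartbeat_path, vm_started_at, timeout_s=600.0, now=None):
    """Decide whether a watchdog would give up on the VM.

    If the heartbeat file exists, its modification time is the last sign
    of life. If it was never written (for example because the shared
    folder is not usable), the clock runs from the VM start instead.
    """
    now = time.time() if now is None else now
    try:
        last_sign_of_life = os.path.getmtime(heartbeat_path)
    except OSError:
        last_sign_of_life = vm_started_at  # file missing or unreadable
    return (now - last_sign_of_life) > timeout_s


with tempfile.TemporaryDirectory() as shared:
    path = os.path.join(shared, "heartbeat")  # hypothetical file name
    started = time.time()

    # No heartbeat file yet: tolerated for the first 10 minutes only.
    print(should_shut_down(path, started, now=started + 60))    # False
    print(should_shut_down(path, started, now=started + 601))   # True

    # Simulate a heartbeat written 500 s after the VM started.
    with open(path, "w") as handle:
        handle.write("alive\n")
    os.utime(path, (started + 500, started + 500))
    print(should_shut_down(path, started, now=started + 601))   # False
    print(should_shut_down(path, started, now=started + 1200))  # True
```

The point of the sketch is that the watchdog cannot tell why the file is absent: a failed mount, a blocked network connection, and a stuck job inside the VM would all surface as the same "missing heartbeat file" error.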
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
How would I check the registry for remains of an old extension pack? The access rights of the BOINC folders look fine. The folders are on the same partition. |
Send message Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 51,175 |
The access rights of the BOINC folders look fine. o.k. How would I check the registry for remains of an old extension pack in the registry? Use regedit to check your registry. Search for keys related to VirtualBox. If there are version numbers that don't match your currently installed VirtualBox version, something is messed up. If so: - uninstall VirtualBox - recheck the registry - delete old VirtualBox keys if they are still there - install VirtualBox again |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I found nothing regarding inconsistent version numbers between my installation of VirtualBox and the VirtualBox Extension Pack like that. |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I have one more possible hypothesis based on the message I am replying to. Could there be something wrong with the replication between the CERN CVMFS repository and the FNAL CVMFS repository? |
Send message Joined: 28 Sep 04 Posts: 728 Credit: 48,829,131 RAC: 20,906 |
My laptop has now finished and validated its first Theory task and is now working on a CMS task. So in this case the heartbeat file errors were due to the firewall/ISP/network connection. Originally the computer could not start any vbox tasks in the office because of the problem in the thread title, but at home it started working after I accepted the communication in the Windows Firewall pop-up window. That window never showed up in the office. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,445 RAC: 2,437 |
|
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
An example that gets a file from CERN: I have repeated these experiments from my machine and have replicated the results (the first experiment gets a file while the second experiment gets an HTTP 404 Not Found error). Could someone please check the CVMFS to fix the HTTP 404 error I am getting for the file my work units are trying to retrieve? |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Could someone please check the CVMFS to fix the HTTP 404 error I am getting for the file my work units are trying to retrieve? AFAIK this is not how CVMFS works. The first file was a CVMFS metadata file, so it was found at that URL. However, the other file was not found because the file system does not map to URLs in that way. As CVMFS supports versioning, the real URL will probably be some random-looking URL containing a few hashes. |
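To illustrate why a file's path inside the file system does not appear in its download URL, here is a minimal Python sketch of content-addressed storage, the general technique the reply alludes to. The hash algorithm, the compression step, and the `data/xx/...` layout are assumptions chosen for illustration, not a statement of the exact CVMFS on-disk format, and `object_path` is a hypothetical helper.

```python
import hashlib
import zlib


def object_path(content: bytes) -> str:
    """Return a storage path derived only from the file's content.

    Assumed scheme for illustration: compress the content, hash the
    compressed bytes, and use the first two hex digits as a directory.
    """
    compressed = zlib.compress(content)
    digest = hashlib.sha1(compressed).hexdigest()
    return f"data/{digest[:2]}/{digest[2:]}"


version_1 = b"heartbeat interval = 60\n"
version_2 = b"heartbeat interval = 120\n"

# The file name never enters the computation, so requesting the
# human-readable path over HTTP finds nothing at that location.
print(object_path(version_1))
print(object_path(version_2))

# Two versions of the same file get different object paths, which is
# what lets old and new versions coexist on the server.
print(object_path(version_1) != object_path(version_2))
```

In a scheme like this, a separate catalog maps each path to the hash of its current content, so a client has to consult that catalog before it knows which URL to fetch.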
Send message Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 51,175 |
So what is the reason for the missing heartbeat file? 1. Wrong permissions of the shared folder? - Unrealistic, as the shared folder is part of the slot the WU runs in and this structure is set up when the WU starts. 2. Wrong VirtualBox installation? - Older extension pack after an update, ... 3. CVMFS problems, missing files etc.? - Why not for more users? 4. Unrecognized firewall issues? - Harri Liljeroos's tests point in this direction https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4052&postid=28241 - firewall messages may have been switched off in the past 5. Other errors that cause the heartbeat error as a side effect? - probably not caused by a slow internet connection, as Jesse Viviano's connection should be fast enough |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I just did a complete wipe and rebuild of my computer, and I am getting the same errors as before: the rebuild did not solve my CVMFS errors. I have to conclude that there is a server problem, a VirtualBox version problem, a client software problem, or a problem with AT&T Fiber. |
Send message Joined: 6 Sep 08 Posts: 118 Credit: 12,560,503 RAC: 318 |
The results currently in the project database for my hosts show: Theory: 36 OK, 3 missing heartbeat file, 1 failed HTCondor ping. CMS: 6 OK, 1 missing heartbeat file. (There are one or two on the old vLHC 32-bit Theory tasks, too, but I've been changing things around there.) From odd comments made in these forums I think that it affects many users. So what is the reason for the missing heartbeat file? From admittedly unscientific observations here it seems that: 1. Whilst not caused by low internet speed (although this must only be true up to a point), it is related to network activity. Maybe DNS is slow or unreliable, timeouts mount up, or routers need to forward DNS requests. If there are multiple hosts, starting them at intervals helps. 2. The way that particular routers handle NAT matters, for example the time for which incoming connections are accepted. Some people have opened the appropriate ports in their firewall rules, so this may not be a problem for everyone. ISPs using carrier-grade NAT to eke out IPv4 addresses probably don't help. 3. The router handles only a limited number of simultaneous connections. I think mine is limited to 5000 or 6000, after which it starts dropping packets, although I haven't noticed any failures from this. On a broader note, it seems to me that projects using HTCondor have not been set up with regard to the vagaries of common domestic internet connections, not UK ADSL at any rate. I'm sure that, if the various timeouts etc. could be suitably adjusted, and suspend/resume made a bit more robust, current projects could run as smoothly as the original LHC (SixTrack/T4T), the gold standard. |
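One way to test the slow-or-unreliable-DNS suspicion from point 1 is to time a handful of lookups from the affected host. The following is a minimal Python sketch using only the standard library; the helper names and the number of attempts are arbitrary choices, and the host name in the usage note below is just the project host already mentioned in this thread.

```python
import socket
import time


def time_lookup(host, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Resolve host once and return (succeeded, elapsed_seconds)."""
    start = clock()
    try:
        resolver(host, None)
        succeeded = True
    except OSError:  # socket.gaierror is a subclass of OSError
        succeeded = False
    return succeeded, clock() - start


def summarize(host, attempts=5, resolver=socket.getaddrinfo):
    """Repeat the lookup and report failures and the slowest attempt."""
    results = [time_lookup(host, resolver=resolver) for _ in range(attempts)]
    return {
        "attempts": attempts,
        "failures": sum(1 for ok, _ in results if not ok),
        "slowest_s": max(elapsed for _, elapsed in results),
    }


if __name__ == "__main__":
    # "localhost" normally resolves without touching the network;
    # substitute the host you actually want to test.
    print(summarize("localhost"))
```

Running `summarize("lhcathome.cern.ch")` on a host that shows the heartbeat errors and on one that does not would give a rough comparison. Repeated lookups may be answered from a local cache, so the first attempt is usually the most telling.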
©2024 CERN