Thread 'Missing heartbeat file errors'

Author	Message
Jesse Viviano Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 28302 - Posted: 31 Dec 2016, 15:57:22 UTC I have tried a VirtualBox downgrade to version 5.1.10 which previously worked, and that did not work. ID: 28302 · Reply Quote

Jesse Viviano Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 28304 - Posted: 1 Jan 2017, 23:42:48 UTC I am thinking of setting up Wireshark so I can capture the traffic from my Gigabit Ethernet port to allow the developers to try to debug what is going on. Is any one of the developers able to accept the resulting PCAP file for debugging? ID: 28304 · Reply Quote

Jesse Viviano Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 28305 - Posted: 2 Jan 2017, 1:58:59 UTC I have set up Wireshark, and have made sure to install Npcap beforehand in WinPcap compatibility mode so that Wireshark can use it. (Npcap is essentially WinPcap reworked to use NDIS 6 because Windows 10 deprecates NDIS 5 and might remove NDIS 5 at any time like it did in the Windows 10 beta which caused compatibility problems, and is also faster and more efficient. NDIS is a networking stack interface in Windows.) If a developer is ready for a PCAP, I will provide it. ID: 28305 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,018 RAC: 212	Message 28311 - Posted: 2 Jan 2017, 21:38:30 UTC - in response to Message 28301. "On a broader note, it seems to me that projects using HTCondor have not been set up having regard to the vagaries of common domestic internet connections, not UK ADSL at any rate. I'm sure that, if the various timeouts etc. could be suitably adjusted, and suspend/resume made a bit more robust, current projects could run as smoothly as the original LHC (SixTrack/T4T) The Gold Standard." Thanks for your constructive suggestions. This is something that we would definitely like to get to the bottom of. Progress is a little slow at the moment due to the annual closure. Just to clarify, the smooth operation of the original LHC applications in this situation is just an illusion. The heartbeat mechanism was added as a protection against VMs failing to boot or hanging. Without this, VMs would just continue running until the 24-hour time limit for the task was reached, and the task would be reported as a success, although in many cases with a low value for the CPU time. The issues are also unrelated to HTCondor, but may in some cases be related to CernVM and CVMFS. So although the perception may be that the applications have taken a step backwards, in fact detecting an error condition and reporting it as a failure is a step forward. The step we are now trying to take is to identify the causes so we can either improve the detection/error messages or at least give a good troubleshooting guide. Volunteer Computing takes us away from the comfort of our data centres and into the wild corners of the Internet. There will be lots of common situations that were just not possible to consider, and they can only be discovered through exploration. ID: 28311 · Reply Quote
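
As a rough illustration of the heartbeat idea described above (a sketch only, not the actual vboxwrapper or CernVM code): the guest periodically touches a file in the shared directory, and a monitor on the host treats the VM as failed if that file is missing or has not been updated recently. The function name and the 20-minute threshold in the usage note are assumptions chosen for illustration.

```python
import os
import time
from typing import Optional


def heartbeat_status(path: str, max_age_s: float, now: Optional[float] = None) -> str:
    """Classify a heartbeat file as 'missing', 'stale' or 'alive'.

    'stale' means the file exists but its modification time is more than
    max_age_s seconds in the past.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # The guest never wrote a heartbeat, e.g. the VM failed to boot.
        return "missing"
    return "stale" if now - mtime > max_age_s else "alive"
```

For example, `heartbeat_status("shared/heartbeat", max_age_s=20 * 60)` would report whether a file named heartbeat in a directory called shared has been touched within the last 20 minutes. Both the path and the threshold are hypothetical here.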

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,018 RAC: 212	Message 28312 - Posted: 2 Jan 2017, 21:50:56 UTC - in response to Message 28305. "I have set up Wireshark, and have made sure to install Npcap beforehand in WinPcap compatibility mode so that Wireshark can use it. (Npcap is essentially WinPcap reworked to use NDIS 6 because Windows 10 deprecates NDIS 5 and might remove NDIS 5 at any time like it did in the Windows 10 beta which caused compatibility problems, and is also faster and more efficient. NDIS is a networking stack interface in Windows.) If a developer is ready for a PCAP, I will provide it." Thanks for your efforts, but it should not be necessary for you to go to all this trouble. My personal view is that if we have to resort to such methods, it is a red flag that we either haven't done the correct diagnostics or haven't provided adequate tooling. We should try to isolate the test case and make it reproducible. I would suggest starting from the original CernVM image file. Download this image and use it to create a 64-bit Linux VM in VirtualBox. If you start the VM, it should boot to the command prompt. Please let us know the result. ID: 28312 · Reply Quote

Jesse Viviano Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 28314 - Posted: 2 Jan 2017, 22:59:53 UTC - in response to Message 28312. Last modified: 2 Jan 2017, 23:41:18 UTC It goes to the screen with the following text requesting a login, not a command prompt: Welcome to CERN Virtual Machine, version 3.6.5.15 based on Scientific Linux release 6.8 (Carbon) Kernel 4.1.35-25.cernvm.x86_64 on an x86_64 IP Address of this VM: 10.0.2.15 In order to apply cernvm-online context, use #<PIN> as user name. localhost login: _ In short, it looks just like the screen that displays when an ATLAS@home task is properly functioning. ID: 28314 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2285 Credit: 178,825,305 RAC: 571	Message 28317 - Posted: 3 Jan 2017, 9:03:56 UTC It's the same for me. Shared Folders are not mounted. vbox.log and VBoxHardening.log are shown. I have mounted VBoxGuestAdditions manually under IDE Secondary Master. ID: 28317 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,018 RAC: 212	Message 28320 - Posted: 3 Jan 2017, 19:08:55 UTC - in response to Message 28314. "It goes to the screen with the following text requesting a login, not a command prompt:" Your terminology is correct. It should boot to the login prompt. "In short, it looks just like the screen that displays when an ATLAS@home task is properly functioning." Great! This suggests that the VirtualBox installation, machine, network and CernVM are all working correctly. Now try repeating the test, but this time add this ISO image to the CDROM drive before you start the VM. If it reaches the login prompt, you can now log in with the user name boinc and password debug. ID: 28320 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2285 Credit: 178,825,305 RAC: 571	Message 28321 - Posted: 3 Jan 2017, 19:33:43 UTC - in response to Message 28320. Last modified: 3 Jan 2017, 19:48:40 UTC "If it reaches the login prompt, you can now login with the user name boinc and password debug." Booting is OK, with the same login screen as yesterday: localhost login: boinc Password: After entering the password debug, the following line is shown: [boinc@localhost ~]$ Is this ok? Edit: Storage in VM IDE Primary Master: ucernvm-prod.2.7-7.cernvm.x86_64.vdi (Normal 20,00 GB) IDE Primary Slave: [Optical Drive] context.iso (356,00 KB) IDE Secondary Master: [Optical Drive] VBoxGuestAdditions.iso (56,66 MB) ID: 28321 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,018 RAC: 212	Message 28323 - Posted: 3 Jan 2017, 19:50:26 UTC - in response to Message 28321. Is this ok? Yep. Here is the next step. Create a shared directory for that VM. It should have the name shared. Place the init_data.xml file, which you can harvest from one of the slot directories from the BOINC client state directory when running the Theory app from LHC@home, into that shared directory on your machine. Once you have done that, run the following command from the command prompt of your VM. /cvmfs/grid.cern.ch/vc/sbin/bootstrap ID: 28323 · Reply Quote
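
The "harvest init_data.xml" step above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the BOINC data directory contains a `slots` subdirectory with one numbered folder per running task, it simply takes the first init_data.xml it finds (it does not check that the slot belongs to a Theory task), and the function name is hypothetical.

```python
import shutil
from pathlib import Path
from typing import Optional


def harvest_init_data(boinc_dir: Path, shared_dir: Path) -> Optional[Path]:
    """Copy the first slots/*/init_data.xml found under boinc_dir into shared_dir.

    Returns the path of the copy, or None if no slot contains the file.
    """
    candidates = sorted(boinc_dir.glob("slots/*/init_data.xml"))
    if not candidates:
        return None
    # Create the host-side shared directory if it does not exist yet.
    shared_dir.mkdir(parents=True, exist_ok=True)
    target = shared_dir / "init_data.xml"
    shutil.copy2(candidates[0], target)
    return target
```

The location of the BOINC data directory differs between installations, so it is passed in rather than guessed. The host directory given as `shared_dir` still has to be attached to the VM as a shared folder named shared.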

maeax Send message Joined: 2 May 07 Posts: 2285 Credit: 178,825,305 RAC: 571	Message 28324 - Posted: 3 Jan 2017, 20:32:10 UTC Last modified: 3 Jan 2017, 20:45:28 UTC last line in VBox.log 00:01:18.228983 VMMDev: Guest Log: 00:00:00.097216 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared" last lines in command line of VM: cp: cannot create regular file /etc/grid-security/certificates/e3ed16b8.signing_policy: Permission denied sed: can't read /var/lib/boinc/shared/init_data.xml: No such file or directory 21:22:39 +0100 2017-01-03 [INFO] Reading volunteer information ERROR init_data.xml not found in /var/lib/boinc/shared EDIT: saw this line also: cp: /cvmfs/grid.cern.ch/etc/grid-security/certificates/seegrid-ca-2013.signing_policy and /etc/grid-security/certificates/seegrid-ca-2013.signing_policy are the same file ID: 28324 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,018 RAC: 212	Message 28325 - Posted: 3 Jan 2017, 21:11:19 UTC - in response to Message 28324. Last modified: 3 Jan 2017, 21:11:33 UTC Sorry, try running sudo /cvmfs/grid.cern.ch/vc/sbin/bootstrap ID: 28325 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2285 Credit: 178,825,305 RAC: 571	Message 28326 - Posted: 3 Jan 2017, 21:31:59 UTC - in response to Message 28325. Last modified: 3 Jan 2017, 21:50:30 UTC This is nice: vccondor01.cern.ch 9618 tcp/condor succeeded! Last lines at the moment: boinc@localhost 22:25:49 +0100 2017-01-03 {DEBUG] HTCondor ping 22:25:51 +0100 2017-01-03 {DEBUG] 0 EDIT: VBox.log shows: 00:06:43.720852 VMMDev: Guest Log: [INFO] New Job Starting in slot1 00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1 00:06:48.909850 VMMDev: Guest Log: [INFO] MCPlots JobID: 34585362 in slot1 ID: 28326 · Reply Quote
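
The "succeeded!" line above indicates that a TCP connection to vccondor01.cern.ch on port 9618 could be opened from inside the VM. For anyone who wants to repeat a similar reachability check from the host, here is a minimal sketch. The function name and the default timeout are my own choices and are not taken from the bootstrap script.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host name and tries each address.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `can_connect("vccondor01.cern.ch", 9618)` would then test the same host and port that appear in the output above. A True result only shows that the port is reachable, not that HTCondor itself is healthy.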

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,018 RAC: 212	Message 28327 - Posted: 3 Jan 2017, 22:36:33 UTC - in response to Message 28326. Last modified: 3 Jan 2017, 22:38:06 UTC It looks like it is working. It should run for up to 18 hours. It will not shut down but will write a file in the shared directory when done. If this works but BOINC does not, the cause will be either the configuration of the VM or the image. ID: 28327 · Reply Quote

Jesse Viviano Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 28328 - Posted: 3 Jan 2017, 23:31:38 UTC - in response to Message 28327. I just ran these tests as well, and it looks like the job successfully starts. It works, but not in BOINC. ID: 28328 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 1,451	Message 28329 - Posted: 3 Jan 2017, 23:57:46 UTC - in response to Message 28328. I just ran these tests as well, and it looks like the job successfully starts. It works, but not in BOINC. Thanks, Jesse. Progress at last, more info for the debugging! ID: 28329 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2285 Credit: 178,825,305 RAC: 571	Message 28330 - Posted: 4 Jan 2017, 9:55:39 UTC The shared directory now shows a new file named heartbeat, with zero bytes. Its modification date changes every minute. The second file is init_data.xml from yesterday. ID: 28330 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1531 Credit: 10,033,088 RAC: 1,183	Message 28331 - Posted: 4 Jan 2017, 10:11:58 UTC - in response to Message 28327. Laurence wrote: "It looks like it is working. It should run for up to 18 hours. It will not shut down but will write a file in the shared directory when done." I think it will stop requesting more jobs after 12 hours (if not killed by the user), once the job running at that point has finished. Normally a shutdown file will be created in the shared folder, but the VM will not stop or shut down because no vboxwrapper is involved. For the same reason, it will not stop after the 18-hour lifetime either. ID: 28331 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2285 Credit: 178,825,305 RAC: 571	Message 28334 - Posted: 4 Jan 2017, 10:55:00 UTC - in response to Message 28331. Thanks CP, the shutdown file was in the shared directory at 10:06 UTC, after 12 hours and 47 minutes.
00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1
00:06:48.909850 VMMDev: Guest Log: [INFO] MCPlots JobID: 34585362 in slot1
01:53:46.844961 VMMDev: Guest Log: [INFO] Condor JobID: 829728. 0 in slot1
01:53:51.955572 VMMDev: Guest Log: [INFO] MCPlots JobID: 34587449 in slot1
02:31:35.783722 VMMDev: Guest Log: [INFO] Condor JobID: 830484. 0 in slot1
02:31:40.903547 VMMDev: Guest Log: [INFO] MCPlots JobID: 34587901 in slot1
04:06:21.272829 VMMDev: Guest Log: [INFO] Condor JobID: 832066. 0 in slot1
04:06:26.384362 VMMDev: Guest Log: [INFO] MCPlots JobID: 34589785 in slot1
04:31:23.786617 VMMDev: Guest Log: [INFO] Condor JobID: 832458. 0 in slot1
04:31:28.881326 VMMDev: Guest Log: [INFO] MCPlots JobID: 34590022 in slot1
05:19:14.461709 VMMDev: Guest Log: [INFO] Condor JobID: 833167. 0 in slot1
05:19:22.614642 VMMDev: Guest Log: [INFO] MCPlots JobID: 34590732 in slot1
08:54:39.781990 VMMDev: Guest Log: [INFO] Condor JobID: 836396. 0 in slot1
08:54:44.876285 VMMDev: Guest Log: [INFO] MCPlots JobID: 34593868 in slot1
11:00:16.891918 VMMDev: Guest Log: [INFO] Condor JobID: 838155. 0 in slot1
11:00:22.001115 VMMDev: Guest Log: [INFO] MCPlots JobID: 34595895 in slot1
11:34:09.327877 VMMDev: Guest Log: [INFO] Condor JobID: 838672. 0 in slot1
11:34:14.456055 VMMDev: Guest Log: [INFO] MCPlots JobID: 34596135 in slot1
12:36:56.543292 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
12:46:23.065507 VMMDev: Guest Log: [INFO] Condor exited with return value N/A.
12:46:23.126045 VMMDev: Guest Log: [INFO] Shutting Down.
12:47:50.377031 GUI: UIMediumEnumerator: Medium-enumeration started...
12:47:50.610321 GUI: UIMediumEnumerator: Medium-enumeration finished!
I am shutting the VM down myself now.
ID: 28334 · Reply Quote
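
The guest log entries in the post above are regular enough to summarise with a few lines of code. The following is a sketch only: it assumes exactly the line layout shown in this thread (an uptime timestamp, then "VMMDev: Guest Log: [INFO] Condor JobID: ..."), tolerates the space that appears after the dot in the job ID as pasted here, and uses names of my own invention.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g. "00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1"
JOB_RE = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.\d+\s+VMMDev: Guest Log: \[INFO\] "
    r"Condor JobID:\s*(\d+)\.\s*(\d+) in (slot\d+)"
)


def condor_jobs(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (VM uptime in whole seconds, job ID, slot) for each Condor job line."""
    jobs = []
    for line in lines:
        m = JOB_RE.match(line.strip())
        if m:
            hours, minutes, seconds, cluster, proc, slot = m.groups()
            uptime_s = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            jobs.append((uptime_s, f"{cluster}.{proc}", slot))
    return jobs
```

Fed the log above one line at a time, this would yield one entry per Condor job, which makes it easy to see how many jobs ran and how far apart they started. MCPlots lines and other messages are ignored.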

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 254,018 RAC: 212	Message 28346 - Posted: 4 Jan 2017, 19:55:11 UTC - in response to Message 28334. Last modified: 4 Jan 2017, 23:16:24 UTC Perfect! Now take that very same VM, eject the CDROM image and replace the hard disk image with the Theory_2016_11_02.vdi from this project. Start the VM and see what happens. Note that to use the image directly in this way you need to provide the shared directory with the init_data.xml file. ID: 28346 · Reply Quote