Missing heartbeat file errors
Author | Message |
---|---|
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I tried downgrading VirtualBox to version 5.1.10, which previously worked, but that did not help. |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I am thinking of setting up Wireshark to capture the traffic on my Gigabit Ethernet port so that a developer can try to debug what is going on. Is any of the developers able to accept the resulting PCAP file for debugging? |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I have set up Wireshark, having first installed Npcap in WinPcap compatibility mode so that Wireshark can use it. (NDIS is the network driver interface in Windows. Npcap is essentially WinPcap reworked to use NDIS 6, because Windows 10 deprecates NDIS 5 and might remove it at any time, as happened in a Windows 10 beta, which caused compatibility problems. Npcap is also faster and more efficient.) If a developer is ready for a PCAP, I will provide it. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
"On a broader note, it seems to me that projects using HTCondor have not been set up having regard to the vagaries of common domestic internet connections, not UK ADSL at any rate. I'm sure that, if the various timeouts etc. could be suitably adjusted, and suspend/resume made a bit more robust, current projects could run as smoothly as the original LHC (SixTrack/T4T) The Gold Standard." Thanks for your constructive suggestions. This is something that we would definitely like to get to the bottom of. Progress is a little slow at the moment due to the annual closure. Just to clarify, the smooth operation of the original LHC applications in this situation is just an illusion. The heartbeat mechanism was added as a protection against VMs failing to boot or hanging. Without it, VMs would just continue running until the 24-hour time limit for the task was reached, and the task would be reported as a success, although in many cases with a low value for the CPU time. The issues are also unrelated to HTCondor, but may in some cases be related to CernVM and CVMFS. So although the perception may be that the applications have taken a step backwards, detecting an error condition and reporting it as a failure is in fact a step forward. The step we are now trying to take is to identify the causes, so that we can either improve the detection and error messages or at least give a good troubleshooting guide. Volunteer Computing takes us away from the comfort of our data centres and into the wild corners of the Internet. There will be lots of common situations that were just not possible to consider, and they can only be discovered through exploration. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
"I have set up Wireshark, and have made sure to install Npcap beforehand in WinPcap compatibility mode so that Wireshark can use it. (Npcap is essentially WinPcap reworked to use NDIS 6 because Windows 10 deprecates NDIS 5 and might remove NDIS 5 at any time like it did in the Windows 10 beta which caused compatibility problems, and is also faster and more efficient. NDIS is a networking stack interface in Windows.) If a developer is ready for a PCAP, I will provide it." Thanks for your efforts, but it should not be necessary for you to go to all this trouble. My personal view is that if we have to resort to such methods, it is a red flag that we either haven't done the correct diagnostics or haven't provided adequate tooling. We should try to isolate the test case and make it reproducible. I would suggest starting from the original CernVM image file. Download this image and use it to create a 64-bit Linux VM in VirtualBox. If you start the VM, it should boot to the command prompt. Please let us know the result. |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
It goes to the screen with the following text requesting a login, not a command prompt: "Welcome to CERN Virtual Machine, version 3.6.5.15". In short, it looks just like the screen that displays when an ATLAS@home task is properly functioning. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
It's the same for me. Shared folders are not mounted. vbox.log and VBoxHardening.log are shown. I have mounted VBoxGuestAdditions manually under IDE Secondary Master. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
"It goes to the screen with the following text requesting a login, not a command prompt:" Your terminology is correct. It should boot to the login prompt. Great! This suggests that the VirtualBox installation, machine, network and CernVM are all working correctly. Now try repeating the test, but this time add this iso image to the CDROM drive before you start the VM. If it reaches the login prompt, you can now log in with the user name boinc and password debug. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
"If it reaches the login prompt, you can now login with the user name boinc and password debug." Booting is OK, with the same login screen as yesterday: localhost login: boinc Password: After entering the password debug, the following line is shown: [boinc@localhost ~]$ Is this OK? Edit: Storage in the VM: IDE Primary Master: ucernvm-prod.2.7-7.cernvm.x86_64.vdi (Normal 20,00 GB); IDE Primary Slave: [Optical Drive] context.iso (356,00 KB); IDE Secondary Master: [Optical Drive] VBoxGuestAdditions.iso (56,66 MB) |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Yep. Here is the next step. Create a shared directory for that VM. It should have the name shared. Place the init_data.xml file, which you can harvest from one of the slot directories from the BOINC client state directory when running the Theory app from LHC@home, into that shared directory on your machine. Once you have done that, run the following command from the command prompt of your VM. /cvmfs/grid.cern.ch/vc/sbin/bootstrap |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
Last line in VBox.log: 00:01:18.228983 VMMDev: Guest Log: 00:00:00.097216 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared" Last lines on the command line of the VM: cp: cannot create regular file /etc/grid-security/certificates/e3ed16b8.signing_policy: Permission denied sed: can't read /var/lib/boinc/shared/init_data.xml: No such file or directory 21:22:39 +0100 2017-01-03 [INFO] Reading volunteer information ERROR init_data.xml not found in /var/lib/boinc/shared EDIT: I also saw this line: cp: /cvmfs/grid.cern.ch/etc/grid-security/certificates/seegrid-ca-2013.signing_policy and /etc/grid-security/certificates/seegrid-ca-2013.signing_policy are the same file |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Sorry, try running sudo /cvmfs/grid.cern.ch/vc/sbin/bootstrap |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
This is nice: vccondor01.cern.ch 9618 tcp/condor succeeded! Last lines at the moment: boinc@localhost 22:25:49 +0100 2017-01-03 [DEBUG] HTCondor ping 22:25:51 +0100 2017-01-03 [DEBUG] 0 EDIT: VBox.log shows: 00:06:43.720852 VMMDev: Guest Log: [INFO] New Job Starting in slot1 00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1 00:06:48.909850 VMMDev: Guest Log: [INFO] MCPlots JobID: 34585362 in slot1 |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
It looks like it is working. It should run for up to 18 hours. It will not shut down, but it will write a file in the shared directory when done. If this works but BOINC does not, the cause will be either the configuration of the VM or the image. |
Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
I just ran these tests as well, and it looks like the job successfully starts. It works, but not in BOINC. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
The shared directory now contains a new zero-byte file named heartbeat, whose date is updated every minute. The second file is init_data.xml from yesterday. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266 |
Laurence wrote: "It looks like it is working. Should run for upto 18 hours. It will not shutdown but write a file in the shared directory when done." I think that after 12 hours it will not request more jobs once the job running at that point has finished (if not killed by the user first). Normally a shutdown file will be created in the shared folder, but the VM will not stop or shut down because no vboxwrapper is involved. For the same reason, it will not stop after the 18-hour lifetime either. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
Thanks CP. The shutdown file appeared in the shared directory at 10:06 UTC, after 12 hours and 47 minutes. 00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1 00:06:48.909850 VMMDev: Guest Log: [INFO] MCPlots JobID: 34585362 in slot1 01:53:46.844961 VMMDev: Guest Log: [INFO] Condor JobID: 829728. 0 in slot1 01:53:51.955572 VMMDev: Guest Log: [INFO] MCPlots JobID: 34587449 in slot1 02:31:35.783722 VMMDev: Guest Log: [INFO] Condor JobID: 830484. 0 in slot1 02:31:40.903547 VMMDev: Guest Log: [INFO] MCPlots JobID: 34587901 in slot1 04:06:21.272829 VMMDev: Guest Log: [INFO] Condor JobID: 832066. 0 in slot1 04:06:26.384362 VMMDev: Guest Log: [INFO] MCPlots JobID: 34589785 in slot1 04:31:23.786617 VMMDev: Guest Log: [INFO] Condor JobID: 832458. 0 in slot1 04:31:28.881326 VMMDev: Guest Log: [INFO] MCPlots JobID: 34590022 in slot1 05:19:14.461709 VMMDev: Guest Log: [INFO] Condor JobID: 833167. 0 in slot1 05:19:22.614642 VMMDev: Guest Log: [INFO] MCPlots JobID: 34590732 in slot1 08:54:39.781990 VMMDev: Guest Log: [INFO] Condor JobID: 836396. 0 in slot1 08:54:44.876285 VMMDev: Guest Log: [INFO] MCPlots JobID: 34593868 in slot1 11:00:16.891918 VMMDev: Guest Log: [INFO] Condor JobID: 838155. 0 in slot1 11:00:22.001115 VMMDev: Guest Log: [INFO] MCPlots JobID: 34595895 in slot1 11:34:09.327877 VMMDev: Guest Log: [INFO] Condor JobID: 838672. 0 in slot1 11:34:14.456055 VMMDev: Guest Log: [INFO] MCPlots JobID: 34596135 in slot1 12:36:56.543292 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0. 12:46:23.065507 VMMDev: Guest Log: [INFO] Condor exited with return value N/A. 12:46:23.126045 VMMDev: Guest Log: [INFO] Shutting Down. 12:47:50.377031 GUI: UIMediumEnumerator: Medium-enumeration started... 12:47:50.610321 GUI: UIMediumEnumerator: Medium-enumeration finished! I am shutting the VM down myself now. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Perfect! Now take that very same VM, eject the CDROM image and replace the hard disk image with the Theory_2016_11_02.vdi from this project. Start the VM and see what happens. Note that to use the image directly in this way you need to provide the shared directory with the init_data.xml file. |
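
The host-side steps described in this thread (copying init_data.xml from a BOINC slot directory into the shared directory, then watching the heartbeat file) could be sketched roughly as follows. This is only an illustrative helper, not part of BOINC, vboxwrapper or the LHC@home applications. The function names, the directory arguments and the 300-second freshness threshold are all assumptions; the thread only states that the file is named heartbeat and that its date changes every minute.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def find_init_data(slots_dir: Path) -> Optional[Path]:
    """Return the most recently modified init_data.xml under slots_dir, if any."""
    candidates = sorted(
        slots_dir.glob("*/init_data.xml"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def prepare_shared_dir(slots_dir: Path, shared_dir: Path) -> Path:
    """Copy init_data.xml from a slot directory into the shared directory."""
    source = find_init_data(slots_dir)
    if source is None:
        raise FileNotFoundError(f"no init_data.xml found under {slots_dir}")
    shared_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, shared_dir / "init_data.xml"))


def heartbeat_age(shared_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the heartbeat file was last modified, or None if it is missing."""
    heartbeat = shared_dir / "heartbeat"
    if not heartbeat.exists():
        return None
    now = time.time() if now is None else now
    return now - heartbeat.stat().st_mtime


def heartbeat_is_fresh(shared_dir: Path, max_age_s: float = 300.0) -> bool:
    """True if the heartbeat file exists and was modified recently.

    max_age_s is a hypothetical threshold chosen for illustration only.
    """
    age = heartbeat_age(shared_dir)
    return age is not None and age <= max_age_s
```

One might call prepare_shared_dir with the slots directory of the local BOINC client and the host folder that VirtualBox exposes as shared, then poll heartbeat_is_fresh while the VM runs. A missing or stale heartbeat file would correspond to the condition this thread is about.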
©2024 CERN