Message boards : Number crunching : Missing heartbeat file errors

Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28302 - Posted: 31 Dec 2016, 15:57:22 UTC

I tried downgrading VirtualBox to version 5.1.10, which previously worked, but that did not help.
ID: 28302
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28304 - Posted: 1 Jan 2017, 23:42:48 UTC

I am thinking of setting up Wireshark so I can capture the traffic from my Gigabit Ethernet port and let the developers try to debug what is going on. Are any of the developers able to accept the resulting PCAP file for debugging?
ID: 28304
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28305 - Posted: 2 Jan 2017, 1:58:59 UTC

I have set up Wireshark, and made sure to install Npcap beforehand in WinPcap compatibility mode so that Wireshark can use it. (Npcap is essentially WinPcap reworked to use NDIS 6, because Windows 10 deprecates NDIS 5 and might remove it at any time, as it did in the Windows 10 beta, which caused compatibility problems. Npcap is also faster and more efficient. NDIS is a networking stack interface in Windows.) If a developer is ready for a PCAP, I will provide it.
ID: 28305
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28311 - Posted: 2 Jan 2017, 21:38:30 UTC - in response to Message 28301.  

On a broader note, it seems to me that projects using HTCondor have not been set up having regard to the vagaries of common domestic internet connections, not UK ADSL at any rate. I'm sure that, if the various timeouts etc. could be suitably adjusted, and suspend/resume made a bit more robust, current projects could run as smoothly as the original LHC (SixTrack/T4T) The Gold Standard.


Thanks for your constructive suggestions. This is something that we would definitely like to get to the bottom of. Progress is a little slow at the moment due to the annual closure. Just to clarify, the smooth operation of the original LHC applications in this situation is just an illusion. The heartbeat mechanism was added as a protection against VMs failing to boot or hanging. Without it, VMs would just continue running until the 24-hour time limit for the task was reached, and the task would be reported as a success, although in many cases with a low value for the CPU time. The issues are also unrelated to HTCondor, but may in some cases be related to CernVM and CVMFS. So although the perception may be that the applications have taken a step backwards, detecting an error condition and reporting it as a failure is in fact a step forward. The step we are now trying to take is to identify the causes so that we can either improve the detection and error messages or at least give a good troubleshooting guide. Volunteer Computing takes us away from the comfort of our data centres and into the wild corners of the Internet. There will be lots of common situations that were just not possible to anticipate, and they can only be discovered through exploration.
ID: 28311
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28312 - Posted: 2 Jan 2017, 21:50:56 UTC - in response to Message 28305.  

Jesse Viviano wrote:
I have set up Wireshark, and made sure to install Npcap beforehand in WinPcap compatibility mode so that Wireshark can use it. (Npcap is essentially WinPcap reworked to use NDIS 6, because Windows 10 deprecates NDIS 5 and might remove it at any time, as it did in the Windows 10 beta, which caused compatibility problems. Npcap is also faster and more efficient. NDIS is a networking stack interface in Windows.) If a developer is ready for a PCAP, I will provide it.


Thanks for your efforts, but it should not be necessary for you to go to all this trouble. My personal view is that if we have to resort to such methods, it is a red flag that we either haven't done the correct diagnostics or haven't provided adequate tooling. We should try to isolate the test case and make it reproducible.

I would suggest starting from the original CernVM image file. Download this image and use it to create a 64-bit Linux VM in VirtualBox. When you start the VM, it should boot to the command prompt. Please let us know the result.
ID: 28312
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28314 - Posted: 2 Jan 2017, 22:59:53 UTC - in response to Message 28312.  
Last modified: 2 Jan 2017, 23:41:18 UTC

It goes to the screen with the following text requesting a login, not a command prompt:
Welcome to CERN Virtual Machine, version 3.6.5.15
based on Scientific Linux release 6.8 (Carbon)
Kernel 4.1.35-25.cernvm.x86_64 on an x86_64

IP Address of this VM: 10.0.2.15
In order to apply cernvm-online context, use #<PIN> as user name.

localhost login: _

In short, it looks just like the screen that displays when an ATLAS@home task is properly functioning.
ID: 28314
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 28317 - Posted: 3 Jan 2017, 9:03:56 UTC

It's the same for me.
Shared Folders are not mounted.
vbox.log and VBoxHardening.log are shown.
I have manually mounted VBoxGuestAdditions under IDE Secondary Master.
ID: 28317
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28320 - Posted: 3 Jan 2017, 19:08:55 UTC - in response to Message 28314.  

Jesse Viviano wrote:
It goes to the screen with the following text requesting a login, not a command prompt:

Your terminology is correct. It should boot to the login prompt.


Jesse Viviano wrote:
In short, it looks just like the screen that displays when an ATLAS@home task is properly functioning.


Great! This suggests that the VirtualBox installation, machine, network and CernVM are all working correctly.

Now try repeating the test, but this time add this ISO image to the CDROM drive before you start the VM. If it reaches the login prompt, you can now log in with the user name boinc and password debug.
ID: 28320
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 28321 - Posted: 3 Jan 2017, 19:33:43 UTC - in response to Message 28320.  
Last modified: 3 Jan 2017, 19:48:40 UTC

Laurence wrote:
If it reaches the login prompt, you can now log in with the user name boinc and password debug.


Booting ok - same login screen as yesterday:

localhost login: boinc
Password:

After entering the password debug, the following line is shown:

[boinc@localhost ~]$

Is this ok?

Edit: Storage in VM
IDE Primary Master: ucernvm-prod.2.7-7.cernvm.x86_64.vdi (Normal 20,00 GB)
IDE Primary Slave: [Optical Drive]context.iso(356,00 KB)
IDE Secondary Master: [Optical Drive]VBoxGuestAdditions.iso (56,66 MB)
ID: 28321
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28323 - Posted: 3 Jan 2017, 19:50:26 UTC - in response to Message 28321.  


maeax wrote:
Is this ok?


Yep. Here is the next step.

Create a shared directory for that VM and give it the name shared. On your machine, place the init_data.xml file in that shared directory; you can harvest it from one of the slot directories in the BOINC client state directory while the Theory app from LHC@home is running. Once you have done that, run the following command from the command prompt of your VM.

/cvmfs/grid.cern.ch/vc/sbin/bootstrap
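
A quick check on the host can confirm that the shared directory is ready before the command is run. The sketch below is only an illustration in Python: the function name check_shared_dir is made up here, and it tests nothing beyond the presence and size of init_data.xml in the directory that is shared with the VM.

```python
from pathlib import Path


def check_shared_dir(shared_dir):
    """Return a list of problems; an empty list means the directory looks ready.

    Hypothetical helper: it only checks that init_data.xml exists and is
    not empty. It does not validate the contents of the file.
    """
    shared = Path(shared_dir)
    problems = []
    if not shared.is_dir():
        problems.append(f"{shared} is not a directory")
        return problems
    init_data = shared / "init_data.xml"
    if not init_data.is_file():
        problems.append(f"{init_data} is missing")
    elif init_data.stat().st_size == 0:
        problems.append(f"{init_data} is empty")
    return problems
```

Calling check_shared_dir with the host path of the shared folder returns an empty list when init_data.xml is in place, and a short description of the problem otherwise.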
ID: 28323
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 28324 - Posted: 3 Jan 2017, 20:32:10 UTC
Last modified: 3 Jan 2017, 20:45:28 UTC

last line in VBox.log

00:01:18.228983 VMMDev: Guest Log: 00:00:00.097216 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared"

last lines in command line of VM:
cp: cannot create regular file /etc/grid-security/certificates/e3ed16b8.signing_policy: Permission denied

sed: can't read /var/lib/boinc/shared/init_data.xml: No such file or directory
21:22:39 +0100 2017-01-03 [INFO] Reading volunteer information
ERROR init_data.xml not found in /var/lib/boinc/shared

EDIT: saw this line also:

cp: /cvmfs/grid.cern.ch/etc/grid-security/certificates/seegrid-ca-2013.signing_policy and /etc/grid-security/certificates/seegrid-ca-2013.signing_policy are the same file
ID: 28324
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28325 - Posted: 3 Jan 2017, 21:11:19 UTC - in response to Message 28324.  
Last modified: 3 Jan 2017, 21:11:33 UTC

Sorry, try running

sudo /cvmfs/grid.cern.ch/vc/sbin/bootstrap
ID: 28325
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 28326 - Posted: 3 Jan 2017, 21:31:59 UTC - in response to Message 28325.  
Last modified: 3 Jan 2017, 21:50:30 UTC

This is nice:

vccondor01.cern.ch 9618 tcp/condor succeeded!

Last lines at the moment:
boinc@localhost 22:25:49 +0100 2017-01-03 [DEBUG] HTCondor ping
22:25:51 +0100 2017-01-03 [DEBUG] 0

EDIT: VBox.log shows:
00:06:43.720852 VMMDev: Guest Log: [INFO] New Job Starting in slot1
00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1
00:06:48.909850 VMMDev: Guest Log: [INFO] MCPlots JobID: 34585362 in slot1
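
The "succeeded!" line above reports that TCP port 9618 on vccondor01.cern.ch was reachable from inside the VM. For comparison, roughly the same check can be made from the host. This is a minimal sketch, not the test that bootstrap itself runs: the function name can_connect and the 5 second timeout are assumptions, and it only shows that a TCP connection opens, not that HTCondor is answering.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper. DNS failures, refused connections and timeouts
    are all reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the host and port from the output above, the call would be can_connect("vccondor01.cern.ch", 9618).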
ID: 28326
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28327 - Posted: 3 Jan 2017, 22:36:33 UTC - in response to Message 28326.  
Last modified: 3 Jan 2017, 22:38:06 UTC

It looks like it is working. It should run for up to 18 hours. It will not shut down, but will write a file in the shared directory when done.

If this works but BOINC does not, the problem will be either the configuration of the VM or the image.
ID: 28327
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28328 - Posted: 3 Jan 2017, 23:31:38 UTC - in response to Message 28327.  

I just ran these tests as well, and it looks like the job successfully starts. It works, but not in BOINC.
ID: 28328
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 28329 - Posted: 3 Jan 2017, 23:57:46 UTC - in response to Message 28328.  

Jesse Viviano wrote:
I just ran these tests as well, and it looks like the job successfully starts. It works, but not in BOINC.

Thanks, Jesse. Progress at last, more info for the debugging!
ID: 28329
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 28330 - Posted: 4 Jan 2017, 9:55:39 UTC

The shared directory now contains a new file named heartbeat, with zero bytes.
Its modification date changes every minute.
The second file is init_data.xml from yesterday.
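
For anyone who wants to watch this from the host: a heartbeat file like this can be checked by comparing its modification time with the current time. The sketch below is only an illustration and is not how vboxwrapper does it. The name heartbeat_is_fresh and the 120 second threshold (two missed one-minute updates) are assumptions.

```python
import os
import time


def heartbeat_is_fresh(path, max_age_seconds=120, now=None):
    """Return True if the file at `path` exists and was modified recently.

    Hypothetical helper: the threshold is an assumed value, chosen as
    twice the one-minute update interval observed above.
    """
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file counts as a stale heartbeat.
        return False
    if now is None:
        now = time.time()
    return (now - mtime) <= max_age_seconds
```

The file size does not matter for this check, which fits with the file being zero bytes: only the timestamp carries information.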
ID: 28330
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 28331 - Posted: 4 Jan 2017, 10:11:58 UTC - in response to Message 28327.  

Laurence wrote:
It looks like it is working. It should run for up to 18 hours. It will not shut down, but will write a file in the shared directory when done.

I think that after 12 hours it will not request more jobs once the job running at that point has finished (if not killed by the user first).
Normally a shutdown file will be created in the shared folder, but the VM will not stop or shut down because no vboxwrapper is involved.
For the same reason, it will not stop after the 18-hour lifetime either.
ID: 28331
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 28334 - Posted: 4 Jan 2017, 10:55:00 UTC - in response to Message 28331.  

Thanks CP,

The shutdown file was in the shared directory at 10:06 UTC, after 12 hours and 47 minutes.

00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1
00:06:48.909850 VMMDev: Guest Log: [INFO] MCPlots JobID: 34585362 in slot1

01:53:46.844961 VMMDev: Guest Log: [INFO] Condor JobID: 829728. 0 in slot1
01:53:51.955572 VMMDev: Guest Log: [INFO] MCPlots JobID: 34587449 in slot1

02:31:35.783722 VMMDev: Guest Log: [INFO] Condor JobID: 830484. 0 in slot1
02:31:40.903547 VMMDev: Guest Log: [INFO] MCPlots JobID: 34587901 in slot1

04:06:21.272829 VMMDev: Guest Log: [INFO] Condor JobID: 832066. 0 in slot1
04:06:26.384362 VMMDev: Guest Log: [INFO] MCPlots JobID: 34589785 in slot1

04:31:23.786617 VMMDev: Guest Log: [INFO] Condor JobID: 832458. 0 in slot1
04:31:28.881326 VMMDev: Guest Log: [INFO] MCPlots JobID: 34590022 in slot1

05:19:14.461709 VMMDev: Guest Log: [INFO] Condor JobID: 833167. 0 in slot1
05:19:22.614642 VMMDev: Guest Log: [INFO] MCPlots JobID: 34590732 in slot1

08:54:39.781990 VMMDev: Guest Log: [INFO] Condor JobID: 836396. 0 in slot1
08:54:44.876285 VMMDev: Guest Log: [INFO] MCPlots JobID: 34593868 in slot1

11:00:16.891918 VMMDev: Guest Log: [INFO] Condor JobID: 838155. 0 in slot1
11:00:22.001115 VMMDev: Guest Log: [INFO] MCPlots JobID: 34595895 in slot1

11:34:09.327877 VMMDev: Guest Log: [INFO] Condor JobID: 838672. 0 in slot1
11:34:14.456055 VMMDev: Guest Log: [INFO] MCPlots JobID: 34596135 in slot1

12:36:56.543292 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
12:46:23.065507 VMMDev: Guest Log: [INFO] Condor exited with return value N/A.
12:46:23.126045 VMMDev: Guest Log: [INFO] Shutting Down.
12:47:50.377031 GUI: UIMediumEnumerator: Medium-enumeration started...
12:47:50.610321 GUI: UIMediumEnumerator: Medium-enumeration finished!

I am shutting down the VM myself now.
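
To see at a glance when each job started, the Condor lines can be pulled out of VBox.log with a small script. This is a rough sketch that assumes the line format is exactly as pasted above; the names JOB_LINE, condor_job_starts and SAMPLE are made up for the example, and the leading timestamp is read as time elapsed since the VM was started.

```python
import re
from datetime import timedelta

# Matches lines such as:
# 00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1
JOB_LINE = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.\d+ VMMDev: Guest Log: \[INFO\] Condor JobID: (\d+)"
)


def condor_job_starts(log_text):
    """Return (job_id, elapsed) pairs for every Condor job line in the log."""
    starts = []
    for line in log_text.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            hours, minutes, seconds, job_id = match.groups()
            elapsed = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
            starts.append((job_id, elapsed))
    return starts


# Two Condor lines and one MCPlots line copied from the log above.
SAMPLE = """\
00:06:43.804921 VMMDev: Guest Log: [INFO] Condor JobID: 827840. 0 in slot1
00:06:48.909850 VMMDev: Guest Log: [INFO] MCPlots JobID: 34585362 in slot1
01:53:46.844961 VMMDev: Guest Log: [INFO] Condor JobID: 829728. 0 in slot1
"""

for job_id, elapsed in condor_job_starts(SAMPLE):
    print(job_id, elapsed)
```

Subtracting consecutive elapsed values gives the approximate run time of each job, since a new job only starts after the previous one has finished.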
ID: 28334
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 28346 - Posted: 4 Jan 2017, 19:55:11 UTC - in response to Message 28334.  
Last modified: 4 Jan 2017, 23:16:24 UTC

Perfect! Now take that very same VM, eject the CDROM image and replace the hard disk image with the Theory_2016_11_02.vdi from this project. Start the VM and see what happens. Note that to use the image directly in this way you need to provide the shared directory with the init_data.xml file.
ID: 28346


©2024 CERN