Message boards : Number crunching : Missing heartbeat file errors

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,870
RAC: 2,011
Message 28349 - Posted: 4 Jan 2017, 22:27:33 UTC - in response to Message 28346.  

Perfect! Now take that very same VM, eject the CDROM image and replace the hard disk image with the Theory_2016_11_02.vdi from this project. Start the VM and see what happens. Note that to use the image directly in this way you need to provide the shared directory with the init_data.xml file.

VM with your proposed config started well.
After 1 minute uptime the new heartbeat file was created in the shared directory and 1 minute later a new job started.
I'll let it run overnight.
ID: 28349
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,922,687
RAC: 124,848
Message 28352 - Posted: 5 Jan 2017, 9:23:06 UTC
Last modified: 5 Jan 2017, 9:26:39 UTC

I let a second task run overnight with the first .vdi.
It finished at 04:29 UTC today with TEN finished Condor jobs.

This is from the output of the new Theory_.vdi:

Last two lines in VM:

Starting libvirtd daemon: [ok]

/etc/rc3.d/S99local: line 1: /cvmfs/grid.cern.ch/vc/sbin/bootstrap: No such file or directory
bootlogd: no process killed

Must context.iso be removed from Storage?


Storage in VM
IDE Primary Master: Theory_2016_11_02.vdi (Normal 20,00 GB)
IDE Primary Slave: [Optical Drive]context.iso(356,00 KB)
IDE Secondary Master: [Optical Drive]VBoxGuestAdditions.iso (56,66 MB)

Saw hardening errors:

00:06:16.189045 supR3HardenedErrorV: supR3HardenedScreenImage/LdrLoadDll: rc=VERR_SUP_VP_NOT_OWNED_BY_TRUSTED_INSTALLER fImage=1 fProtect=0x0 fAccess=0x0 \Device\HarddiskVolume4\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll:

supHardenedWinVerifyImageByHandle: TrustedInstaller is not the owner of '\Device\HarddiskVolume4\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll'.

00:06:16.189206 supR3HardenedErrorV: supR3HardenedMonitor_LdrLoadDll: rejecting 'C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll' (C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll): rcNt=0xc0000190

00:06:16.190659 supR3HardenedErrorV: supR3HardenedScreenImage/LdrLoadDll: cached rc=VERR_SUP_VP_NOT_OWNED_BY_TRUSTED_INSTALLER fImage=1 fProtect=0x0 fAccess=0x0 cHits=1 \Device\HarddiskVolume4\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll

00:06:16.190739 supR3HardenedErrorV: supR3HardenedMonitor_LdrLoadDll: rejecting 'C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll' (C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll): rcNt=0xc0000190

00:06:16.192151 supR3HardenedErrorV: supR3HardenedScreenImage/LdrLoadDll: cached rc=VERR_SUP_VP_NOT_OWNED_BY_TRUSTED_INSTALLER fImage=1 fProtect=0x0 fAccess=0x0 cHits=2 \Device\HarddiskVolume4\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll

00:06:16.192175 supR3HardenedErrorV: supR3HardenedMonitor_LdrLoadDll: rejecting 'C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll' (C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll): rcNt=0xc0000190

00:06:16.193638 supR3HardenedErrorV: supR3HardenedScreenImage/LdrLoadDll: cached rc=VERR_SUP_VP_NOT_OWNED_BY_TRUSTED_INSTALLER fImage=1 fProtect=0x0 fAccess=0x0 cHits=3 \Device\HarddiskVolume4\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll

00:06:16.193709 supR3HardenedErrorV: supR3HardenedMonitor_LdrLoadDll: rejecting 'C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll' (C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll): rcNt=0xc0000190

00:06:16.195147 supR3HardenedErrorV: supR3HardenedScreenImage/LdrLoadDll: cached rc=VERR_SUP_VP_NOT_OWNED_BY_TRUSTED_INSTALLER fImage=1 fProtect=0x0 fAccess=0x0 cHits=4 \Device\HarddiskVolume4\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll

00:06:16.195220 supR3HardenedErrorV: supR3HardenedMonitor_LdrLoadDll: rejecting 'C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll' (C:\Users\x\AppData\Local\Microsoft\OneDrive\17.3.6720.1207\amd64\FileSyncShell64.dll): rcNt=0xc0000190
ID: 28352
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,870
RAC: 2,011
Message 28353 - Posted: 5 Jan 2017, 9:58:11 UTC - in response to Message 28352.  

Must context.iso be removed from Storage?


Yes
ID: 28353
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,870
RAC: 2,011
Message 28354 - Posted: 5 Jan 2017, 11:06:41 UTC - in response to Message 28349.  

Perfect! Now take that very same VM, eject the CDROM image and replace the hard disk image with the Theory_2016_11_02.vdi from this project. Start the VM and see what happens. Note that to use the image directly in this way you need to provide the shared directory with the init_data.xml file.

VM with your proposed config started well.
After 1 minute uptime the new heartbeat file was created in the shared directory and 1 minute later a new job started.
I'll let it run overnight.

VM finished after >12 hours runtime with the Theory_2016_11_02.vdi image:

00:02:17.840314 VMMDev: Guest Log: [INFO] New Job Starting in slot1
00:02:17.903079 VMMDev: Guest Log: [INFO] Condor JobID: 849938. 0 in slot1
00:02:23.087032 VMMDev: Guest Log: [INFO] MCPlots JobID: 34607506 in slot1
00:59:57.017772 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
01:00:00.343660 VMMDev: Guest Log: [INFO] New Job Starting in slot1
01:00:00.637763 VMMDev: Guest Log: [INFO] Condor JobID: 829749. 0 in slot1
01:00:05.946158 VMMDev: Guest Log: [INFO] MCPlots JobID: 34587424 in slot1
03:39:29.935414 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
03:39:33.953839 VMMDev: Guest Log: [INFO] New Job Starting in slot1
03:39:34.263818 VMMDev: Guest Log: [INFO] Condor JobID: 853036. 0 in slot1
03:39:39.798883 VMMDev: Guest Log: [INFO] MCPlots JobID: 34610676 in slot1
12:25:44.599676 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
12:36:09.987195 VMMDev: Guest Log: [INFO] Condor exited with return value N/A.
12:36:10.047502 VMMDev: Guest Log: [INFO] Shutting Down.


Not that many jobs, because I sometimes paused the VM for other duties or ran it with a 20% execution cap.
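
For anyone who wants per-job run times out of a log like this, here is a rough Python sketch (not something the project ships). It pairs each "New Job Starting" line with the next "Job finished" line and subtracts the uptime stamps. The names `job_durations`, `LINE` and `LOG` are made up for the example, and the result is elapsed log time, not CPU time.

```python
import re
from datetime import timedelta

# Start/finish lines copied from the log above.
LOG = """\
00:02:17.840314 VMMDev: Guest Log: [INFO] New Job Starting in slot1
00:59:57.017772 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
01:00:00.343660 VMMDev: Guest Log: [INFO] New Job Starting in slot1
03:39:29.935414 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
03:39:33.953839 VMMDev: Guest Log: [INFO] New Job Starting in slot1
12:25:44.599676 VMMDev: Guest Log: [INFO] Job finished in slot1 with 0.
"""

# Uptime stamp (h:m:s.fraction) followed by the guest message.
LINE = re.compile(r"^(\d+):(\d+):(\d+)\.(\d+) VMMDev: Guest Log: \[INFO\] (.*)$")


def job_durations(log_text):
    """Return one timedelta per completed job, pairing start and finish lines."""
    durations = []
    started = None
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        hours, minutes, seconds, fraction, message = match.groups()
        stamp = timedelta(
            hours=int(hours),
            minutes=int(minutes),
            seconds=int(seconds),
            # Pad or cut the fraction to microseconds.
            microseconds=int(fraction.ljust(6, "0")[:6]),
        )
        if message.startswith("New Job Starting"):
            started = stamp
        elif message.startswith("Job finished") and started is not None:
            durations.append(stamp - started)
            started = None
    return durations


for duration in job_durations(LOG):
    print(duration)
```

This assumes a single slot, as in the log above; with several slots the pairing would have to be keyed on the slot name.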
ID: 28354
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 373
Credit: 238,712
RAC: 0
Message 28355 - Posted: 5 Jan 2017, 21:31:48 UTC - in response to Message 28354.  
Last modified: 5 Jan 2017, 21:41:36 UTC

So it looks like there is an issue with the Theory_2016_11_02.vdi image that we are using. As it is working for CP (Windows 7) but not for maeax and Jesse (Windows 10), this suggests a compatibility problem. My guess would be that the virtual hardware in the VM differs between the VM where the image was built and the VMs where it is failing. Please could those of you who have been testing email me the .vbox file for the VM.

EDIT: If anyone wants to investigate, here is an example of a .vbox file used for the build.
ID: 28355
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,870
RAC: 2,011
Message 28356 - Posted: 5 Jan 2017, 22:16:03 UTC - in response to Message 28355.  

Please could those of you who have been testing email me the .vbox file for the VM.

I suppose you're only interested in the *VM*.vbox file if the VM is not working correctly?
ID: 28356
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,922,687
RAC: 124,848
Message 28357 - Posted: 6 Jan 2017, 8:09:30 UTC - in response to Message 28355.  
Last modified: 6 Jan 2017, 8:30:25 UTC


EDIT: If anyone wants to investigate, here is an example of a .vbox file used for the build.


I have an SSD for storage; is this a problem for .vbox?

EDIT: Theory_.vdi has Linux_64Bits_Generic from 16/10/24 4.1.34-22
uc_.vdi has Linux_64Bits_Generic from 16/11/7 4.1.35-25

Both have 4.3.28 r100309.
ID: 28357
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 373
Credit: 238,712
RAC: 0
Message 28364 - Posted: 6 Jan 2017, 22:21:11 UTC - in response to Message 28355.  

I managed to reproduce the error on my machine by overwriting the network block in the vbox XML file of a test VM with the respective content from a vbox XML file of a VM which was not working. So it looks like that error is generated when it can't access the network. Still trying to understand what exactly is making it fail.
ID: 28364
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 373
Credit: 238,712
RAC: 0
Message 28365 - Posted: 6 Jan 2017, 22:52:47 UTC - in response to Message 28364.  

From what I can see this doesn't work

<Adapter slot="0" enabled="true" MACAddress="080027F1E677" type="82540EM">

but this does work

<Adapter slot="0" enabled="true" MACAddress="080027F1E677" type="82540EM" cable="true">

What cable="true" means, why it is required and why it is not there still needs to be understood.
ID: 28365
m

Joined: 6 Sep 08
Posts: 116
Credit: 11,063,327
RAC: 4,484
Message 28367 - Posted: 6 Jan 2017, 23:21:49 UTC - in response to Message 28365.  
Last modified: 6 Jan 2017, 23:27:09 UTC

As I remember, we've been here before in the days of T4T. It may be that "cable" refers to the simulated network cable seen by the VM, i.e. cable="false" simulates the network cable being unplugged.
This is the VBoxManage modifyvm option that controls it, quoted from an old manual:

--cableconnected<1-N> on|off: This allows you to temporarily disconnect a virtual network interface, as if a network cable had been pulled from a real network card. This might be useful for resetting certain software components in the VM.
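
A small Python sketch of how that option could be scripted, assuming VBoxManage is on the PATH. The helper names `cable_command` and `set_cable` are hypothetical, and `boinc_xxx` is just a placeholder VM name. As far as I know the command-line adapter numbers start at 1, so slot="0" in the .vbox file would be adapter 1 here; treat that mapping as an assumption. Note that modifyvm only applies to a VM that is powered off.

```python
import subprocess


def cable_command(vm_name, adapter=1, connected=True):
    """Build the VBoxManage argument list for --cableconnected<1-N> on|off."""
    if adapter < 1:
        raise ValueError("adapter numbers start at 1")
    state = "on" if connected else "off"
    return ["VBoxManage", "modifyvm", vm_name, f"--cableconnected{adapter}", state]


def set_cable(vm_name, adapter=1, connected=True):
    """Run the command; raises CalledProcessError if VBoxManage reports failure."""
    subprocess.run(cable_command(vm_name, adapter, connected), check=True)


# Only print the command here; call set_cable() to actually apply it.
print(" ".join(cable_command("boinc_xxx")))
```

Building the argument list separately makes it easy to check what would be run before touching a real VM.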
ID: 28367
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,922,687
RAC: 124,848
Message 28368 - Posted: 7 Jan 2017, 8:55:35 UTC

In an old T4T forum thread I saw this example of the parameter in a VBox.log:

[/Devices/e1000/0/Config/] (level 4)
00:00:01.528 AdapterType <integer> = 0x0000000000000000 (0)
00:00:01.528 cableConnected <integer> = 0x0000000000000001 (1)
00:00:01.528 LineSpeed <integer> = 0x0000000000000000 (0)
00:00:01.528 MAC <bytes> = "08 00 27 fc 44 b2" (cb=6)
ID: 28368
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,870
RAC: 2,011
Message 28369 - Posted: 7 Jan 2017, 10:50:22 UTC - in response to Message 28365.  

From what I can see this doesn't work

<Adapter slot="0" enabled="true" MACAddress="080027F1E677" type="82540EM">

but this does work

<Adapter slot="0" enabled="true" MACAddress="080027F1E677" type="82540EM" cable="true">

What cable="true" means, why it is required and why it is not there still needs to be understood.

Sorry to say, Laurence, but my VM was working well and doesn't have cable="true" at the end of the Adapter slot="0" line.
ID: 28369
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 373
Credit: 238,712
RAC: 0
Message 28370 - Posted: 7 Jan 2017, 11:28:30 UTC - in response to Message 28369.  

Sorry to say, Laurence, but my VM was working well and doesn't have cable="true" at the end of the Adapter slot="0" line.


It could be that the value it defaults to if not set differs. What happens if you explicitly set cable="false"? Is the machine you are testing it on connected via WiFi or an Ethernet cable? If wifi, does it at least have an Ethernet port?
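
To see whether a given .vbox file sets the attribute at all, something like this Python sketch could help. It lists every Adapter element and reports the cable attribute, returning None when the attribute is absent (so whatever the default is applies). The function name `adapter_cable_states` and the sample XML are invented for illustration; in particular the second adapter and the namespace URI are assumptions, not taken from a real build file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down .vbox content for demonstration only.
SAMPLE = """<?xml version="1.0"?>
<VirtualBox xmlns="http://www.virtualbox.org/">
  <Machine name="example">
    <Hardware>
      <Network>
        <Adapter slot="0" enabled="true" MACAddress="080027F1E677" type="82540EM"/>
        <Adapter slot="1" enabled="false" MACAddress="080027F1E678" type="82540EM" cable="false"/>
      </Network>
    </Hardware>
  </Machine>
</VirtualBox>
"""


def adapter_cable_states(vbox_xml):
    """Map each adapter slot to its cable attribute, or None when it is absent."""
    root = ET.fromstring(vbox_xml)
    states = {}
    for elem in root.iter():
        # Compare only the local tag name so the XML namespace does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "Adapter":
            states[elem.get("slot")] = elem.get("cable")
    return states


for slot, cable in adapter_cable_states(SAMPLE).items():
    print(f"slot {slot}: cable={cable!r}")
```

Running it over the .vbox files from a working and a failing machine would show quickly whether they differ on this point.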
ID: 28370
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,922,687
RAC: 124,848
Message 28371 - Posted: 7 Jan 2017, 11:29:21 UTC

Hello CP,

do you have Windows 7?

Laurence wrote that the problem is on Windows 10.

I started ucernvm-prod_.vdi with context.iso and found this in vbox.log:

00:00:02.656643 [/Devices/e1000/] (level 2)
00:00:02.656644
00:00:02.656645 [/Devices/e1000/0/] (level 3)
00:00:02.656646 PCIBusNo <integer> = 0x0000000000000000 (0)
00:00:02.656648 PCIDeviceNo <integer> = 0x0000000000000003 (3)
00:00:02.656649 PCIFunctionNo <integer> = 0x0000000000000000 (0)
00:00:02.656650 Trusted <integer> = 0x0000000000000001 (1)
00:00:02.656651
00:00:02.656652 [/Devices/e1000/0/Config/] (level 4)
00:00:02.656653 AdapterType <integer> = 0x0000000000000000 (0)
00:00:02.656655 CableConnected <integer> = 0x0000000000000001 (1)
00:00:02.656656 LineSpeed <integer> = 0x0000000000000000 (0)
00:00:02.656657 MAC <bytes> = "08 00 27 a7 21 cb" (cb=6)
00:00:02.656659
00:00:02.656660 [/Devices/e1000/0/LUN#0/] (level 4)
00:00:02.656662 Driver <string> = "NAT" (cb=4)
00:00:02.656663
00:00:02.656663 [/Devices/e1000/0/LUN#0/Config/] (level 5)
00:00:02.656666 AliasMode <integer> = 0x0000000000000000 (0)
00:00:02.656668 BootFile <string> = "L_vdi.pxe" (cb=10)
00:00:02.656669 DNSProxy <integer> = 0x0000000000000000 (0)
00:00:02.656670 Network <string> = "10.0.2.0/24" (cb=12)
00:00:02.656671 PassDomain <integer> = 0x0000000000000001 (1)
00:00:02.656672 TFTPPrefix <string> = "C:\Users\x\.VirtualBox\TFTP" (cb=36)
00:00:02.656674 UseHostResolver <integer> = 0x0000000000000000 (0)
00:00:02.656675
00:00:02.656675 [/Devices/e1000/0/LUN#999/] (level 4)
00:00:02.656677 Driver <string> = "MainStatus" (cb=11)
00:00:02.656678
00:00:02.656679 [/Devices/e1000/0/LUN#999/Config/] (level 5)
00:00:02.656681 First <integer> = 0x0000000000000000 (0)
00:00:02.656682 Last <integer> = 0x0000000000000000 (0)
00:00:02.656683 papLeds <integer> = 0x0000000001942ac8 (26 487 496)

The first Condor-Task is running at the moment!!
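
A rough Python sketch for pulling that value out of a vbox.log automatically, in case others want to compare machines. It relies only on the line layout shown in the dump above; the function name `cable_connected` and the regular expressions are my own assumptions, and the match is case-insensitive because the older T4T log posted earlier spells it cableConnected.

```python
import re

# Three lines copied from the dump above.
SAMPLE = """\
00:00:02.656652 [/Devices/e1000/0/Config/] (level 4)
00:00:02.656653 AdapterType <integer> = 0x0000000000000000 (0)
00:00:02.656655 CableConnected <integer> = 0x0000000000000001 (1)
"""

# A device config section header such as [/Devices/e1000/0/Config/].
SECTION = re.compile(r"\[(/Devices/[^\]]+/Config/)\]")
# The CableConnected entry; the decimal value is in the trailing parentheses.
VALUE = re.compile(
    r"cableconnected\s+<integer>\s+=\s+0x[0-9a-f]+\s+\((\d+)\)", re.IGNORECASE
)


def cable_connected(log_text):
    """Return {config section: bool} for every CableConnected entry found."""
    result = {}
    section = None
    for line in log_text.splitlines():
        header = SECTION.search(line)
        if header:
            section = header.group(1)
            continue
        value = VALUE.search(line)
        if value and section is not None:
            result[section] = value.group(1) != "0"
    return result


print(cable_connected(SAMPLE))
```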
ID: 28371
m

Joined: 6 Sep 08
Posts: 116
Credit: 11,063,327
RAC: 4,484
Message 28372 - Posted: 7 Jan 2017, 12:03:34 UTC - in response to Message 28370.  


It could be that the value it defaults to if not set differs. What happens if you explicitly set cable="false"? Is the machine you are testing it on connected via WiFi or an Ethernet cable? If wifi, does it at least have an Ethernet port?


This appears in the VM Manager GUI under "settings/network/advanced".
ID: 28372
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 373
Credit: 238,712
RAC: 0
Message 28373 - Posted: 7 Jan 2017, 12:27:13 UTC - in response to Message 28371.  

Please could you start a new Theory task with BOINC and then exit BOINC. The VM files should be in C:\ProgramData\BOINC\slots\0\boinc_xxx/boinc_xxx.vbox. You can then edit that file to set cable="true" and then open VirtualBox to start the VM manually.
ID: 28373
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,922,687
RAC: 124,848
Message 28374 - Posted: 7 Jan 2017, 13:40:19 UTC - in response to Message 28373.  

Please could you start a new Theory task with BOINC and then exit BOINC. The VM files should be in C:\ProgramData\BOINC\slots\0\boinc_xxx/boinc_xxx.vbox. You can then edit that file to set cable="true" and then open VirtualBox to start the VM manually.


Under Scientific Linux or Windows 10?

In Windows 10, "Cable connected" is already on in boinc_xxx.vbox.
The task finished after 11 minutes.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=110942801

When BOINC is closed, the boinc_xxx VM keeps running and does not stop.
ID: 28374
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,870
RAC: 2,011
Message 28377 - Posted: 7 Jan 2017, 18:26:14 UTC - in response to Message 28370.  

Sorry to say, Laurence, but my VM was working well and doesn't have cable="true" at the end of the Adapter slot="0" line.


It could be that the value it defaults to if not set differs. What happens if you explicitly set cable="false"? Is the machine you are testing it on connected via WiFi or an Ethernet cable? If wifi, does it at least have an Ethernet port?

I ran a freshly booted VM with

<Adapter slot="0" enabled="true" MACAddress="0000000000" type="82540EM" cable="false"> (changed MAC)

in CernVM.vbox, and 20 minutes later it still had no job.
The heartbeat file in the shared folder is refreshed frequently.
The machine is not even capable of WiFi, so it is connected directly to the LAN.
ID: 28377
Jesse Viviano

Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 28378 - Posted: 7 Jan 2017, 18:44:53 UTC - in response to Message 28373.  

I have tried that. Adding cable="true" did not allow the VM to work. I even gave it a hard power down, edited the .vbox file, and manually restarted the VM. That did not allow the VM to work either. Have you tried using a utility like diff or WinMerge on some of the .vbox files for the VMs that do work and the VMs that do not work? I also noticed that ATLAS@home uses the same network configuration in its .vbox files, and they still work.
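
For those without diff or WinMerge, Python's difflib can do the same comparison. This is only a sketch: `vbox_diff` is a hypothetical helper, and the two sample strings are the Adapter lines Laurence posted, standing in for the full contents of a working and a failing .vbox file.

```python
import difflib


def vbox_diff(working, failing):
    """Return unified-diff lines between two .vbox file contents (as text)."""
    return list(
        difflib.unified_diff(
            working.splitlines(),
            failing.splitlines(),
            fromfile="working.vbox",
            tofile="failing.vbox",
            lineterm="",
            n=0,  # no context lines, show only what differs
        )
    )


# Stand-ins for the real file contents.
WORKING = '<Adapter slot="0" enabled="true" MACAddress="080027F1E677" type="82540EM" cable="true">'
FAILING = '<Adapter slot="0" enabled="true" MACAddress="080027F1E677" type="82540EM">'

for line in vbox_diff(WORKING, FAILING):
    print(line)
```

With real files you would read each .vbox into a string first; expect timestamps and MAC addresses to show up as differences too.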
ID: 28378
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,870
RAC: 2,011
Message 28379 - Posted: 7 Jan 2017, 18:55:03 UTC
Last modified: 7 Jan 2017, 18:58:36 UTC

After cable="false" was removed, and also after editing it to cable="true", the VM still runs, but I don't get a job in either scenario.
I'll try a fresh start later with the original *.vdi.
Except for a few timestamps there are no differences between the .vbox files.
ID: 28379