Questions and Answers : Unix/Linux : LHC Task errors on Ubuntu 18.04.1 with VirtualBox 5.2.10
Timothy Mullican

Joined: 22 Sep 18
Posts: 1
Credit: 169,693
RAC: 0
Message 37187 - Posted: 3 Nov 2018, 4:56:17 UTC

I have a 64-bit machine running Ubuntu 18.04.1 (2x E5-2670 8-core CPU w/ 128GB RAM) that seems to be failing all LHCb workloads (142 so far). See https://lhcathome.cern.ch/lhcathome/result.php?resultid=208011319 for an example. I can launch VirtualBox under the boinc account and see that a boinc virtual machine exists and is running. Below I have listed the contents of the stderr output. I have also listed my VirtualBox version for reference.

ii  virtualbox                                 5.2.10-dfsg-6ubuntu18.04.1                  amd64        x86 virtualization solution - base binaries
ii  virtualbox-dkms                            5.2.10-dfsg-6ubuntu18.04.1                  all          x86 virtualization solution - kernel module sources for dkms
ii  virtualbox-qt                              5.2.10-dfsg-6ubuntu18.04.1                  amd64        x86 virtualization solution - Qt based user interface


<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 194 (0xc2, -62)</message>
<stderr_txt>
2018-11-02 22:37:29 (13736): vboxwrapper (7.7.26196): starting
2018-11-02 22:37:30 (13736): Feature: Checkpoint interval offset (530 seconds)
2018-11-02 22:37:30 (13736): Detected: VirtualBox VboxManage Interface (Version: 5.2.10)
2018-11-02 22:37:30 (13736): Detected: Minimum checkpoint interval (600.000000 seconds)
2018-11-02 22:37:30 (13736): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2018-11-02 22:37:30 (13736): Successfully copied 'init_data.xml' to the shared directory.
2018-11-02 22:37:30 (13736): Create VM. (boinc_5b589ba7ed7d5e3c, slot#14)
2018-11-02 22:37:30 (13736): Setting Memory Size for VM. (42348MB)
2018-11-02 22:37:30 (13736): Setting CPU Count for VM. (32)
2018-11-02 22:37:30 (13736): Setting Chipset Options for VM.
2018-11-02 22:37:30 (13736): Setting Boot Options for VM.
2018-11-02 22:37:30 (13736): Setting Network Configuration for NAT.
2018-11-02 22:37:31 (13736): Enabling VM Network Access.
2018-11-02 22:37:31 (13736): Disabling USB Support for VM.
2018-11-02 22:37:31 (13736): Disabling COM Port Support for VM.
2018-11-02 22:37:31 (13736): Disabling LPT Port Support for VM.
2018-11-02 22:37:31 (13736): Disabling Audio Support for VM.
2018-11-02 22:37:31 (13736): Disabling Clipboard Support for VM.
2018-11-02 22:37:31 (13736): Disabling Drag and Drop Support for VM.
2018-11-02 22:37:31 (13736): Adding storage controller(s) to VM.
2018-11-02 22:37:31 (13736): Adding virtual disk drive to VM. (vm_image.vdi)
2018-11-02 22:37:31 (13736): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2018-11-02 22:37:31 (13736): forwarding host port 39209 to guest port 80
2018-11-02 22:37:31 (13736): Enabling remote desktop for VM.
2018-11-02 22:37:31 (13736): Required extension pack not installed, remote desktop not enabled.
2018-11-02 22:37:31 (13736): Enabling shared directory for VM.
2018-11-02 22:37:32 (13736): Starting VM. (boinc_5b589ba7ed7d5e3c, slot#14)
2018-11-02 22:37:33 (13736): Successfully started VM. (PID = '14098')
2018-11-02 22:37:33 (13736): Reporting VM Process ID to BOINC.
2018-11-02 22:37:33 (13736): Guest Log: BIOS: VirtualBox 5.2.10
2018-11-02 22:37:33 (13736): Guest Log: CPUID EDX: 0x178bfbff
2018-11-02 22:37:33 (13736): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2018-11-02 22:37:33 (13736): VM state change detected. (old = 'poweroff', new = 'running')
2018-11-02 22:37:33 (13736): Detected: Web Application Enabled (http://localhost:39209)
2018-11-02 22:37:34 (13736): Preference change detected
2018-11-02 22:37:34 (13736): Setting CPU throttle for VM. (100%)
2018-11-02 22:37:34 (13736): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2018-11-02 22:37:36 (13736): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2018-11-02 22:37:36 (13736): Guest Log: BIOS: Booting from Hard Disk...
2018-11-02 22:37:37 (13736): Guest Log: BIOS: KBD: unsupported int 16h function 03
2018-11-02 22:37:37 (13736): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 
2018-11-02 22:47:28 (13736): VM Heartbeat file specified, but missing.
2018-11-02 22:47:28 (13736): VM Heartbeat file specified, but missing file system status. (errno = '2')
2018-11-02 22:47:28 (13736): Capturing screenshot.
2018-11-02 22:47:29 (13736): Screenshot completed.
2018-11-02 22:47:29 (13736): Powering off VM.
2018-11-02 22:47:29 (13736): Successfully stopped VM.
2018-11-02 22:47:29 (13736): Deregistering VM. (boinc_5b589ba7ed7d5e3c, slot#14)
2018-11-02 22:47:29 (13736): Removing network bandwidth throttle group from VM.
2018-11-02 22:47:29 (13736): Removing storage controller(s) from VM.
2018-11-02 22:47:29 (13736): Removing VM from VirtualBox.
2018-11-02 22:47:29 (13736): Removing virtual disk drive from VirtualBox.
2018-11-02 22:47:35 (13736): Failed to open screenshot image file. (2)

    Hypervisor System Log:


    VM Execution Log:


    VM Startup Log:


    VM Trace Log:

Processor#18 speed: 3300 MHz
Processor#18 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#19 speed: 3300 MHz
Processor#19 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#20 speed: 3300 MHz
Processor#20 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#21 speed: 3300 MHz
Processor#21 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#22 speed: 3300 MHz
Processor#22 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#23 speed: 3300 MHz
Processor#23 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#24 speed: 3300 MHz
Processor#24 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#25 speed: 3300 MHz
Processor#25 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#26 speed: 3300 MHz
Processor#26 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#27 speed: 3300 MHz
Processor#27 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#28 speed: 3300 MHz
Processor#28 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#29 speed: 3300 MHz
Processor#29 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#30 speed: 3300 MHz
Processor#30 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Processor#31 speed: 3300 MHz
Processor#31 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Memory size: 128865 MByte
Memory available: 126212 MByte
Operating system: Linux
Operating system version: 4.15.0-34-generic

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q showvminfo "boinc_5b589ba7ed7d5e3c" --machinereadable 
Exit Code: -2135228415
Output:
VBoxManage: error: Could not find a registered machine named 'boinc_5b589ba7ed7d5e3c'
VBoxManage: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee nsISupports
VBoxManage: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2834 of file VBoxManageInfo.cpp

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/14/vm_image.vdi" 
Exit Code: 0
Output:
UUID:           3e2dab7c-628d-4cca-8651-aaa190ca94ae
Parent UUID:    base
State:          created
Type:           normal (base)
Location:       /var/lib/boinc-client/slots/14/vm_image.vdi
Storage format: VDI
Format variant: dynamic default
Capacity:       20480 MBytes
Size on disk:   2325 MBytes
Encryption:     disabled

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/14/vm_image.vdi" 
Exit Code: 0
Output:
UUID:           3e2dab7c-628d-4cca-8651-aaa190ca94ae
Parent UUID:    base
State:          created
Type:           normal (base)
Location:       /var/lib/boinc-client/slots/14/vm_image.vdi
Storage format: VDI
Format variant: dynamic default
Capacity:       20480 MBytes
Size on disk:   2325 MBytes
Encryption:     disabled

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q closemedium disk "/var/lib/boinc-client/slots/14/vm_image.vdi" 
Exit Code: 0
Output:

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q createvm --name "boinc_5b589ba7ed7d5e3c" --basefolder "/var/lib/boinc-client/slots/14" --ostype "Linux26_64" --register
Exit Code: 0
Output:
Virtual machine 'boinc_5b589ba7ed7d5e3c' is created and registered.
UUID: 34f1503d-39b1-4a76-b242-93809942b150
Settings file: '/var/lib/boinc-client/slots/14/boinc_5b589ba7ed7d5e3c/boinc_5b589ba7ed7d5e3c.vbox'

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --description "LHCb_3945969_1540921935.969064_0" 
Exit Code: 0
Output:

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --memory 42348 
Exit Code: 0
Output:

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --cpus 32 
Exit Code: 0
Output:

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --acpi on --ioapic on 
Exit Code: 0
Output:

2018-11-02 22:37:30 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --boot1 disk --boot2 dvd --boot3 none --boot4 none 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --nic1 nat --natdnsproxy1 on --cableconnected1 off 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --cableconnected1 on 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --usb off 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --uart1 off --uart2 off 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --lpt1 off --lpt2 off 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --audio none 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --clipboard disabled 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --draganddrop disabled 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q storagectl "boinc_5b589ba7ed7d5e3c" --name "Hard Disk Controller" --add "ide" --controller "PIIX4" 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q storageattach "boinc_5b589ba7ed7d5e3c" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --setuuid "" --medium "/var/lib/boinc-client/slots/14/vm_image.vdi" 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q bandwidthctl "boinc_5b589ba7ed7d5e3c" add "boinc_5b589ba7ed7d5e3c_net" --type network --limit 1024G 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --natpf1 ",tcp,127.0.0.1,39209,,80" 
Exit Code: 0
Output:

2018-11-02 22:37:31 (13736): 
Command: VBoxManage -q list extpacks
Exit Code: 0
Output:
Extension Packs: 1
Pack no. 0:   VNC
Version:      5.2.10
Revision:     121806
Edition:      
Description:  VNC plugin module
VRDE Module:  VBoxVNC
Usable:       true 
Why unusable: 

2018-11-02 22:37:32 (13736): 
Command: VBoxManage -q sharedfolder add "boinc_5b589ba7ed7d5e3c" --name "shared" --hostpath "/var/lib/boinc-client/slots/14/shared"
Exit Code: 0
Output:

2018-11-02 22:37:32 (13736): 
Command: VBoxManage -q startvm "boinc_5b589ba7ed7d5e3c" --type headless
Exit Code: 0
Output:
Waiting for VM "boinc_5b589ba7ed7d5e3c" to power on...
VM "boinc_5b589ba7ed7d5e3c" has been successfully started.

2018-11-02 22:37:34 (13736): 
Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" cpuexecutioncap 100 
Exit Code: 0
Output:

2018-11-02 22:47:28 (13736): 
Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" keyboardputscancode 0x39
Exit Code: 0
Output:
VBoxManage: error: Error: '0x39' is not a hex byte!

2018-11-02 22:47:29 (13736): 
Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" screenshotpng "/var/lib/boinc-client/slots/14/vbox_screenshot.png"
Exit Code: 0
Output:

2018-11-02 22:47:29 (13736): 
Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" poweroff
Exit Code: 0
Output:
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

2018-11-02 22:47:29 (13736): 
Command: VBoxManage -q snapshot "boinc_5b589ba7ed7d5e3c" list 
Exit Code: 0
Output:
This machine does not have any snapshots

2018-11-02 22:47:29 (13736): 
Command: VBoxManage -q bandwidthctl "boinc_5b589ba7ed7d5e3c" remove "boinc_5b589ba7ed7d5e3c_net" 
Exit Code: 0
Output:

2018-11-02 22:47:29 (13736): 
Command: VBoxManage -q storagectl "boinc_5b589ba7ed7d5e3c" --name "Hard Disk Controller" --remove 
Exit Code: 0
Output:

2018-11-02 22:47:29 (13736): 
Command: VBoxManage -q unregistervm "boinc_5b589ba7ed7d5e3c" --delete 
Exit Code: 0
Output:
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

2018-11-02 22:47:30 (13736): 
Command: VBoxManage -q closemedium disk "/var/lib/boinc-client/slots/14/vm_image.vdi" --delete 
Exit Code: 0
Output:
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

22:47:35 (13736): called boinc_finish(194)

</stderr_txt>
]]>
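For anyone puzzled by the odd-looking numbers in the log above: the exit codes are the same values printed as signed integers. A minimal Python sketch, assuming the usual two's-complement reading (the helper names are made up for illustration):

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def to_signed8(value):
    """Interpret the low 8 bits of value as a signed 8-bit integer."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


# VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) shows up as "Exit Code: -2135228415".
print(to_signed32(0x80BB0001))

# "process exited with code 194 (0xc2, -62)" is one byte shown three ways.
print(hex(194), to_signed8(194))
```

So the showvminfo "error" at the start is just the wrapper checking for a VM that had not been created yet, not a separate failure code.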

ID: 37187
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37190 - Posted: 3 Nov 2018, 7:10:56 UTC - in response to Message 37187.  

1) Click on my name, follow the links, and drill down: you can see my hosts, all the tasks my hosts run, and the stderr output for each task, so there is no need to waste space here pasting complete stderr outputs. If you want to draw attention to a particular task and its stderr output, paste the URL to that task, highlight the URL (hold SHIFT while dragging the mouse over it), then click the URL button above the edit box to turn it into a hyperlink in the final post.

2) LHCb tasks are totally fubar. Even when they validate they don't do any work. They're a complete waste of your resources. Disable them and forget about LHCb. If I had to hazard a guess as to why they are failing, I would say it's because you are allocating all 32 cores to them, which makes the VM take so long to set up that it fails to produce the heartbeat on time, and then BOINC kills the task. Check the stderr output and note that the missing heartbeat ("VM Heartbeat file specified, but missing.") is what's doing it.

3) At the moment only the Sixtrack, ATLAS and Theory apps work. Go to your website prefs and disable all the apps except those 3 and uncheck the "If no work for selected applications is available, accept work from other applications" box.

4) You completed and validated a number of ATLAS tasks recently. Those ran on 8 cores instead of 32, which reinforces my earlier suggestion that 32 is too many. I'm not sure whether you're using an app_config.xml to specify cores or just the website prefs. I suspect you are using website prefs and have "Max # of cores" and "Max # of CPUs" set to unlimited. Set both of those to 8 for now. Then, in BOINC Manager, set LHC to "no new tasks", abort all the tasks in your cache, and set LHC to "allow new tasks" again. Whatever you do, do NOT leave Max # of tasks set to unlimited: it causes your host to download far more tasks than it can process, which is why the server is canceling tasks that were not processed in time.

Try the above adjustments for now and see what happens.
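
If you want to see for yourself how long the VM ran before the heartbeat check killed it, here is a rough Python sketch. It assumes the timestamp format shown in the stderr output above; the two log lines are copied from that output and the function names are invented for illustration.

```python
from datetime import datetime

# Two lines copied from the stderr output in the first post.
LOG = """\
2018-11-02 22:37:33 (13736): Successfully started VM. (PID = '14098')
2018-11-02 22:47:28 (13736): VM Heartbeat file specified, but missing.
"""


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a wrapper log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def seconds_until_heartbeat_failure(log):
    """Return seconds between VM start and the first missing-heartbeat line, or None."""
    started = failed = None
    for line in log.splitlines():
        if "Successfully started VM" in line:
            started = parse_ts(line)
        elif "VM Heartbeat file specified, but missing." in line and failed is None:
            failed = parse_ts(line)
    if started is None or failed is None:
        return None
    return (failed - started).total_seconds()


print(seconds_until_heartbeat_failure(LOG))
```

For the task quoted above this comes out at a little under ten minutes between "Successfully started VM" and the power-off.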
ID: 37190
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2410
Credit: 226,001,957
RAC: 125,858
Message 37202 - Posted: 3 Nov 2018, 11:47:11 UTC - in response to Message 37187.  

I didn't check how many VMs you try to run concurrently, but they may eat up all of your RAM.

While ATLAS is restricted to no more than 8 cores/10.2 GB with default settings, LHCb really tries to use all of your cores (32), which results in a RAM request of roughly 41 GB per VM. Even if those VMs do no real work, the RAM is not available for other use and your computer may start to swap sooner or later.
The latter is most likely the reason why the heartbeat file is delayed.

As already suggested, you should explicitly limit the number of cores. This will automatically limit the (default) RAM setting.
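
To put rough numbers on that, here is a back-of-the-envelope Python sketch using the figures from the stderr output in the first post (42348 MB per VM, 126212 MB reported as available). It ignores whatever the host OS, BOINC and VirtualBox need for themselves, and how many VMs actually ran at once is not known, so treat it as an upper bound only. The function names are just for illustration.

```python
VM_MEMORY_MB = 42348        # "Setting Memory Size for VM. (42348MB)"
HOST_AVAILABLE_MB = 126212  # "Memory available: 126212 MByte"


def max_concurrent_vms(available_mb, per_vm_mb):
    """How many VMs of the given size fit into the available memory."""
    return available_mb // per_vm_mb


def overcommit_mb(n_vms, available_mb, per_vm_mb):
    """Memory requested beyond what is available, in MB (0 if everything fits)."""
    return max(0, n_vms * per_vm_mb - available_mb)


print(max_concurrent_vms(HOST_AVAILABLE_MB, VM_MEMORY_MB))
print(overcommit_mb(3, HOST_AVAILABLE_MB, VM_MEMORY_MB))
```

Two such VMs fit; a third would already ask for more than the log reports as available.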
ID: 37202
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37203 - Posted: 3 Nov 2018, 12:09:18 UTC - in response to Message 37202.  

Swapping was my first thought too, but then I checked the host's details. It has 128 GB RAM. Whether it swaps depends on what other apps are running and how much RAM they require.
ID: 37203
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2410
Credit: 226,001,957
RAC: 125,858
Message 37204 - Posted: 3 Nov 2018, 12:30:12 UTC - in response to Message 37203.  

You think 128 GB RAM prevents a host from swapping?
Maybe, if you only play Tetris.
ID: 37204
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37205 - Posted: 3 Nov 2018, 13:34:00 UTC - in response to Message 37204.  

No, I don't think that. And that's why I didn't say that. Why are you so desperate to make it appear that I did?
ID: 37205
