LHC Task errors on Ubuntu 18.04.1 with VirtualBox 5.2.10
Author | Message |
---|---|
Joined: 22 Sep 18 Posts: 1 Credit: 169,693 RAC: 0 |
I have a 64-bit machine running Ubuntu 18.04.1 (2x E5-2670 8-core CPU w/ 128GB RAM) that seems to be failing all LHCb workloads (142 so far). See https://lhcathome.cern.ch/lhcathome/result.php?resultid=208011319 for an example. I can launch VirtualBox under the boinc account and see that a boinc virtual machine exists and is running. Below I have listed the contents of the stderr output. I have also listed my VirtualBox version for reference. ii virtualbox 5.2.10-dfsg-6ubuntu18.04.1 amd64 x86 virtualization solution - base binaries ii virtualbox-dkms 5.2.10-dfsg-6ubuntu18.04.1 all x86 virtualization solution - kernel module sources for dkms ii virtualbox-qt 5.2.10-dfsg-6ubuntu18.04.1 amd64 x86 virtualization solution - Qt based user interface <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process exited with code 194 (0xc2, -62)</message> <stderr_txt> 2018-11-02 22:37:29 (13736): vboxwrapper (7.7.26196): starting 2018-11-02 22:37:30 (13736): Feature: Checkpoint interval offset (530 seconds) 2018-11-02 22:37:30 (13736): Detected: VirtualBox VboxManage Interface (Version: 5.2.10) 2018-11-02 22:37:30 (13736): Detected: Minimum checkpoint interval (600.000000 seconds) 2018-11-02 22:37:30 (13736): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) 2018-11-02 22:37:30 (13736): Successfully copied 'init_data.xml' to the shared directory. 2018-11-02 22:37:30 (13736): Create VM. (boinc_5b589ba7ed7d5e3c, slot#14) 2018-11-02 22:37:30 (13736): Setting Memory Size for VM. (42348MB) 2018-11-02 22:37:30 (13736): Setting CPU Count for VM. (32) 2018-11-02 22:37:30 (13736): Setting Chipset Options for VM. 2018-11-02 22:37:30 (13736): Setting Boot Options for VM. 2018-11-02 22:37:30 (13736): Setting Network Configuration for NAT. 2018-11-02 22:37:31 (13736): Enabling VM Network Access. 2018-11-02 22:37:31 (13736): Disabling USB Support for VM. 2018-11-02 22:37:31 (13736): Disabling COM Port Support for VM. 
2018-11-02 22:37:31 (13736): Disabling LPT Port Support for VM. 2018-11-02 22:37:31 (13736): Disabling Audio Support for VM. 2018-11-02 22:37:31 (13736): Disabling Clipboard Support for VM. 2018-11-02 22:37:31 (13736): Disabling Drag and Drop Support for VM. 2018-11-02 22:37:31 (13736): Adding storage controller(s) to VM. 2018-11-02 22:37:31 (13736): Adding virtual disk drive to VM. (vm_image.vdi) 2018-11-02 22:37:31 (13736): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB) 2018-11-02 22:37:31 (13736): forwarding host port 39209 to guest port 80 2018-11-02 22:37:31 (13736): Enabling remote desktop for VM. 2018-11-02 22:37:31 (13736): Required extension pack not installed, remote desktop not enabled. 2018-11-02 22:37:31 (13736): Enabling shared directory for VM. 2018-11-02 22:37:32 (13736): Starting VM. (boinc_5b589ba7ed7d5e3c, slot#14) 2018-11-02 22:37:33 (13736): Successfully started VM. (PID = '14098') 2018-11-02 22:37:33 (13736): Reporting VM Process ID to BOINC. 2018-11-02 22:37:33 (13736): Guest Log: BIOS: VirtualBox 5.2.10 2018-11-02 22:37:33 (13736): Guest Log: CPUID EDX: 0x178bfbff 2018-11-02 22:37:33 (13736): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63 2018-11-02 22:37:33 (13736): VM state change detected. (old = 'poweroff', new = 'running') 2018-11-02 22:37:33 (13736): Detected: Web Application Enabled (http://localhost:39209) 2018-11-02 22:37:34 (13736): Preference change detected 2018-11-02 22:37:34 (13736): Setting CPU throttle for VM. (100%) 2018-11-02 22:37:34 (13736): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds)) 2018-11-02 22:37:36 (13736): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032 2018-11-02 22:37:36 (13736): Guest Log: BIOS: Booting from Hard Disk... 
2018-11-02 22:37:37 (13736): Guest Log: BIOS: KBD: unsupported int 16h function 03 2018-11-02 22:37:37 (13736): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 2018-11-02 22:47:28 (13736): VM Heartbeat file specified, but missing. 2018-11-02 22:47:28 (13736): VM Heartbeat file specified, but missing file system status. (errno = '2') 2018-11-02 22:47:28 (13736): Capturing screenshot. 2018-11-02 22:47:29 (13736): Screenshot completed. 2018-11-02 22:47:29 (13736): Powering off VM. 2018-11-02 22:47:29 (13736): Successfully stopped VM. 2018-11-02 22:47:29 (13736): Deregistering VM. (boinc_5b589ba7ed7d5e3c, slot#14) 2018-11-02 22:47:29 (13736): Removing network bandwidth throttle group from VM. 2018-11-02 22:47:29 (13736): Removing storage controller(s) from VM. 2018-11-02 22:47:29 (13736): Removing VM from VirtualBox. 2018-11-02 22:47:29 (13736): Removing virtual disk drive from VirtualBox. 2018-11-02 22:47:35 (13736): Failed to open screenshot image file. (2) Hypervisor System Log: VM Execution Log: VM Startup Log: VM Trace Log: Processor#18 speed: 3300 MHz Processor#18 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#19 speed: 3300 MHz Processor#19 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#20 speed: 3300 MHz Processor#20 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#21 speed: 3300 MHz Processor#21 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#22 speed: 3300 MHz Processor#22 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#23 speed: 3300 MHz Processor#23 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#24 speed: 3300 MHz Processor#24 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#25 speed: 3300 MHz Processor#25 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#26 speed: 3300 MHz Processor#26 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#27 speed: 3300 MHz Processor#27 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 
2.60GHz Processor#28 speed: 3300 MHz Processor#28 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#29 speed: 3300 MHz Processor#29 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#30 speed: 3300 MHz Processor#30 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Processor#31 speed: 3300 MHz Processor#31 description: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz Memory size: 128865 MByte Memory available: 126212 MByte Operating system: Linux Operating system version: 4.15.0-34-generic 2018-11-02 22:37:30 (13736): Command: VBoxManage -q showvminfo "boinc_5b589ba7ed7d5e3c" --machinereadable Exit Code: -2135228415 Output: VBoxManage: error: Could not find a registered machine named 'boinc_5b589ba7ed7d5e3c' VBoxManage: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee nsISupports VBoxManage: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2834 of file VBoxManageInfo.cpp 2018-11-02 22:37:30 (13736): Command: VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/14/vm_image.vdi" Exit Code: 0 Output: UUID: 3e2dab7c-628d-4cca-8651-aaa190ca94ae Parent UUID: base State: created Type: normal (base) Location: /var/lib/boinc-client/slots/14/vm_image.vdi Storage format: VDI Format variant: dynamic default Capacity: 20480 MBytes Size on disk: 2325 MBytes Encryption: disabled 2018-11-02 22:37:30 (13736): Command: VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/14/vm_image.vdi" Exit Code: 0 Output: UUID: 3e2dab7c-628d-4cca-8651-aaa190ca94ae Parent UUID: base State: created Type: normal (base) Location: /var/lib/boinc-client/slots/14/vm_image.vdi Storage format: VDI Format variant: dynamic default Capacity: 20480 MBytes Size on disk: 2325 MBytes Encryption: disabled 2018-11-02 22:37:30 (13736): Command: VBoxManage -q closemedium disk "/var/lib/boinc-client/slots/14/vm_image.vdi" Exit Code: 0 Output: 2018-11-02 22:37:30 (13736): Command: VBoxManage 
-q createvm --name "boinc_5b589ba7ed7d5e3c" --basefolder "/var/lib/boinc-client/slots/14" --ostype "Linux26_64" --register Exit Code: 0 Output: Virtual machine 'boinc_5b589ba7ed7d5e3c' is created and registered. UUID: 34f1503d-39b1-4a76-b242-93809942b150 Settings file: '/var/lib/boinc-client/slots/14/boinc_5b589ba7ed7d5e3c/boinc_5b589ba7ed7d5e3c.vbox' 2018-11-02 22:37:30 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --description "LHCb_3945969_1540921935.969064_0" Exit Code: 0 Output: 2018-11-02 22:37:30 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --memory 42348 Exit Code: 0 Output: 2018-11-02 22:37:30 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --cpus 32 Exit Code: 0 Output: 2018-11-02 22:37:30 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --acpi on --ioapic on Exit Code: 0 Output: 2018-11-02 22:37:30 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --boot1 disk --boot2 dvd --boot3 none --boot4 none Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --nic1 nat --natdnsproxy1 on --cableconnected1 off Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --cableconnected1 on Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --usb off Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --uart1 off --uart2 off Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --lpt1 off --lpt2 off Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --audio none Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --clipboard disabled Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q 
modifyvm "boinc_5b589ba7ed7d5e3c" --draganddrop disabled Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q storagectl "boinc_5b589ba7ed7d5e3c" --name "Hard Disk Controller" --add "ide" --controller "PIIX4" Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q storageattach "boinc_5b589ba7ed7d5e3c" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --setuuid "" --medium "/var/lib/boinc-client/slots/14/vm_image.vdi" Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q bandwidthctl "boinc_5b589ba7ed7d5e3c" add "boinc_5b589ba7ed7d5e3c_net" --type network --limit 1024G Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q modifyvm "boinc_5b589ba7ed7d5e3c" --natpf1 ",tcp,127.0.0.1,39209,,80" Exit Code: 0 Output: 2018-11-02 22:37:31 (13736): Command: VBoxManage -q list extpacks Exit Code: 0 Output: Extension Packs: 1 Pack no. 0: VNC Version: 5.2.10 Revision: 121806 Edition: Description: VNC plugin module VRDE Module: VBoxVNC Usable: true Why unusable: 2018-11-02 22:37:32 (13736): Command: VBoxManage -q sharedfolder add "boinc_5b589ba7ed7d5e3c" --name "shared" --hostpath "/var/lib/boinc-client/slots/14/shared" Exit Code: 0 Output: 2018-11-02 22:37:32 (13736): Command: VBoxManage -q startvm "boinc_5b589ba7ed7d5e3c" --type headless Exit Code: 0 Output: Waiting for VM "boinc_5b589ba7ed7d5e3c" to power on... VM "boinc_5b589ba7ed7d5e3c" has been successfully started. 2018-11-02 22:37:34 (13736): Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" cpuexecutioncap 100 Exit Code: 0 Output: 2018-11-02 22:47:28 (13736): Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" keyboardputscancode 0x39 Exit Code: 0 Output: VBoxManage: error: Error: '0x39' is not a hex byte! 
2018-11-02 22:47:29 (13736): Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" screenshotpng "/var/lib/boinc-client/slots/14/vbox_screenshot.png" Exit Code: 0 Output: 2018-11-02 22:47:29 (13736): Command: VBoxManage -q controlvm "boinc_5b589ba7ed7d5e3c" poweroff Exit Code: 0 Output: 0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100% 2018-11-02 22:47:29 (13736): Command: VBoxManage -q snapshot "boinc_5b589ba7ed7d5e3c" list Exit Code: 0 Output: This machine does not have any snapshots 2018-11-02 22:47:29 (13736): Command: VBoxManage -q bandwidthctl "boinc_5b589ba7ed7d5e3c" remove "boinc_5b589ba7ed7d5e3c_net" Exit Code: 0 Output: 2018-11-02 22:47:29 (13736): Command: VBoxManage -q storagectl "boinc_5b589ba7ed7d5e3c" --name "Hard Disk Controller" --remove Exit Code: 0 Output: 2018-11-02 22:47:29 (13736): Command: VBoxManage -q unregistervm "boinc_5b589ba7ed7d5e3c" --delete Exit Code: 0 Output: 0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100% 2018-11-02 22:47:30 (13736): Command: VBoxManage -q closemedium disk "/var/lib/boinc-client/slots/14/vm_image.vdi" --delete Exit Code: 0 Output: 0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100% 22:47:35 (13736): called boinc_finish(194) </stderr_txt> ]]> |
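For readers who would rather not scan the whole stderr dump by eye, the few lines that matter can be pulled out with a short script. This is only a sketch: it assumes the log has one entry per line (as in the original task page, not the flattened paste above), and the function name `summarize` and the dictionary keys are made up for illustration.

```python
import re
from datetime import datetime

# Matches vboxwrapper entries such as
# "2018-11-02 22:37:30 (13736): Setting CPU Count for VM. (32)"
LINE_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)")


def summarize(lines):
    """Collect start time, VM size and the first heartbeat failure (hypothetical helper)."""
    summary = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip VBoxManage output and other untimestamped lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        msg = match.group(2)
        if msg.startswith("vboxwrapper") and msg.endswith("starting"):
            summary["started"] = stamp
        elif msg.startswith("Setting Memory Size for VM."):
            summary["memory_mb"] = int(re.search(r"\((\d+)MB\)", msg).group(1))
        elif msg.startswith("Setting CPU Count for VM."):
            summary["cpus"] = int(re.search(r"\((\d+)\)", msg).group(1))
        elif (msg.startswith("VM Heartbeat file specified, but missing")
              and "heartbeat_missing" not in summary):
            summary["heartbeat_missing"] = stamp
    if "started" in summary and "heartbeat_missing" in summary:
        delta = summary["heartbeat_missing"] - summary["started"]
        summary["seconds_to_failure"] = int(delta.total_seconds())
    return summary


# Four entries copied from the stderr output above.
sample = [
    "2018-11-02 22:37:29 (13736): vboxwrapper (7.7.26196): starting",
    "2018-11-02 22:37:30 (13736): Setting Memory Size for VM. (42348MB)",
    "2018-11-02 22:37:30 (13736): Setting CPU Count for VM. (32)",
    "2018-11-02 22:47:28 (13736): VM Heartbeat file specified, but missing.",
]
result = summarize(sample)
print(result["cpus"], result["memory_mb"], result["seconds_to_failure"])
```

On the sample entries this reports a 32-core, 42348 MB VM that was shut down 599 seconds after the wrapper started, which is the missing-heartbeat failure visible in the log.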
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
1) Click on my name, follow the links and drill down: you can see my hosts, all the tasks my hosts run and the stderr output for each task. There is no need for you to waste space here with pastes of complete stderr outputs. If you want to draw attention to a particular task and its stderr output, paste the URL of that task, highlight the URL (hold SHIFT while dragging the mouse over the URL), then click the URL button above the edit box to transform the URL into a hyperlink in the final post.
2) LHCb tasks are totally fubar. Even when they validate they don't do any work. They're a complete waste of your resources. Disable them and forget about LHCb. If I had to hazard a guess as to why they are failing, I would say it's because you are allocating all 32 cores to them: the VM takes so long to set up that it fails to produce the heartbeat on time, and then BOINC kills the task. Check the stderr output and note that the "no heartbeat" error is what's doing it.
3) At the moment only the Sixtrack, ATLAS and Theory apps work. Go to your website prefs, disable all the apps except those 3 and uncheck the "If no work for selected applications is available, accept work from other applications" box.
4) You completed and validated a number of ATLAS tasks recently. Those ran on 8 cores instead of 32, which reinforces my earlier suggestion regarding 32 being too many. Not sure if you're using an app_config.xml to specify cores or just using the website prefs. I suspect you are using website prefs and have "Max # of cores" and "Max # of CPUs" set to unlimited. Set both of those to 8 for now. Then in BOINC manager set LHC to "no new tasks", abort all the tasks in your cache and set LHC to "allow new tasks". Whatever you do, do NOT leave Max # of tasks set to unlimited... it's causing your host to download far more tasks than it can process, which is why the server is canceling tasks... they were not processed in time.
Try the above adjustments for now and see what happens. |
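For anyone who prefers the app_config.xml route mentioned in point 4, the file can be generated with a few lines of Python. This is a rough sketch, not a tested configuration: the app name `ATLAS`, the plan class `vbox64_mt_mcore_atlas` and the `max_concurrent` value of 2 are placeholders to be checked against your own client_state.xml, and `build_app_config` is a made-up helper name. Whether `avg_ncpus` alone changes the core count inside the VM may depend on how the project sets up its apps.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus, max_concurrent):
    """Return app_config.xml text limiting one app (hypothetical helper)."""
    root = ET.Element("app_config")

    # <app>: how many tasks of this app may run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version>: how many CPUs BOINC budgets per task of this plan class.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)

    return ET.tostring(root, encoding="unicode")


# Placeholder names and values; verify them for your own host before use.
xml_text = build_app_config("ATLAS", "vbox64_mt_mcore_atlas",
                            ncpus=8, max_concurrent=2)
print(xml_text)
```

The output would go into app_config.xml in the project's directory under the BOINC data directory, after which the client has to re-read its config files (or be restarted) for the file to take effect.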
Joined: 15 Jun 08 Posts: 2410 Credit: 226,001,957 RAC: 125,858 |
I didn't check how many VMs you try to run concurrently, but they may eat up all of your RAM. While ATLAS is restricted to no more than 8 cores/10.2 GB with default settings, LHCb really tries to use all of your cores (32), which results in a RAM request of roughly 41 GB per VM. Even if those VMs do no real work, the RAM is not available for other use and your computer may start to swap sooner or later. The latter is most likely the reason why the heartbeat file is delayed. As already suggested, you should explicitly limit the number of cores. This will automatically limit the (default) RAM setting. |
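To put rough numbers on the RAM argument, here is a back-of-the-envelope sketch using the figures from the stderr output in the first post (42348 MB per 32-core LHCb VM, 126212 MB reported as available). The 10.2 GB ATLAS figure is converted with 1024 MB per GB, which is an assumption, and the calculation ignores everything else running on the host as well as the CPU count.

```python
def vms_that_fit(ram_mb, per_vm_mb):
    """Number of whole VMs whose memory requests fit into ram_mb."""
    return ram_mb // per_vm_mb


AVAILABLE_MB = 126212            # "Memory available" in the stderr output
LHCB_VM_MB = 42348               # "Setting Memory Size for VM. (42348MB)"
ATLAS_VM_MB = int(10.2 * 1024)   # 10.2 GB default, assuming 1024 MB per GB

lhcb_fit = vms_that_fit(AVAILABLE_MB, LHCB_VM_MB)
atlas_fit = vms_that_fit(AVAILABLE_MB, ATLAS_VM_MB)
print("32-core LHCb VMs that fit: ", lhcb_fit)   # 2
print("8-core ATLAS VMs that fit:", atlas_fit)   # 12
```

By this estimate a third 32-core VM would already exceed the available memory, while the 8-core default leaves far more headroom.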
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Swapping was my first thought too but then I checked the host's details. It has 128 GB RAM. It depends on what other apps are running and how much RAM they require. |
Joined: 15 Jun 08 Posts: 2410 Credit: 226,001,957 RAC: 125,858 |
You think 128 GB RAM prevents a host from swapping? Maybe, if you only play Tetris. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
No, I don't think that. And that's why I didn't say that. Why are you so desperate to make it appear that I did? |
©2024 CERN