Message boards : ATLAS application : Computing error

Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 36797 - Posted: 21 Sep 2018, 4:58:26 UTC

Lately I've been getting this error on one of my hosts in many WUs, though not all of them. It started around the beginning of this week (after the outage?); before that everything seemed to run more or less OK. I'm crunching 6 WUs with 8 cores each, and the host has 96 GB of RAM. Could it be a lack of RAM? It was running 8 WUs without problems before. On my other hosts I also get this error, but in very few WUs.

Any ideas?

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10564024
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10558422

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<stderr_txt>
2018-09-20 23:45:46 (54265): vboxwrapper (7.7.26196): starting
2018-09-20 23:45:47 (54265): Feature: Checkpoint interval offset (466 seconds)
2018-09-20 23:45:47 (54265): Detected: VirtualBox VboxManage Interface (Version: 5.2.18)
2018-09-20 23:45:47 (54265): Detected: Minimum checkpoint interval (900.000000 seconds)
2018-09-20 23:45:47 (54265): Successfully copied 'init_data.xml' to the shared directory.
2018-09-20 23:45:47 (54265): Create VM. (boinc_b70d56aa415278d2, slot#6)
2018-09-20 23:45:53 (54265): Setting Memory Size for VM. (10200MB)
2018-09-20 23:45:54 (54265): Setting CPU Count for VM. (8)
2018-09-20 23:45:54 (54265): Setting Chipset Options for VM.
2018-09-20 23:45:54 (54265): Setting Boot Options for VM.
2018-09-20 23:45:54 (54265): Setting Network Configuration for NAT.
2018-09-20 23:45:54 (54265): Enabling VM Network Access.
2018-09-20 23:45:54 (54265): Disabling USB Support for VM.
2018-09-20 23:45:54 (54265): Disabling COM Port Support for VM.
2018-09-20 23:45:55 (54265): Disabling LPT Port Support for VM.
2018-09-20 23:45:55 (54265): Disabling Audio Support for VM.
2018-09-20 23:45:55 (54265): Disabling Clipboard Support for VM.
2018-09-20 23:45:55 (54265): Disabling Drag and Drop Support for VM.
2018-09-20 23:45:55 (54265): Adding storage controller(s) to VM.
2018-09-20 23:45:55 (54265): Adding virtual disk drive to VM. (vm_image.vdi)
2018-09-20 23:45:55 (54265): Adding VirtualBox Guest Additions to VM.
2018-09-20 23:45:55 (54265): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2018-09-20 23:45:56 (54265): forwarding host port 57873 to guest port 80
2018-09-20 23:45:56 (54265): Enabling remote desktop for VM.
2018-09-20 23:45:56 (54265): Required extension pack not installed, remote desktop not enabled.
2018-09-20 23:45:56 (54265): Enabling shared directory for VM.
2018-09-20 23:45:56 (54265): Starting VM. (boinc_b70d56aa415278d2, slot#6)
2018-09-20 23:45:59 (54265): Successfully started VM. (PID = '56560')
2018-09-20 23:45:59 (54265): Reporting VM Process ID to BOINC.
2018-09-20 23:46:07 (54265): Guest Log: BIOS: VirtualBox 5.2.18
2018-09-20 23:46:07 (54265): Guest Log: CPUID EDX: 0x178bfbff
2018-09-20 23:46:07 (54265): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2018-09-20 23:46:07 (54265): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2018-09-20 23:46:07 (54265): Guest Log: BIOS: Booting from Hard Disk...
2018-09-20 23:46:07 (54265): Guest Log: BIOS: KBD: unsupported int 16h function 03
2018-09-20 23:46:07 (54265): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2018-09-20 23:46:07 (54265): VM state change detected. (old = 'poweroff', new = 'running')
2018-09-20 23:46:07 (54265): Detected: Web Application Enabled (http://localhost:57873)
2018-09-20 23:46:07 (54265): Preference change detected
2018-09-20 23:46:07 (54265): Setting CPU throttle for VM. (100%)
2018-09-20 23:46:12 (54265): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 900 seconds))
2018-09-20 23:47:22 (54265): Guest Log: vboxguest: major 0, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2018-09-20 23:47:34 (54265): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff88028dcbac10), OR(0x0), NOT(0xffffffff), flags(0x0)
2018-09-20 23:47:34 (54265): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff880287253610), OR(0x0), NOT(0xffffffff), flags(0x0)
2018-09-20 23:47:34 (54265): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff88028dcba810), OR(0x0), NOT(0xffffffff), flags(0x0)
2018-09-20 23:47:34 (54265): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff880287276810), OR(0x0), NOT(0xffffffff), flags(0x0)
2018-09-20 23:48:15 (54265): Guest Log: Copying input files into RunAtlas.
2018-09-20 23:48:19 (54265): Guest Log: Copied input files into RunAtlas.
2018-09-20 23:48:28 (54265): Guest Log: copied the webapp to /var/www
2018-09-20 23:48:28 (54265): Guest Log: This vm does not need to setup http proxy
2018-09-20 23:48:28 (54265): Guest Log: ATHENA_PROC_NUMBER=8
2018-09-20 23:48:29 (54265): Guest Log: Starting ATLAS job. (PandaID=4064500070 taskID=15385155)
2018-09-20 23:59:05 (54265): Preference change detected
2018-09-20 23:59:05 (54265): Setting CPU throttle for VM. (100%)
2018-09-20 23:59:05 (54265): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 900 seconds))
2018-09-21 00:05:13 (54265): Preference change detected
2018-09-21 00:05:13 (54265): Setting CPU throttle for VM. (100%)
2018-09-21 00:05:13 (54265): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 900 seconds))
2018-09-21 00:27:04 (54265): Preference change detected
2018-09-21 00:27:04 (54265): Setting CPU throttle for VM. (100%)
2018-09-21 00:27:04 (54265): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 900 seconds))
2018-09-21 01:25:18 (54265): Status Report: Elapsed Time: '6000.363421'
2018-09-21 01:25:18 (54265): Status Report: CPU Time: '41699.990000'
2018-09-21 02:36:24 (54265): VM is no longer is a running state. It is in 'poweroff'.
2018-09-21 02:36:24 (54265): VM state change detected. (old = 'running', new = 'poweroff')
2018-09-21 02:36:24 (54265): Powering off VM.
2018-09-21 02:36:24 (54265): Deregistering VM. (boinc_b70d56aa415278d2, slot#6)
2018-09-21 02:36:24 (54265): Removing network bandwidth throttle group from VM.
2018-09-21 02:36:24 (54265): Removing storage controller(s) from VM.
2018-09-21 02:36:24 (54265): Removing VM from VirtualBox.
2018-09-21 02:36:25 (54265): Removing virtual disk drive from VirtualBox.
2018-09-21 02:36:30 (54265): Virtual machine exited.
02:36:30 (54265): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>O47KDms37MtnlyackoJh5iwnABFKDmABFKDmTdiMDmABFKDmXHHw9m_2_r254113286_ATLAS_result</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36799 - Posted: 21 Sep 2018, 7:49:00 UTC - in response to Message 36797.  

You are running Theory tasks on one host and ATLAS tasks on the other.

According to David Cameron in https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4178&postid=29560#29560 the RAM formula for ATLAS VBox tasks is: 3 GB + 0.9 GB * ncores.
For 6 X 8 core tasks: 6 * ( 3 + 0.9 * 8) = 61.2 GB
For 8 X 8 core tasks: 8 * ( 3 + 0.9 * 8) = 81.6 GB

According to Crystal Pellet in https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4790&postid=36791#36791 the RAM formula for Theory tasks is: 630 MB + 100 MB * ncores
For 6 X 8 core tasks: 6 * ( 630 + 100 * 8) = 8580 MB = 8.4 GB

Either way, that is well below your 96 GB, so you have enough RAM. Most likely the errors come from the recent outage.
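The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the two quoted formulas, not anything the project ships: the function names are made up, and it assumes 1 GB = 1000 MB for the ATLAS formula, which is what the "Setting Memory Size for VM. (10200MB)" line in the log above suggests for 8 cores.

```python
# Hypothetical helpers; the formulas are the ones quoted above,
# the function names and the MB conversion are assumptions.

def atlas_vm_ram_mb(ncores):
    """ATLAS VBox task: 3 GB + 0.9 GB * ncores, taken as 3000 MB + 900 MB * ncores."""
    return 3000 + 900 * ncores

def theory_vm_ram_mb(ncores):
    """Theory task: 630 MB + 100 MB * ncores."""
    return 630 + 100 * ncores

def total_ram_mb(per_task_mb, ntasks):
    """RAM needed to run ntasks identical tasks at the same time."""
    return per_task_mb * ntasks

def fits_in_host(total_mb, host_ram_gb):
    """Compare against host RAM, counting 1 GB as 1024 MB for the host."""
    return total_mb <= host_ram_gb * 1024

if __name__ == "__main__":
    for ntasks in (6, 8):
        need = total_ram_mb(atlas_vm_ram_mb(8), ntasks)
        print(ntasks, "ATLAS tasks x 8 cores:", need, "MB, fits in 96 GB:", fits_in_host(need, 96))
    print("6 Theory tasks x 8 cores:", total_ram_mb(theory_vm_ram_mb(8), 6), "MB")
```

Note that this only counts the memory assigned to the VMs; whatever the host OS and VirtualBox itself need comes on top of it.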
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 36805 - Posted: 21 Sep 2018, 18:31:53 UTC

Yes, that is the calculation I made, but the errors kept occurring, apparently at random but constantly.

Looking at the VM data in VirtualBox, I find two different base memory sizes for the VMs: 5000 MB and 10200 MB. The latter are the ones failing. I reduced to 5 WUs per host and have had no errors so far.

I do not see any difference in the WU names that would allow me to tell them apart, and I cannot rule out that it is something on my end.
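One way to tell the two VM sizes apart without opening VirtualBox might be to read the values from each task's stderr output, since vboxwrapper logs them at startup. A minimal sketch, assuming the log lines always look like the "Setting Memory Size for VM." and "Setting CPU Count for VM." lines in the log posted above (the function name and the sample text are only for illustration):

```python
import re

# Patterns copied from the stderr output posted earlier in this thread.
MEMORY_LINE = re.compile(r"Setting Memory Size for VM\. \((\d+)MB\)")
CPU_LINE = re.compile(r"Setting CPU Count for VM\. \((\d+)\)")

def vm_settings(stderr_text):
    """Return (memory_mb, cpu_count) found in a task's stderr, or None where missing."""
    mem = MEMORY_LINE.search(stderr_text)
    cpu = CPU_LINE.search(stderr_text)
    memory_mb = int(mem.group(1)) if mem else None
    cpu_count = int(cpu.group(1)) if cpu else None
    return memory_mb, cpu_count

sample = (
    "2018-09-20 23:45:53 (54265): Setting Memory Size for VM. (10200MB)\n"
    "2018-09-20 23:45:54 (54265): Setting CPU Count for VM. (8)\n"
)

if __name__ == "__main__":
    print(vm_settings(sample))
```

Running this over the stderr of the failed and the successful tasks would show whether the failures really line up with the 10200 MB VMs.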
