Message boards : ATLAS application : Bad WUs?
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · Next

AuthorMessage
Jonathan

Send message
Joined: 25 Sep 17
Posts: 76
Credit: 2,265,620
RAC: 4,850
Message 45834 - Posted: 9 Dec 2021, 1:28:42 UTC

Next VM started up but it isn't running anything inside. All three virtual terminals look identical. Not getting Atlas Event Process Monitoring start up on virt term 2 nor TOP on virt term 3.
Guessing something broken inside VM.

Hung work unit below. I am just going to leave it stuck for now.

boinc_493aaf73f80ced28
Application
ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Name
k5pLDmaoCF0nfZGDcpSWOuwoABFKDmABFKDmQ2DbDmABFKDmy3TEWm
State
Running
Received
12/8/2021 4:18:10 PM
Report deadline
12/15/2021 4:18:08 PM
Resources
4 CPUs
Estimated computation size
43,200 GFLOPs
CPU time
00:00:32
CPU time since checkpoint
00:00:00
Elapsed time
00:21:07
Estimated time remaining
01:57:53
Fraction done
15.195%
Virtual memory size
80.82 MB
Working set size
6.45 GB
Directory
slots/1
Process ID
25680
Progress rate
43.200% per hour
Executable
vboxwrapper_26198ab7_windows_x86_64.exe
ID: 45834 · Report as offensive     Reply Quote
Jonathan

Send message
Joined: 25 Sep 17
Posts: 76
Credit: 2,265,620
RAC: 4,850
Message 45835 - Posted: 9 Dec 2021, 3:33:35 UTC - in response to Message 45834.  

stderr.txt log from stuck work unit

2021-12-08 18:59:59 (25680): Detected: vboxwrapper 26202
2021-12-08 18:59:59 (25680): Detected: BOINC client v7.16.20
2021-12-08 18:59:59 (25680): Detected: VirtualBox VboxManage Interface (Version: 6.1.30)
2021-12-08 18:59:59 (25680): Successfully copied 'init_data.xml' to the shared directory.
2021-12-08 19:00:00 (25680): Create VM. (boinc_493aaf73f80ced28, slot#1)
2021-12-08 19:00:01 (25680): Setting Memory Size for VM. (6600MB)
2021-12-08 19:00:01 (25680): Setting CPU Count for VM. (4)
2021-12-08 19:00:01 (25680): Setting Chipset Options for VM.
2021-12-08 19:00:02 (25680): Setting Boot Options for VM.
2021-12-08 19:00:02 (25680): Setting Network Configuration for NAT.
2021-12-08 19:00:02 (25680): Enabling VM Network Access.
2021-12-08 19:00:03 (25680): Disabling USB Support for VM.
2021-12-08 19:00:03 (25680): Disabling COM Port Support for VM.
2021-12-08 19:00:03 (25680): Disabling LPT Port Support for VM.
2021-12-08 19:00:03 (25680): Disabling Audio Support for VM.
2021-12-08 19:00:04 (25680): Disabling Clipboard Support for VM.
2021-12-08 19:00:04 (25680): Disabling Drag and Drop Support for VM.
2021-12-08 19:00:04 (25680): Adding storage controller(s) to VM.
2021-12-08 19:00:04 (25680): Adding virtual disk drive to VM. (vm_image.vdi)
2021-12-08 19:00:05 (25680): Adding VirtualBox Guest Additions to VM.
2021-12-08 19:00:05 (25680): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2021-12-08 19:00:05 (25680): forwarding host port 59563 to guest port 80
2021-12-08 19:00:05 (25680): Enabling remote desktop for VM.
2021-12-08 19:00:06 (25680): Enabling shared directory for VM.
2021-12-08 19:00:06 (25680): Starting VM using VBoxManage interface. (boinc_493aaf73f80ced28, slot#1)
2021-12-08 19:00:11 (25680): Successfully started VM. (PID = '1956')
2021-12-08 19:00:11 (25680): Reporting VM Process ID to BOINC.
2021-12-08 19:00:11 (25680): Guest Log: BIOS: VirtualBox 6.1.30
2021-12-08 19:00:11 (25680): Guest Log: CPUID EDX: 0x178bfbff
2021-12-08 19:00:11 (25680): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2021-12-08 19:00:11 (25680): VM state change detected. (old = 'poweredoff', new = 'running')
2021-12-08 19:00:11 (25680): Detected: Web Application Enabled (http://localhost:59563)
2021-12-08 19:00:11 (25680): Detected: Remote Desktop Enabled (localhost:59564)
2021-12-08 19:00:11 (25680): Preference change detected
2021-12-08 19:00:11 (25680): Setting CPU throttle for VM. (100%)
2021-12-08 19:00:11 (25680): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 900 seconds))
2021-12-08 19:00:13 (25680): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2021-12-08 19:00:13 (25680): Guest Log: BIOS: Booting from Hard Disk...
2021-12-08 19:00:15 (25680): Guest Log: BIOS: KBD: unsupported int 16h function 03
2021-12-08 19:00:15 (25680): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f
2021-12-08 19:00:19 (25680): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds
2021-12-08 19:00:19 (25680): Guest Log: vboxguest: misc device minor 58, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2021-12-08 19:00:22 (25680): Guest Log: Checking CVMFS...
2021-12-08 19:00:24 (25680): Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000155 main     Log opened 2021-12-08T19:00:22.038429000Z
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000262 main     OS Product: Linux
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000297 main     OS Release: 3.10.0-957.27.2.el7.x86_64
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000325 main     OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000353 main     Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000354 main     Process ID: 1492
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000354 main     Package type: LINUX_64BITS_GENERIC
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.001280 main     5.2.32 r132073 started. Verbose level = 0
2021-12-08 19:00:34 (25680): Guest Log: 00:00:10.010700 timesync vgsvcTimeSyncWorker: Radical guest time change: 21 611 928 059 000ns (GuestNow=1 639 011 633 968 992 000 ns GuestLast=1 638 990 022 040 933 000 ns fSetTimeLastLoop=true )
2021-12-08 20:40:19 (25680): Status Report: Elapsed Time: '6000.000000'
2021-12-08 20:40:19 (25680): Status Report: CPU Time: '45.843750'


I looked at my successful tasks and it looks like it is hanging at the CVMFS check or something, mounting shared files and then copying files into RunAtlas.
Below is from a successful task where I think it may be failing or getting hung. Just a theory.


2021-12-08 16:23:47 (11704): Guest Log: CVMFS is ok
2021-12-08 16:23:47 (11704): Guest Log: Mounting shared directory
2021-12-08 16:23:47 (11704): Guest Log: Copying input files
2021-12-08 16:23:49 (11704): Guest Log: Copied input files into RunAtlas.
2021-12-08 16:23:50 (11704): Guest Log: copied the webapp to /var/www
2021-12-08 16:23:50 (11704): Guest Log: This VM did not configure a local http proxy via BOINC.
2021-12-08 16:23:50 (11704): Guest Log: Small home clusters do not require a local http proxy but it is suggested if
2021-12-08 16:23:50 (11704): Guest Log: more than 10 cores throughout the same LAN segment are regularly running ATLAS like tasks.
2021-12-08 16:23:50 (11704): Guest Log: Further information can be found at the LHC@home message board.
2021-12-08 16:23:50 (11704): Guest Log: Running cvmfs_config stat atlas.cern.ch
2021-12-08 16:23:50 (11704): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2021-12-08 16:23:50 (11704): Guest Log: 2.6.3.0 1617 0 29828 97456 3 1 1491998 4096001 0 65024 0 102 98.0392 533 455 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1
2021-12-08 16:23:50 (11704): Guest Log: ATHENA_PROC_NUMBER=4
2021-12-08 16:23:51 (11704): Guest Log:  *** Starting ATLAS job. (PandaID=5285080426 taskID=27554476) ***
ID: 45835 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 1669
Credit: 98,243,549
RAC: 307,106
Message 45836 - Posted: 9 Dec 2021, 4:38:55 UTC
Last modified: 9 Dec 2021, 4:41:46 UTC

Single Core running is ok with using vboxwrapper from Atlas-Task.https://lhcathome.cern.ch/lhcathome/results.php?hostid=10664116
This PC using Win10-Workstation and not Win10pro.
Have to upgrade in the next few hours my last RYZEN from Win10pro to Win11pro.
After this, will testing Atlas Multicore with default vboxwrapper from Atlas in Win11pro.
ID: 45836 · Report as offensive     Reply Quote
Jonathan

Send message
Joined: 25 Sep 17
Posts: 76
Credit: 2,265,620
RAC: 4,850
Message 45837 - Posted: 9 Dec 2021, 4:43:25 UTC - in response to Message 45835.  

Tried exiting Boinc Manager and tasks and restarting Boinc. Work unit failed

Task https://lhcathome.cern.ch/lhcathome/result.php?resultid=335523334

Workunit https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=177601336
ID: 45837 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 2100
Credit: 163,215,475
RAC: 118,220
Message 45838 - Posted: 9 Dec 2021, 7:32:18 UTC - in response to Message 45828.  

I usually run ATLAS native singlecore but to test what happens I started an ATLAS native 4-core.

Result:
Something deeper in the ATLAS multicore scripts is broken!
The task should write all task data to \slots\6 but is writing a couple of files to \slots\.
This is a major error and needs urgent investigation by the developers!

https://lhcathome.cern.ch/lhcathome/result.php?resultid=335511517
The 4-core task finished last night and got credits but since the HITS file was among the files that were written to the wrong directory level stderr.txt reported this:
[2021-12-09 03:45:01] No HITS result produced


Early this morning the same computer/BOINC client got another 2 4-core tasks (native).
So far both are running fine and keep all files within their own \slots\ directory.
ID: 45838 · Report as offensive     Reply Quote
tullio

Send message
Joined: 19 Feb 08
Posts: 704
Credit: 4,332,741
RAC: 21
Message 45839 - Posted: 9 Dec 2021, 7:58:18 UTC
Last modified: 9 Dec 2021, 7:59:01 UTC

I'vw killed three tasks on my 2 cores but one of them was doing good. The other two used very little CPU. The one good had a CPU time twice the running time.
Tullio
ID: 45839 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 1669
Credit: 98,243,549
RAC: 307,106
Message 45840 - Posted: 9 Dec 2021, 8:52:50 UTC

Win11pro successful installed.https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10631979
Two 2-core Atlas-Tasks are running. vboxwrapper 26198ab7 (default from Atlas-task).
ID: 45840 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1143
Credit: 7,004,238
RAC: 1,102
Message 45841 - Posted: 9 Dec 2021, 9:14:20 UTC - in response to Message 45823.  
Last modified: 9 Dec 2021, 14:02:29 UTC

Could anybody running one of the affected Windows computers try out the vboxwrapper that comes with CMS?
It's just to find out whether this would solve the problem or not.
After reading the problem here, I just tried one task yesterday.
I not used the actual vboxwrapper from the project, but an older version of 8-8-2016
https://lhcathome.cern.ch/lhcathome/result.php?resultid=335442116
The only thing I noticed was, it seems that it lasted longer before the 4 athena processes started (about 5 minutes)

Edit:As requested by computezrmle: I started an ATLAS task with the newest wrapper v26203 (used by CMS). Starting fine and processing events normally.
ID: 45841 · Report as offensive     Reply Quote
tullio

Send message
Joined: 19 Feb 08
Posts: 704
Credit: 4,332,741
RAC: 21
Message 45842 - Posted: 9 Dec 2021, 14:01:23 UTC

I have 4 tasks running on 2 CPUs. They seem OK.
Tullio
ID: 45842 · Report as offensive     Reply Quote
Jonathan

Send message
Joined: 25 Sep 17
Posts: 76
Credit: 2,265,620
RAC: 4,850
Message 45843 - Posted: 9 Dec 2021, 14:36:47 UTC

I have 13 valid tasks as of this morning. Everything has been running smoothly besides the one task yesterday that just hung and got a computation error.

I think the vboxwrapper should be updated but developers need to figure out where the wrapper is getting its naming as the version of the file name doesn't exactly match the version reported in the task logs. i.e. vboxwrapper_26203_windows_x86_64.exe (file version 7.17.26202.0) reports as 26202 and Atlas wrapper vboxwrapper_26198ab7_windows_x86_64.exe (file version 7.7.26197.0) reports as 26197. Is this something in the source code and compilation? My finding are only on Windows as I don't have a Linux setup to check.

If anyone has tasks that just hang without processing, I would recomend posting the stderr.txt log and the task name details so those can be, possibly, followed up on to see if anything is common. I don't see any of this as a VirtualBox problem besides getting the best wrapper to minimize Postponed / lost communication issues.
ID: 45843 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1543
Credit: 52,460,004
RAC: 35,728
Message 45845 - Posted: 9 Dec 2021, 17:17:47 UTC - in response to Message 45843.  

.If anyone has tasks that just hang without processing, I would recomend posting the stderr.txt log and the task name details so those can be, possibly, followed up on to see if anything is common.
here is my latest one from this morning:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=335568182

all others from the day(s) before failed with the same pattern.
ID: 45845 · Report as offensive     Reply Quote
Jonathan

Send message
Joined: 25 Sep 17
Posts: 76
Credit: 2,265,620
RAC: 4,850
Message 45846 - Posted: 9 Dec 2021, 17:41:37 UTC - in response to Message 45845.  

Yours all seemed to stop shortly after the 'Checking CVMFS' section.
I see 'Radical Guest time change' in successful tasks so I don't think that is the problem issue.
ID: 45846 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 1669
Credit: 98,243,549
RAC: 307,106
Message 45855 - Posted: 12 Dec 2021, 6:04:38 UTC

Crystal have tested the newest vboxwrapper:
2021-12-09 10:44:25 (3732): Detected: vboxwrapper 26202
2021-12-09 10:44:25 (3732): Detected: BOINC client v7.16.20
2021-12-09 10:44:26 (3732): Detected: VirtualBox VboxManage Interface (Version: 6.1.30)

We have no more problems with multicore Atlas-Tasks in Windows.

David wrote:
Could there be some Windows/Vbox update causing the problems? I can update the vboxwrapper version used by ATLAS if someone confirms that this fixes the problems.

It would be useful to see the same vboxwrapper for all vbox-projects (Theory, CMS and Atlas).
ID: 45855 · Report as offensive     Reply Quote
Jonathan

Send message
Joined: 25 Sep 17
Posts: 76
Credit: 2,265,620
RAC: 4,850
Message 45856 - Posted: 12 Dec 2021, 13:52:06 UTC - in response to Message 45855.  

26202 is a problem wrapper per the link https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables as it uses the COM interface.
26203 is reporting in the logs as '26202' and really should be reporting its own version number.

The bad WUs in this forum thread took care of themselves. It looked to be more an error of CVMFS within the VM hanging and the processing never starting.
ID: 45856 · Report as offensive     Reply Quote
CloverField

Send message
Joined: 17 Oct 06
Posts: 64
Credit: 37,165,584
RAC: 44,267
Message 45857 - Posted: 13 Dec 2021, 1:53:35 UTC - in response to Message 45856.  

26202 is a problem wrapper per the link https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables as it uses the COM interface.
26203 is reporting in the logs as '26202' and really should be reporting its own version number.

The bad WUs in this forum thread took care of themselves. It looked to be more an error of CVMFS within the VM hanging and the processing never starting.


So are we good to restart ATLAS multicore again? Or will we need to get this new vbox wrapper to get ATLAS working?
ID: 45857 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 1669
Credit: 98,243,549
RAC: 307,106
Message 45858 - Posted: 13 Dec 2021, 1:59:04 UTC - in response to Message 45857.  

The question was coming from David, so we have to wait for a answer from the Atlas-Team.
ID: 45858 · Report as offensive     Reply Quote
Jonathan

Send message
Joined: 25 Sep 17
Posts: 76
Credit: 2,265,620
RAC: 4,850
Message 45859 - Posted: 13 Dec 2021, 4:14:21 UTC - in response to Message 45857.  

I have been running Multi core Atlas tasks all weekend just fine.
ID: 45859 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Send message
Joined: 13 May 14
Posts: 378
Credit: 14,413,006
RAC: 6,915
Message 45911 - Posted: 21 Dec 2021, 10:59:45 UTC - in response to Message 45858.  

The question was coming from David, so we have to wait for a answer from the Atlas-Team.


There were some issues with central databases at CERN around the time the problems were being reported, so that could have been the cause of the stuck or failing WU.

However I see that the vboxwrapper we use is from 2017(!) and should be updated. I have updated the ATLAS app on the LHC-dev project to use the latest version so please give it a try if you have an account there. If it looks good I will update it here, but probably not before the new year.
ID: 45911 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1543
Credit: 52,460,004
RAC: 35,728
Message 45929 - Posted: 22 Dec 2021, 11:51:06 UTC
Last modified: 22 Dec 2021, 12:16:27 UTC

A few minutes ago, I downloaded about 10 tasks, and one after the other failed within a few seconds with "computation error".
Also, another task which has run for 57 minutes failed with "computation error"

-185 (0xFFFFFF47) ERR_RESULT_START

stderr:

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
Couldn't copy file: fwrite() failed</message>
]]>

https://lhcathome.cern.ch/lhcathome/result.php?resultid=337659577
ID: 45929 · Report as offensive     Reply Quote
tullio

Send message
Joined: 19 Feb 08
Posts: 704
Credit: 4,332,741
RAC: 21
Message 45933 - Posted: 22 Dec 2021, 15:59:47 UTC

At QuChemPedIA@home upgrading BOINC to 7.16.20 solved a problem with security certificates.
Tullio
ID: 45933 · Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · Next

Message boards : ATLAS application : Bad WUs?


©2022 CERN