Message boards :
ATLAS application :
Bad WUs?
Message board moderation
Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · Next
Author | Message |
---|---|
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
Next VM started up but it isn't running anything inside. All three virtual terminals look identical. Not getting Atlas Event Process Monitoring start up on virt term 2 nor TOP on virt term 3. Guessing something broken inside VM. Hung work unit below. I am just going to leave it stuck for now. boinc_493aaf73f80ced28 Application ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas) Name k5pLDmaoCF0nfZGDcpSWOuwoABFKDmABFKDmQ2DbDmABFKDmy3TEWm State Running Received 12/8/2021 4:18:10 PM Report deadline 12/15/2021 4:18:08 PM Resources 4 CPUs Estimated computation size 43,200 GFLOPs CPU time 00:00:32 CPU time since checkpoint 00:00:00 Elapsed time 00:21:07 Estimated time remaining 01:57:53 Fraction done 15.195% Virtual memory size 80.82 MB Working set size 6.45 GB Directory slots/1 Process ID 25680 Progress rate 43.200% per hour Executable vboxwrapper_26198ab7_windows_x86_64.exe |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
stderr.txt log from stuck work unit 2021-12-08 18:59:59 (25680): Detected: vboxwrapper 26202 2021-12-08 18:59:59 (25680): Detected: BOINC client v7.16.20 2021-12-08 18:59:59 (25680): Detected: VirtualBox VboxManage Interface (Version: 6.1.30) 2021-12-08 18:59:59 (25680): Successfully copied 'init_data.xml' to the shared directory. 2021-12-08 19:00:00 (25680): Create VM. (boinc_493aaf73f80ced28, slot#1) 2021-12-08 19:00:01 (25680): Setting Memory Size for VM. (6600MB) 2021-12-08 19:00:01 (25680): Setting CPU Count for VM. (4) 2021-12-08 19:00:01 (25680): Setting Chipset Options for VM. 2021-12-08 19:00:02 (25680): Setting Boot Options for VM. 2021-12-08 19:00:02 (25680): Setting Network Configuration for NAT. 2021-12-08 19:00:02 (25680): Enabling VM Network Access. 2021-12-08 19:00:03 (25680): Disabling USB Support for VM. 2021-12-08 19:00:03 (25680): Disabling COM Port Support for VM. 2021-12-08 19:00:03 (25680): Disabling LPT Port Support for VM. 2021-12-08 19:00:03 (25680): Disabling Audio Support for VM. 2021-12-08 19:00:04 (25680): Disabling Clipboard Support for VM. 2021-12-08 19:00:04 (25680): Disabling Drag and Drop Support for VM. 2021-12-08 19:00:04 (25680): Adding storage controller(s) to VM. 2021-12-08 19:00:04 (25680): Adding virtual disk drive to VM. (vm_image.vdi) 2021-12-08 19:00:05 (25680): Adding VirtualBox Guest Additions to VM. 2021-12-08 19:00:05 (25680): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB) 2021-12-08 19:00:05 (25680): forwarding host port 59563 to guest port 80 2021-12-08 19:00:05 (25680): Enabling remote desktop for VM. 2021-12-08 19:00:06 (25680): Enabling shared directory for VM. 2021-12-08 19:00:06 (25680): Starting VM using VBoxManage interface. (boinc_493aaf73f80ced28, slot#1) 2021-12-08 19:00:11 (25680): Successfully started VM. (PID = '1956') 2021-12-08 19:00:11 (25680): Reporting VM Process ID to BOINC. 2021-12-08 19:00:11 (25680): Guest Log: BIOS: VirtualBox 6.1.30 2021-12-08 19:00:11 (25680): Guest Log: CPUID EDX: 0x178bfbff 2021-12-08 19:00:11 (25680): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63 2021-12-08 19:00:11 (25680): VM state change detected. (old = 'poweredoff', new = 'running') 2021-12-08 19:00:11 (25680): Detected: Web Application Enabled (http://localhost:59563) 2021-12-08 19:00:11 (25680): Detected: Remote Desktop Enabled (localhost:59564) 2021-12-08 19:00:11 (25680): Preference change detected 2021-12-08 19:00:11 (25680): Setting CPU throttle for VM. (100%) 2021-12-08 19:00:11 (25680): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 900 seconds)) 2021-12-08 19:00:13 (25680): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032 2021-12-08 19:00:13 (25680): Guest Log: BIOS: Booting from Hard Disk... 2021-12-08 19:00:15 (25680): Guest Log: BIOS: KBD: unsupported int 16h function 03 2021-12-08 19:00:15 (25680): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f 2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f 2021-12-08 19:00:19 (25680): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds 2021-12-08 19:00:19 (25680): Guest Log: vboxguest: misc device minor 58, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000) 2021-12-08 19:00:22 (25680): Guest Log: Checking CVMFS... 2021-12-08 19:00:24 (25680): Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000155 main Log opened 2021-12-08T19:00:22.038429000Z 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000262 main OS Product: Linux 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000297 main OS Release: 3.10.0-957.27.2.el7.x86_64 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000325 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000353 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000354 main Process ID: 1492 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000354 main Package type: LINUX_64BITS_GENERIC 2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.001280 main 5.2.32 r132073 started. Verbose level = 0 2021-12-08 19:00:34 (25680): Guest Log: 00:00:10.010700 timesync vgsvcTimeSyncWorker: Radical guest time change: 21 611 928 059 000ns (GuestNow=1 639 011 633 968 992 000 ns GuestLast=1 638 990 022 040 933 000 ns fSetTimeLastLoop=true ) 2021-12-08 20:40:19 (25680): Status Report: Elapsed Time: '6000.000000' 2021-12-08 20:40:19 (25680): Status Report: CPU Time: '45.843750' I looked at my successful tasks and it looks like it is hanging at the CVMFS check or something, mounting shared files and then copying files into RunAtlas. Below is from a successful task where I think it may be failing or getting hung. Just a theory. 2021-12-08 16:23:47 (11704): Guest Log: CVMFS is ok 2021-12-08 16:23:47 (11704): Guest Log: Mounting shared directory 2021-12-08 16:23:47 (11704): Guest Log: Copying input files 2021-12-08 16:23:49 (11704): Guest Log: Copied input files into RunAtlas. 2021-12-08 16:23:50 (11704): Guest Log: copied the webapp to /var/www 2021-12-08 16:23:50 (11704): Guest Log: This VM did not configure a local http proxy via BOINC. 2021-12-08 16:23:50 (11704): Guest Log: Small home clusters do not require a local http proxy but it is suggested if 2021-12-08 16:23:50 (11704): Guest Log: more than 10 cores throughout the same LAN segment are regularly running ATLAS like tasks. 2021-12-08 16:23:50 (11704): Guest Log: Further information can be found at the LHC@home message board. 2021-12-08 16:23:50 (11704): Guest Log: Running cvmfs_config stat atlas.cern.ch 2021-12-08 16:23:50 (11704): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE 2021-12-08 16:23:50 (11704): Guest Log: 2.6.3.0 1617 0 29828 97456 3 1 1491998 4096001 0 65024 0 102 98.0392 533 455 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1 2021-12-08 16:23:50 (11704): Guest Log: ATHENA_PROC_NUMBER=4 2021-12-08 16:23:51 (11704): Guest Log: *** Starting ATLAS job. (PandaID=5285080426 taskID=27554476) *** |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 307 |
Single Core running is ok with using vboxwrapper from Atlas-Task.https://lhcathome.cern.ch/lhcathome/results.php?hostid=10664116 This PC using Win10-Workstation and not Win10pro. Have to upgrade in the next few hours my last RYZEN from Win10pro to Win11pro. After this, will testing Atlas Multicore with default vboxwrapper from Atlas in Win11pro. |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
Tried exiting Boinc Manager and tasks and restarting Boinc. Work unit failed Task https://lhcathome.cern.ch/lhcathome/result.php?resultid=335523334 Workunit https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=177601336 |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 23,290 |
I usually run ATLAS native singlecore but to test what happens I started an ATLAS native 4-core. https://lhcathome.cern.ch/lhcathome/result.php?resultid=335511517 The 4-core task finished last night and got credits but since the HITS file was among the files that were written to the wrong directory level stderr.txt reported this: [2021-12-09 03:45:01] No HITS result produced Early this morning the same computer/BOINC client got another 2 4-core tasks (native). So far both are running fine and keep all files within their own \slots\ directory. |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
I'vw killed three tasks on my 2 cores but one of them was doing good. The other two used very little CPU. The one good had a CPU time twice the running time. Tullio |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 307 |
Win11pro successful installed.https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10631979 Two 2-core Atlas-Tasks are running. vboxwrapper 26198ab7 (default from Atlas-task). |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 852 |
Could anybody running one of the affected Windows computers try out the vboxwrapper that comes with CMS?After reading the problem here, I just tried one task yesterday. I not used the actual vboxwrapper from the project, but an older version of 8-8-2016 https://lhcathome.cern.ch/lhcathome/result.php?resultid=335442116 The only thing I noticed was, it seems that it lasted longer before the 4 athena processes started (about 5 minutes) Edit:As requested by computezrmle: I started an ATLAS task with the newest wrapper v26203 (used by CMS). Starting fine and processing events normally. |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
I have 4 tasks running on 2 CPUs. They seem OK. Tullio |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
I have 13 valid tasks as of this morning. Everything has been running smoothly besides the one task yesterday that just hung and got a computation error. I think the vboxwrapper should be updated but developers need to figure out where the wrapper is getting its naming as the version of the file name doesn't exactly match the version reported in the task logs. i.e. vboxwrapper_26203_windows_x86_64.exe (file version 7.17.26202.0) reports as 26202 and Atlas wrapper vboxwrapper_26198ab7_windows_x86_64.exe (file version 7.7.26197.0) reports as 26197. Is this something in the source code and compilation? My finding are only on Windows as I don't have a Linux setup to check. If anyone has tasks that just hang without processing, I would recomend posting the stderr.txt log and the task name details so those can be, possibly, followed up on to see if anything is common. I don't see any of this as a VirtualBox problem besides getting the best wrapper to minimize Postponed / lost communication issues. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,983,656 RAC: 18,514 |
.If anyone has tasks that just hang without processing, I would recomend posting the stderr.txt log and the task name details so those can be, possibly, followed up on to see if anything is common.here is my latest one from this morning: https://lhcathome.cern.ch/lhcathome/result.php?resultid=335568182 all others from the day(s) before failed with the same pattern. |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
Yours all seemed to stop shortly after the 'Checking CVMFS' section. I see 'Radical Guest time change' in successful tasks so I don't think that is the problem issue. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 307 |
Crystal have tested the newest vboxwrapper: 2021-12-09 10:44:25 (3732): Detected: vboxwrapper 26202 2021-12-09 10:44:25 (3732): Detected: BOINC client v7.16.20 2021-12-09 10:44:26 (3732): Detected: VirtualBox VboxManage Interface (Version: 6.1.30) We have no more problems with multicore Atlas-Tasks in Windows. David wrote: Could there be some Windows/Vbox update causing the problems? I can update the vboxwrapper version used by ATLAS if someone confirms that this fixes the problems. It would be useful to see the same vboxwrapper for all vbox-projects (Theory, CMS and Atlas). |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
26202 is a problem wrapper per the link https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables as it uses the COM interface. 26203 is reporting in the logs as '26202' and really should be reporting its own version number. The bad WUs in this forum thread took care of themselves. It looked to be more an error of CVMFS within the VM hanging and the processing never starting. |
Send message Joined: 17 Oct 06 Posts: 89 Credit: 57,163,734 RAC: 3,524 |
26202 is a problem wrapper per the link https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables as it uses the COM interface. So are we good to restart ATLAS multicore again? Or will we need to get this new vbox wrapper to get ATLAS working? |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 307 |
The question was coming from David, so we have to wait for a answer from the Atlas-Team. |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
I have been running Multi core Atlas tasks all weekend just fine. |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
The question was coming from David, so we have to wait for a answer from the Atlas-Team. There were some issues with central databases at CERN around the time the problems were being reported, so that could have been the cause of the stuck or failing WU. However I see that the vboxwrapper we use is from 2017(!) and should be updated. I have updated the ATLAS app on the LHC-dev project to use the latest version so please give it a try if you have an account there. If it looks good I will update it here, but probably not before the new year. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,983,656 RAC: 18,514 |
A few minutes ago, I downloaded about 10 tasks, and one after the other failed within a few seconds with "computation error". Also, another task which has run for 57 minutes failed with "computation error" -185 (0xFFFFFF47) ERR_RESULT_START stderr: <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Couldn't copy file: fwrite() failed</message> ]]> https://lhcathome.cern.ch/lhcathome/result.php?resultid=337659577 |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
At QuChemPedIA@home upgrading BOINC to 7.16.20 solved a problem with security certificates. Tullio |
©2024 CERN