21) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45895)
Posted 18 Dec 2021 by Jonathan
Post:
Have you tried working through the Yeti checklist? You can also look inside the VM on its different virtual terminals for more information.

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4302
22) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45891)
Posted 18 Dec 2021 by Jonathan
Post:
Tullio, please post at the Rosetta forum if you need help.
23) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45888)
Posted 17 Dec 2021 by Jonathan
Post:
The current Rosetta VM is created with 6 GB of RAM and an 8 GB hard disk.
24) Message boards : ATLAS application : Why don't you give my computer a task? (Message 45877)
Posted 16 Dec 2021 by Jonathan
Post:
How are you allocating cores to these work units? Are you using "app_config.xml" or the website's project preferences such as "Max # CPUs" and "Max # jobs"?
Do you have any other BOINC projects that use VirtualBox, and are they working?
25) Questions and Answers : Windows : CMS Simulation work and all of them get stuck on random % and all of them got aborted. (Message 45867)
Posted 15 Dec 2021 by Jonathan
Post:
Are you still having trouble? We can only go off the information you provide.
26) Message boards : ATLAS application : Bad WUs? (Message 45859)
Posted 13 Dec 2021 by Jonathan
Post:
I have been running multi-core ATLAS tasks all weekend without any problems.
27) Message boards : ATLAS application : Bad WUs? (Message 45856)
Posted 12 Dec 2021 by Jonathan
Post:
26202 is a problematic wrapper according to https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables, as it uses the COM interface.
26203 reports itself in the logs as '26202' when it really should report its own version number.

The bad WUs in this forum thread resolved themselves. It looked more like CVMFS hanging within the VM, so the processing never started.
28) Message boards : ATLAS application : Bad WUs? (Message 45846)
Posted 9 Dec 2021 by Jonathan
Post:
All of yours seemed to stop shortly after the 'Checking CVMFS' step.
I also see 'Radical guest time change' in successful tasks, so I don't think that is the problem.
29) Message boards : ATLAS application : Bad WUs? (Message 45843)
Posted 9 Dec 2021 by Jonathan
Post:
I have 13 valid tasks as of this morning. Everything has been running smoothly besides the one task yesterday that just hung and got a computation error.

I think the vboxwrapper should be updated, but the developers need to figure out where the wrapper gets its version number, because the version in the file name doesn't match the version reported in the task logs. For example, vboxwrapper_26203_windows_x86_64.exe (file version 7.17.26202.0) reports as 26202, and the ATLAS wrapper vboxwrapper_26198ab7_windows_x86_64.exe (file version 7.7.26197.0) reports as 26197. Is this something in the source code or the build? My findings are only on Windows, as I don't have a Linux setup to check.
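A minimal Python sketch of the pattern described above, assuming (from the two examples only, not from the wrapper source) that the wrapper reports the third field of its Windows file version rather than the number in its file name. The function names are hypothetical; the file names and file versions are the ones quoted in the post.

```python
import re


def filename_build(filename: str) -> str:
    """Return the build tag embedded in a vboxwrapper file name, e.g. '26203' or '26198ab7'."""
    m = re.match(r"vboxwrapper_([0-9a-z]+)_", filename)
    if m is None:
        raise ValueError(f"unrecognised wrapper file name: {filename}")
    return m.group(1)


def reported_build(file_version: str) -> str:
    """ASSUMED rule: the number in the task log is the third field of the file version."""
    parts = file_version.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a four-part file version, got: {file_version}")
    return parts[2]


def check_wrapper(filename: str, file_version: str) -> str:
    """Compare the file-name tag with the number the file version suggests the log will show."""
    name_tag = filename_build(filename)
    reported = reported_build(file_version)
    status = "matches" if name_tag == reported else "MISMATCH"
    return f"{filename}: name says {name_tag}, file version suggests {reported} ({status})"


for fname, fver in [
    ("vboxwrapper_26203_windows_x86_64.exe", "7.17.26202.0"),
    ("vboxwrapper_26198ab7_windows_x86_64.exe", "7.7.26197.0"),
]:
    print(check_wrapper(fname, fver))
```

Both examples come out as a mismatch, which is consistent with the logs but does not explain where in the build the number comes from.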

If anyone has tasks that hang without processing, I would recommend posting the stderr.txt log and the task name details so they can possibly be followed up on to see whether anything is common. I don't see any of this as a VirtualBox problem, apart from choosing the best wrapper to minimize Postponed / lost communication issues.
30) Message boards : ATLAS application : Bad WUs? (Message 45837)
Posted 9 Dec 2021 by Jonathan
Post:
I tried exiting BOINC Manager and its tasks, then restarting BOINC. The work unit failed.

Task https://lhcathome.cern.ch/lhcathome/result.php?resultid=335523334

Workunit https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=177601336
31) Message boards : ATLAS application : Bad WUs? (Message 45835)
Posted 9 Dec 2021 by Jonathan
Post:
stderr.txt log from the stuck work unit:

2021-12-08 18:59:59 (25680): Detected: vboxwrapper 26202
2021-12-08 18:59:59 (25680): Detected: BOINC client v7.16.20
2021-12-08 18:59:59 (25680): Detected: VirtualBox VboxManage Interface (Version: 6.1.30)
2021-12-08 18:59:59 (25680): Successfully copied 'init_data.xml' to the shared directory.
2021-12-08 19:00:00 (25680): Create VM. (boinc_493aaf73f80ced28, slot#1)
2021-12-08 19:00:01 (25680): Setting Memory Size for VM. (6600MB)
2021-12-08 19:00:01 (25680): Setting CPU Count for VM. (4)
2021-12-08 19:00:01 (25680): Setting Chipset Options for VM.
2021-12-08 19:00:02 (25680): Setting Boot Options for VM.
2021-12-08 19:00:02 (25680): Setting Network Configuration for NAT.
2021-12-08 19:00:02 (25680): Enabling VM Network Access.
2021-12-08 19:00:03 (25680): Disabling USB Support for VM.
2021-12-08 19:00:03 (25680): Disabling COM Port Support for VM.
2021-12-08 19:00:03 (25680): Disabling LPT Port Support for VM.
2021-12-08 19:00:03 (25680): Disabling Audio Support for VM.
2021-12-08 19:00:04 (25680): Disabling Clipboard Support for VM.
2021-12-08 19:00:04 (25680): Disabling Drag and Drop Support for VM.
2021-12-08 19:00:04 (25680): Adding storage controller(s) to VM.
2021-12-08 19:00:04 (25680): Adding virtual disk drive to VM. (vm_image.vdi)
2021-12-08 19:00:05 (25680): Adding VirtualBox Guest Additions to VM.
2021-12-08 19:00:05 (25680): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2021-12-08 19:00:05 (25680): forwarding host port 59563 to guest port 80
2021-12-08 19:00:05 (25680): Enabling remote desktop for VM.
2021-12-08 19:00:06 (25680): Enabling shared directory for VM.
2021-12-08 19:00:06 (25680): Starting VM using VBoxManage interface. (boinc_493aaf73f80ced28, slot#1)
2021-12-08 19:00:11 (25680): Successfully started VM. (PID = '1956')
2021-12-08 19:00:11 (25680): Reporting VM Process ID to BOINC.
2021-12-08 19:00:11 (25680): Guest Log: BIOS: VirtualBox 6.1.30
2021-12-08 19:00:11 (25680): Guest Log: CPUID EDX: 0x178bfbff
2021-12-08 19:00:11 (25680): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2021-12-08 19:00:11 (25680): VM state change detected. (old = 'poweredoff', new = 'running')
2021-12-08 19:00:11 (25680): Detected: Web Application Enabled (http://localhost:59563)
2021-12-08 19:00:11 (25680): Detected: Remote Desktop Enabled (localhost:59564)
2021-12-08 19:00:11 (25680): Preference change detected
2021-12-08 19:00:11 (25680): Setting CPU throttle for VM. (100%)
2021-12-08 19:00:11 (25680): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 900 seconds))
2021-12-08 19:00:13 (25680): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2021-12-08 19:00:13 (25680): Guest Log: BIOS: Booting from Hard Disk...
2021-12-08 19:00:15 (25680): Guest Log: BIOS: KBD: unsupported int 16h function 03
2021-12-08 19:00:15 (25680): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f
2021-12-08 19:00:15 (25680): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f
2021-12-08 19:00:19 (25680): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds
2021-12-08 19:00:19 (25680): Guest Log: vboxguest: misc device minor 58, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2021-12-08 19:00:22 (25680): Guest Log: Checking CVMFS...
2021-12-08 19:00:24 (25680): Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000155 main     Log opened 2021-12-08T19:00:22.038429000Z
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000262 main     OS Product: Linux
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000297 main     OS Release: 3.10.0-957.27.2.el7.x86_64
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000325 main     OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000353 main     Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000354 main     Process ID: 1492
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.000354 main     Package type: LINUX_64BITS_GENERIC
2021-12-08 19:00:24 (25680): Guest Log: 00:00:00.001280 main     5.2.32 r132073 started. Verbose level = 0
2021-12-08 19:00:34 (25680): Guest Log: 00:00:10.010700 timesync vgsvcTimeSyncWorker: Radical guest time change: 21 611 928 059 000ns (GuestNow=1 639 011 633 968 992 000 ns GuestLast=1 638 990 022 040 933 000 ns fSetTimeLastLoop=true )
2021-12-08 20:40:19 (25680): Status Report: Elapsed Time: '6000.000000'
2021-12-08 20:40:19 (25680): Status Report: CPU Time: '45.843750'


I looked at my successful tasks, and it looks like the stuck task is hanging at the CVMFS check or at one of the steps that follow it: mounting the shared directory and copying the input files into RunAtlas.
Below is the same section from a successful task, covering the steps where I think the stuck task may be failing or hanging. Just a theory.


2021-12-08 16:23:47 (11704): Guest Log: CVMFS is ok
2021-12-08 16:23:47 (11704): Guest Log: Mounting shared directory
2021-12-08 16:23:47 (11704): Guest Log: Copying input files
2021-12-08 16:23:49 (11704): Guest Log: Copied input files into RunAtlas.
2021-12-08 16:23:50 (11704): Guest Log: copied the webapp to /var/www
2021-12-08 16:23:50 (11704): Guest Log: This VM did not configure a local http proxy via BOINC.
2021-12-08 16:23:50 (11704): Guest Log: Small home clusters do not require a local http proxy but it is suggested if
2021-12-08 16:23:50 (11704): Guest Log: more than 10 cores throughout the same LAN segment are regularly running ATLAS like tasks.
2021-12-08 16:23:50 (11704): Guest Log: Further information can be found at the LHC@home message board.
2021-12-08 16:23:50 (11704): Guest Log: Running cvmfs_config stat atlas.cern.ch
2021-12-08 16:23:50 (11704): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2021-12-08 16:23:50 (11704): Guest Log: 2.6.3.0 1617 0 29828 97456 3 1 1491998 4096001 0 65024 0 102 98.0392 533 455 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1
2021-12-08 16:23:50 (11704): Guest Log: ATHENA_PROC_NUMBER=4
2021-12-08 16:23:51 (11704): Guest Log:  *** Starting ATLAS job. (PandaID=5285080426 taskID=27554476) ***
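To compare a stuck task against this successful one, a small Python sketch can report the last of these Guest Log milestones reached in a stderr.txt and the next one expected. The milestone strings and their order are copied from the successful log above; treating them as the standard ATLAS startup sequence is an assumption, and `last_milestone` is a hypothetical helper, not part of BOINC or vboxwrapper.

```python
import re

# Guest Log messages seen, in this order, in the successful task's stderr.txt above.
MILESTONES = [
    "Checking CVMFS...",
    "CVMFS is ok",
    "Mounting shared directory",
    "Copied input files into RunAtlas.",
    "*** Starting ATLAS job.",
]

# "<date> <time> (<pid>): Guest Log: <message>"
GUEST_LINE = re.compile(r"^\S+ \S+ \(\d+\): Guest Log: (.*)$")


def last_milestone(stderr_text: str):
    """Return (furthest milestone reached or None, next expected milestone or None)."""
    reached = -1
    for line in stderr_text.splitlines():
        m = GUEST_LINE.match(line.strip())
        if not m:
            continue  # skip vboxwrapper lines such as "Status Report"
        message = m.group(1).strip()
        for i, marker in enumerate(MILESTONES):
            if i > reached and message.startswith(marker):
                reached = i
    last = MILESTONES[reached] if reached >= 0 else None
    nxt = MILESTONES[reached + 1] if reached + 1 < len(MILESTONES) else None
    return last, nxt
```

Run against the stuck task's log, this should report "Checking CVMFS..." as the last milestone with "CVMFS is ok" still pending, which matches the theory above. To use it, read the task's stderr.txt into a string and pass it to `last_milestone`.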
32) Message boards : ATLAS application : Bad WUs? (Message 45834)
Posted 9 Dec 2021 by Jonathan
Post:
The next VM started up, but it isn't running anything inside. All three virtual terminals look identical: Atlas Event Process Monitoring is not starting on virtual terminal 2, and top is not starting on virtual terminal 3.
I'm guessing something is broken inside the VM.

The hung work unit is below. I am just going to leave it stuck for now.

boinc_493aaf73f80ced28
Application: ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Name: k5pLDmaoCF0nfZGDcpSWOuwoABFKDmABFKDmQ2DbDmABFKDmy3TEWm
State: Running
Received: 12/8/2021 4:18:10 PM
Report deadline: 12/15/2021 4:18:08 PM
Resources: 4 CPUs
Estimated computation size: 43,200 GFLOPs
CPU time: 00:00:32
CPU time since checkpoint: 00:00:00
Elapsed time: 00:21:07
Estimated time remaining: 01:57:53
Fraction done: 15.195%
Virtual memory size: 80.82 MB
Working set size: 6.45 GB
Directory: slots/1
Process ID: 25680
Progress rate: 43.200% per hour
Executable: vboxwrapper_26198ab7_windows_x86_64.exe
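The properties above already show the symptom: 32 seconds of CPU time after 21 minutes on 4 CPUs. A short Python sketch that turns those fields into a CPU efficiency figure, assuming efficiency is defined here as CPU time divided by (elapsed time times allocated cores); the helper names are hypothetical.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert a BOINC Manager 'HH:MM:SS' field to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def cpu_efficiency(cpu_time: str, elapsed: str, ncpus: int) -> float:
    """Fraction of the allocated cores actually used: CPU time / (elapsed * cores)."""
    wall = hms_to_seconds(elapsed)
    if wall <= 0 or ncpus <= 0:
        raise ValueError("elapsed time and core count must be positive")
    return hms_to_seconds(cpu_time) / (wall * ncpus)


# Values copied from the task properties above.
eff = cpu_efficiency("00:00:32", "00:21:07", 4)
print(f"CPU efficiency: {eff:.2%}")
```

A value well under 1% this far into a multi-core task suggests nothing is running inside the VM, even though "Fraction done" keeps climbing.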
33) Message boards : ATLAS application : Bad WUs? (Message 45833)
Posted 9 Dec 2021 by Jonathan
Post:
Task just completed.
I was able to run two 4-core tasks at once, using 8 processor cores. I left SMT on, so 8 of 16 logical processors were in use. I only have 16 GB of RAM, and it was almost all in use because each task takes 6600 MB of memory.
The second task should finish in about 45 more minutes, and I don't see any problems. Only LHC / ATLAS was running, with no other projects or work.

I will just let the machine continue and see if it gets any trouble work units.
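The memory arithmetic above can be sketched in Python. The sizing rule 3000 MB + 900 MB per core is an assumption chosen because it reproduces the 6600 MB logged for a 4-core VM ("Setting Memory Size for VM. (6600MB)"); it is not taken from the project's configuration. The 2048 MB left for the host is also a guess, and both function names are hypothetical.

```python
def atlas_vm_memory_mb(cores: int) -> int:
    """ASSUMED sizing rule: 3000 MB base + 900 MB per core (gives 6600 MB for 4 cores)."""
    return 3000 + 900 * cores


def max_concurrent(total_ram_mb: int, cores_per_task: int, reserve_mb: int = 2048) -> int:
    """How many ATLAS VMs fit after leaving reserve_mb for the host OS (reserve is a guess)."""
    usable = total_ram_mb - reserve_mb
    return max(0, usable // atlas_vm_memory_mb(cores_per_task))


total = 16 * 1024  # 16 GB machine from the post
n = max_concurrent(total, 4)
print(f"{n} x 4-core tasks, {n * atlas_vm_memory_mb(4)} MB of {total} MB")
```

With two 4-core tasks this comes to 13200 MB of 16384 MB, which fits the "almost all in use" observation once the host's own usage is added.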
34) Message boards : ATLAS application : Bad WUs? (Message 45832)
Posted 8 Dec 2021 by Jonathan
Post:
My 4 core task is behaving normally.
I think I got the wrapper changed, as the log shows "2021-12-08 16:22:04 (11704): Detected: vboxwrapper 26202". That is the new 26203 misreporting the number, as usual.

About 25 minutes into the work unit, all 4 athena.py processes are running. Virtual consoles 2 and 3 look normal.
35) Questions and Answers : Windows : CMS Simulation work and all of them get stuck on random % and all of them got aborted. (Message 45786)
Posted 5 Dec 2021 by Jonathan
Post:
Use your LHC@home preferences to limit the number of work units.
Set Max # jobs to no more than the number of true (physical) cores your processor has.
VirtualBox jobs run at normal priority, not at the lower priority used by legacy BOINC jobs.
36) Questions and Answers : Preferences : errors (Message 45782)
Posted 4 Dec 2021 by Jonathan
Post:
I would guess a permissions error or some other error related to VirtualBox.
Try the latest version:
https://www.virtualbox.org/wiki/Downloads
37) Questions and Answers : Unix/Linux : ATLAS seems to block my BOINC (Message 45777)
Posted 2 Dec 2021 by Jonathan
Post:
That checklist is a very good idea.
I am going to guess that the VM created for ATLAS is taking about 10 GB of RAM and that the number of processors within the VM is set to 8.

Go to your LHC preferences and set the following. This will run one ATLAS task with the minimum amount of RAM required and create a VM with a single processor. You will probably have to set No new tasks for LHC, abort any current ATLAS tasks, and exit and restart BOINC after changing your preferences.

Max # jobs 1
Max # CPUs 1

I looked up your processor, and you have 4 true cores. That is the highest you should set Max # CPUs to.

You might need to look at a different subproject rather than ATLAS.
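The core-count advice here and in the CMS reply (Max # jobs no higher than the true core count) can be folded into a quick Python check. Treating both as one rule, total cores requested (jobs times CPUs per job) not exceeding physical cores, is my assumption and a rule of thumb, not an official LHC@home limit; `check_preferences` is a hypothetical name.

```python
def check_preferences(max_jobs: int, max_cpus: int, physical_cores: int) -> list[str]:
    """Return warnings for Max # jobs / Max # CPUs settings (rule of thumb, not an official limit)."""
    warnings = []
    if max_cpus > physical_cores:
        warnings.append(f"Max # CPUs {max_cpus} exceeds the {physical_cores} physical cores")
    if max_jobs * max_cpus > physical_cores:
        warnings.append(
            f"{max_jobs} jobs x {max_cpus} CPUs = {max_jobs * max_cpus} cores, "
            f"more than the {physical_cores} physical cores"
        )
    return warnings


print(check_preferences(1, 1, 4))  # the settings suggested above, on 4 true cores
print(check_preferences(1, 8, 4))  # an 8-CPU VM on the same processor
```

An empty list means the settings stay within the physical cores; the 8-CPU case produces both warnings.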
38) Message boards : ATLAS application : 100% CPU Use (Message 45765)
Posted 30 Nov 2021 by Jonathan
Post:
That sounds like normal behavior for VirtualBox projects. You can set Max # CPUs to get a lower CPU count per task and Max # jobs to limit the number of running tasks. You can also control this behavior using the app_config.xml method.

VirtualBox tasks run at normal priority and need some time to shut down and save the running VM state. I just exit BOINC Manager and let it kill the processes, give it a minute or two, and then reboot or shut down.
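For the app_config.xml method, a Python sketch that writes a minimal file using BOINC's `<app_config>`, `<app>`, `<max_concurrent>` and `<app_version>` elements. Assumptions to check against your own client_state.xml: the app short name "ATLAS", the plan class `vbox64_mt_mcore_atlas` (taken from the task properties earlier in this list), and the `--nthreads` command-line option for multi-core VM apps. The resulting file goes in the project's folder under BOINC's projects directory.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, max_concurrent: int, ncpus: int) -> str:
    """Build a minimal app_config.xml limiting running tasks and cores per task."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name                    # ASSUMED short app name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    ver = ET.SubElement(root, "app_version")
    ET.SubElement(ver, "app_name").text = app_name
    ET.SubElement(ver, "plan_class").text = plan_class
    ET.SubElement(ver, "avg_ncpus").text = str(ncpus)              # cores BOINC reserves per task
    ET.SubElement(ver, "cmdline").text = f"--nthreads {ncpus}"     # ASSUMED wrapper option

    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas", max_concurrent=1, ncpus=4))
```

After saving the file, use "Read config files" in BOINC Manager (or restart BOINC); tasks that are already running keep the core count they started with.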
39) Message boards : Theory Application : All errors on my new Ryzen (Message 44352)
Posted 21 Feb 2021 by Jonathan
Post:
Do you have any other projects on that machine that use VirtualBox, and are they working?
Per the VirtualBox website, the 6.1 series is the only VirtualBox version still supported.
You can try uninstalling and reinstalling, or moving to version 6.1. Also try opening VirtualBox and watching the Theory task start up; it may give some clues.
40) Message boards : Theory Application : Running.log only available to download (Message 44044)
Posted 1 Jan 2021 by Jonathan
Post:
Try going into Firefox settings: General tab, Applications section. Check whether log or .log is listed there.




©2024 CERN