1) Message boards : Theory Application : Theory Error fail to compile yoda2flat-split (Message 49040)
Posted 15 Dec 2023 by broz69
Post:
Hi,

I see some errors in Theory jobs (Win, VBox) on my computer (hostid= 10834815).

g++: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by g++)
make: *** [yoda2flat-split.exe] Error 1
make: Leaving directory `/shared/rivetvm'
ERROR: fail to compile yoda2flat-split


Is anything happening with the VM?

Best regards.
2) Questions and Answers : Windows : Theory and CMS jobs fail after resume (Win + VBox) (Message 48971)
Posted 30 Nov 2023 by broz69
Post:
Hi,

Thank you for this answer. I can't say that I like it. I think that 8 is not a lot of tasks, compared to others that have even more cores... And I was also checking other users with similar configurations and all of them had the same issues at some time in the past. So I can't say that my problem is unique.
In the meantime I did the following:
- I created a RAM disk with ImDisk (RAM speed is 3200MHz)
- in Oracle VM VirtualBox Manager I created and installed 8 Linux machines (2GB RAM, 1 core) and saved these 8 virtual disks (dynamic size) to RAM disk
- I created a group with all of them
- then I ran all 8 of them at the same time (as a group) - no problem
- I paused all 8 of them - no problem
- I resumed all 8 of them - no problem
- I saved all 8 of them - no problem
- I resumed all 8 of them - one machine was corrupted and did not start correctly
The problem after all these steps was that Oracle VM VirtualBox Manager was consuming 100% of the processor (all 8C/16T at 100%). I had to suspend all VMs and restart the manager.

Now back to my problem. If I suspend and then resume a CMS job, it seems to continue running OK. I tried it and it's OK.
As I see it, the problem lies either in VBox, which cannot handle suspend/resume of "large" numbers of VMs, or in BOINC, which does something that VBox can't handle.
Can't we ask BOINC not to suspend/resume VM jobs all at the same time, but one at a time with some delay between them? Or why does BOINC even do this when we know that VBox can't handle suspend/resume of "large" numbers of VMs at the same time? Or why does BOINC even start Atlas jobs when we know that there'll be problems with the CMS and Theory jobs that are already running?
Or can we ask Oracle to fix VBox Manager?

What's your view on this?

What I'd like to achieve is that Theory and CMS jobs resume without error. When they fail, the time and energy spent on the failed jobs is wasted, and on top of that we don't get any credit ;)

Best regards.
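
A rough sketch of the "one at a time with some delay" idea, done from outside BOINC as a workaround. It assumes that all LHC tasks have first been suspended by hand, that boinccmd is on the PATH, and that the project URL below is the one the client is attached to. The task names and the delay are placeholders, and this is not an official BOINC or LHC@home mechanism.

```python
import subprocess
import time

# Assumption: the URL the BOINC client uses for this project.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def resume_command(task_name, project_url=PROJECT_URL):
    """Build the boinccmd call that resumes a single task."""
    return ["boinccmd", "--task", project_url, task_name, "resume"]


def resume_staggered(task_names, delay_s=120.0,
                     run=subprocess.run, sleep=time.sleep):
    """Resume tasks one by one, waiting delay_s seconds between them.

    run and sleep can be replaced with stand-ins for a dry run.
    Returns the list of commands that were issued, in order.
    """
    issued = []
    for index, name in enumerate(task_names):
        if index > 0:
            sleep(delay_s)  # give VBox time to finish the previous resume
        cmd = resume_command(name)
        run(cmd, check=True)
        issued.append(cmd)
    return issued
```

Calling resume_staggered(["<task name 1>", "<task name 2>"], delay_s=120) would resume the second task two minutes after the first. Whether this avoids the VBoxHeadless crash is untested.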
3) Questions and Answers : Windows : Theory and CMS jobs fail after resume (Win + VBox) (Message 48969)
Posted 29 Nov 2023 by broz69
Post:
Hi,

I thought I could share this with you. I am seeing undesired behaviour on two of my LHC crunching machines. Both of them run Windows 11 with VBox (hostid=10834815 and hostid=10616627).

The situation is as follows:
- some Theory and CMS jobs run
- BOINC fetches some new work and these tasks happen to be Atlas
- when the tasks have downloaded, BOINC decides (probably based on some algorithm connected with deadlines) that the Atlas jobs have priority
- Theory and CMS jobs get either paused or saved in VBox (which is already strange - why are some tasks saved and others paused?)
- Atlas jobs run and finish OK
- after the Atlas jobs finish, some of the old jobs continue OK, others do not and throw this error (on hostid=10834815 only two tasks finished OK, the other 6 threw the error):

CMS_3858007_1701010523.959683 217385371
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint
A breakpoint has been reached.
(0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF.

Click on OK to terminate the program


Theory_2390-1127772-964 217376900
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint
A breakpoint has been reached.
(0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF.

Click on OK to terminate the program


Theory_2390-1104457-964 217377861
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint
A breakpoint has been reached.
(0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF.

Click on OK to terminate the program


Theory_2390-1115150-964 217378166
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint
A breakpoint has been reached.
(0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF.

Click on OK to terminate the program


Theory_2390-1122909-968 217391846
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint
A breakpoint has been reached.
(0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF.

Click on OK to terminate the program


CMS_3856916_1701010223.879077 217385367
---------------------------
VBoxHeadless.exe - Application Error
---------------------------
The exception Breakpoint
A breakpoint has been reached.
(0x80000003) occurred in the application at location 0x00007FFAFFEA4CFF.

Click on OK to terminate the program


BOINC setting "Leave non-GPU tasks in memory while suspended" = false.
I have seen this behaviour for some time now. I have been following it more closely over the last few days and it is always the same (if you check the failed jobs for these two hostids, they always fail because of this switching between different LHC jobs). I'm not using BOINC for any other computation. Linux native seems OK.

Best regards.
4) Message boards : Theory Application : Theory Failure Ratio Explodes (Message 48596)
Posted 20 Sep 2023 by broz69
Post:
Hi,

It's as computezrmle wrote: 2638 jobs fail and 2390 are OK.
In the morning the jobs were not all 2390:
Theory_2638 - 35 failed
Theory_2637 - 26 failed
Theory_2636 - 24 failed
Theory_2390 - 1 OK

In the evening I got some Theory jobs, all of them 2390.
Theory_2390-1109174-576, Theory_2390-1100306-576, Theory_2390-1140982-576, Theory_2390-1099685-576 finished OK without proxy. But so did others with proxy. So it doesn't seem to be a problem with the proxy, VBox or BOINC.

So I'll just wait for the people at LHC to find a solution.

Thanks.
5) Message boards : Theory Application : Theory Failure Ratio Explodes (Message 48592)
Posted 20 Sep 2023 by broz69
Post:
At the moment both my Windows computers are set to "No new tasks".

Only Linux is running (native). Linux only has Theory_2390 jobs.
6) Message boards : Theory Application : Theory Failure Ratio Explodes (Message 48591)
Posted 20 Sep 2023 by broz69
Post:
They're visible now.
7) Message boards : Theory Application : Theory Failure Ratio Explodes (Message 48588)
Posted 20 Sep 2023 by broz69
Post:
Hi,

My Theory jobs are still failing (hostID=10834815).

I see different errors:
runRivet
Setting environment...
grep: /etc/redhat-release: No such file or directory
./runRivet.sh: line 33: /cvmfs/sft.cern.ch/lcg/releases/LCG_102b_ATLAS_28/../gcc/11.3.0/x86_64-slc6/setup.sh: No such file or directory
ERROR: fail to set environment (gcc)

or

make: *** [yoda2flat-split.exe] Error 1
make: Leaving directory `/shared/rivetvm'
ERROR: fail to compile yoda2flat-split

or

just hangs at
Running job shoud appear here.
[INFO] Container 'runc' finished with status code 1.

When I shut down the VM it reports the job as finished (which is strange)...

Is there any special procedure in place to recover from this?

Best regards.
8) Questions and Answers : Getting started : CPU does not have hardware virtualization support? (Message 48422)
Posted 11 Aug 2023 by broz69
Post:
YES! I somehow missed this. Apparently the update changed that. But I somehow missed it in the list...

Thank you for your help!

And have a nice weekend!
9) Questions and Answers : Getting started : CPU does not have hardware virtualization support? (Message 48418)
Posted 10 Aug 2023 by broz69
Post:
Hyper-V is off.
AMD-V is enabled in BIOS.
If I create a new virtual machine in VirtualBox and start it, it starts without problems, no error.
As I said - I was able to run jobs until 15:40 CEST, when the computer rebooted after Windows Update. After that all my Theory jobs failed and I didn't get any new jobs.

I installed a fresh Windows 11 on another machine and I have the same issue. Unfortunately I wasn't fast enough and Windows 11 simply installed all the latest updates...
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10834815
This is from the new machine's BOINC event log:
10/08/2023 22:38:26 | | Processor: 16 AuthenticAMD AMD Ryzen 7 5700G with Radeon Graphics [Family 25 Model 80 Stepping 0]
10/08/2023 22:38:26 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 sse4a osvw wdt topx page1gb rdtscp fsgsbase bmi1 smep bmi2

On the same machine with Windows 10 installed (no BIOS change) it downloads and runs jobs OK
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10834818
10) Questions and Answers : Getting started : CPU does not have hardware virtualization support? (Message 48415)
Posted 10 Aug 2023 by broz69
Post:
Hi,

Something like this happened to me just about 3 hours ago when my computer rebooted after having installed kb5029263 https://support.microsoft.com/en-gb/topic/august-8-2023-kb5029263-os-build-22621-2134-f8d4d3de-47c1-40e1-a2e6-97c2770ee2e8

Today (10th Aug 2023) before 15:40 CEST everything was OK and I was running Theory jobs normally, now I can't get any more jobs.

I tried Yeti's checklist https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161#29359 but no luck.
Host ID https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10616627
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10616627
OS: Win 11 22H2 (OS Build 22621.2134)
AMD Ryzen 5 3400G, 32GB RAM, approx. 140GB disk
BOINC: 7.22.2 (x64)
Virtualbox: 7.0.10 r158379 (Qt5.15.2) with corresponding Extensions installed
No Docker, no Hyper-V, no Virtual Machine Platform, no WSL or WSL2 or any other virtualisation technology
LeoMoon CPU-V shows two green checks AMD-v Supported and AMD-v enabled
SVM is enabled in BIOS, NX is enabled in BIOS
client_state.xml has this <p_vm_extensions_disabled>0</p_vm_extensions_disabled>
antivirus is Microsoft Defender (I tried some things from here https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5797 but no success).

Any other suggestion?

Best regards
11) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42790)
Posted 3 Jun 2020 by broz69
Post:
Found this in another thread - it worked for me.

I analyzed the ca-bundle.crt file and found out that AddTrust External Root certificate expired today.
I removed the expired certificate part from the file and now everything works normal for me again.

Here is a guide to a quick fix:
Back up all your sensitive data first. This has only been tested on 1 computer so far.
Exit BOINC
Open file manager and go to C:\Program Files\BOINC or wherever you have installed BOINC.
Make a backup copy of ca-bundle.crt just in case my instructions screw something up.
Right click on ca-bundle.crt and open it with Notepad
Scroll down to AddTrust External Root. Below this is the expired certificate.
Delete everything from -----BEGIN CERTIFICATE----- to -----END CERTIFICATE----- including the begin and end lines.
Save the file
Start BOINC and try again.

happy crunching



Hi,

I did this and it solved the issue. No need to update the BOINC client (but I guess that updating is the preferred long-term solution). If you don't want to update then just delete the expired root CA from ca-bundle.crt and it'll be OK.
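
For anyone who prefers not to edit the file by hand, here is a minimal sketch of the same step in Python. It assumes the bundle uses the usual layout (a label line, then the PEM block that belongs to it) and that the label matches the line in the file exactly. The function name is made up for this example. Back up ca-bundle.crt and exit BOINC first, as in the guide above.

```python
BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def remove_cert_after_label(bundle_text, label="AddTrust External Root"):
    """Return bundle_text without the first PEM block that follows label.

    Only the lines from BEGIN to END (inclusive) are removed, as in the
    manual guide. If the label or a complete block is not found, the
    text is returned unchanged.
    """
    lines = bundle_text.splitlines(keepends=True)

    # Locate the label line (exact match, so similarly named roots stay).
    label_idx = next(
        (i for i, line in enumerate(lines) if line.strip() == label), None)
    if label_idx is None:
        return bundle_text

    # Locate the first certificate block after the label.
    begin_idx = next(
        (i for i in range(label_idx, len(lines))
         if lines[i].strip() == BEGIN), None)
    if begin_idx is None:
        return bundle_text
    end_idx = next(
        (i for i in range(begin_idx, len(lines))
         if lines[i].strip() == END), None)
    if end_idx is None:
        return bundle_text

    return "".join(lines[:begin_idx] + lines[end_idx + 1:])
```

To use it, read ca-bundle.crt from the BOINC installation directory as text, pass the content to the function and write the result back to the same file.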
12) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42723)
Posted 31 May 2020 by broz69
Post:
Thank you. I found it in the directory where boinc.exe is located. I replaced it with the one from GitHub, restarted the BOINC client and got the same result:

31/05/2020 17:43:20 | LHC@home | Sending scheduler request: Requested by user.
31/05/2020 17:43:20 | LHC@home | Reporting 73 completed tasks
31/05/2020 17:43:20 | LHC@home | Requesting new tasks for CPU and AMD/ATI GPU
31/05/2020 17:43:21 | LHC@home | Scheduler request failed: Peer certificate cannot be authenticated with given CA certificates
31/05/2020 17:43:23 | | Project communication failed: attempting access to reference site
31/05/2020 17:43:25 | | Internet access OK - project servers may be temporarily down.
31/05/2020 17:44:42 | LHC@home | Fetching scheduler list
31/05/2020 17:44:44 | | Project communication failed: attempting access to reference site
31/05/2020 17:44:45 | | Internet access OK - project servers may be temporarily down.

I compared the two ca-bundle.crt files and the content is exactly the same (apart from date and time modified).
13) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42721)
Posted 31 May 2020 by broz69
Post:
OK.

I downloaded the file ca-bundle.crt from GitHub, put it in the BOINC directory, restarted the BOINC client and still get the same error "31/05/2020 13:47:03 | LHC@home | Scheduler request failed: Peer certificate cannot be authenticated with given CA certificates"

31/05/2020 13:46:50 | | Starting BOINC client version 7.16.5 for windows_x86_64
31/05/2020 13:46:50 | | log flags: file_xfer, sched_ops, task
31/05/2020 13:46:50 | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2s zlib/1.2.8

What else can I do?
14) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42719)
Posted 31 May 2020 by broz69
Post:
Hi,

I don't have ca-bundle.crt in the BOINC directory on my Windows 10 computer. So where do root certificates come from in this case?

What is weird is that some ATLAS jobs uploaded their results to LHC but the jobs in BOINC still show "Ready to report"...

So what else can I do?
15) Message boards : CMS Application : CMS&Atlas host disk problem (Message 41834)
Posted 6 Mar 2020 by broz69
Post:
Hello,

I still have problems with LHC@Home, mainly with Atlas and CMS VBox tasks. The problem lies in how the combination of BOINC and LHC tasks works with different disks.
Computer ID: 10570926
8 processors allowed (meaning 8 simultaneous tasks with single processor)

Here's what I've found so far. I broke the whole process down into steps:
1. BOINC contacts the server and downloads tasks (in the case of LHC it downloads many tasks - like 8 or so - at the same time)
2. BOINC starts the task or tasks (depending on whether they are multi-threaded or not)
3. the LHC task first copies the disk image to the BOINC/slots/ directory
4. after the image is copied it registers a VM in VBox Manager and sets up parameters (base memory, processors, attached disks etc.)
5. the VM starts the boot-up process
6. the VM starts and does its work
7. the VM finishes the work and shuts down
8. after the VM shuts down there is an extra 5-6 min during which I don't know exactly what's going on (there's very little CPU activity and no disk or network activity... some kind of result preparation, I think?)
9. then the VM is deregistered from VBox Manager and a computation error comes up in BOINC Manager (this error is not so important right now)
10. the result is reported to the LHC server

In my case:
step 1 is not critical as the internet connection is slower than the disk's data rate
step 2 - after the jobs downloaded, BOINC Manager started 8 CMS tasks at the same time (see below for detailed analysis)
The Atlas disk image is around 2.54 GB, the CMS disk image is around 2.8 GB.
Starting eight Atlas or CMS jobs at the same time is not advisable in my case, as writing 8 VM disk images to the BOINC/slots/ directory completely overwhelms the disk for a long time. The disk cannot handle so many write requests.
As can be seen below, different disks have different write queues. SSDs and even SD cards can handle 8 write requests, but HDDs cannot.
Is it possible to do one of the following:
    increase the time-out in the VM boot-up process. When the VM starts (step 5 above) it looks for a boot disk. If the disk is not there or for some reason not yet ready (host disk still busy with write operations), the VM ends up in the rescue console with the error message "Unable to mount root device /dev/disk/by-label/UROOT!" A longer time-out would avoid this situation.


or

    introduce a parameter and a mechanism in BOINC that would start VBox tasks with a certain delay (step 2 above). This would allow the host disk to finish its write operations, so that the boot disk is ready when the VM starts.



The situation described here occurs not only when starting new jobs but also when BOINC switches between jobs. It arose when BOINC switched from a Milkyway@home N-Body Simulation 1.76 (mt) 8 CPU task to 8 single CPU CMS tasks.
Or when BOINC is switching from 8 single CPU CMS tasks to one 8 CPU task. In this case VBox needs to pause 8 VMs and again the disk is 100% active for a long period.

The CMS tasks below are listed in the order in which BOINC started them.

CMS - 2 VMs started at the same time, 2 failed tasks - BOINC/slots/ on HDD
HDD: WD7500AADS-00M2B0 (SATA2, 3Gbps), write speed 60MBps
CMS_3945628_1583445017.758018_0
CMS_3945649_1583445017.880033_0

CMS - 8 VMs started at the same time, 5 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 5:30min 100% active time
CMS_66148_1583476897.820370_0 - run
CMS_77654_1583478099.044842_0 - failed
CMS_95128_1583479600.855668_0 - failed
CMS_136087_1583483505.320420_0 - failed
CMS_121063_1583482003.641115_0 - failed
CMS_153320_1583485307.708883_0 - failed
CMS_199719_1583489516.247794_0 - run
CMS_180480_1583487712.076585_0 - run
Failed VMs ended up in rescue console.
Failed VMs were later reset in VBox Manager and then ran for around 18 min. 5 minutes later (23 min from start) there was a Computation error in BOINC Manager.

CMS - 8 VMs started at the same time, 7 failed - BOINC/slots/ on SATA HDD
HDD: WD7500BPKX-75HPJT0 (SATA3, 6Gbps), write speed between 60-80MBps, 6:15min 100% active time
CMS_139026_1583483805.806430_0 - failed
CMS_196459_1583489215.997645_0 - failed
CMS_196461_1583489216.009555_0 - failed
CMS_124781_1583482304.049943_0 - failed
CMS_150047_1583485007.485152_0 - failed
CMS_139018_1583483805.751000_0 - failed
CMS_150051_1583485007.517990_0 - failed
CMS_171598_1583486810.747297_0 - run
Failed VMs ended up in rescue console and were left in that state.
After 20 min from start they were cancelled (by BOINC? LHC?) and after 25 min from start there were Computation Errors in BOINC Manager.

CMS - 8 VMs started at the same time, 8 non failed tasks - BOINC/slots/ on USB SD/MMC card
SD card is SanDisk Extreme PRO microSDXC UHS-I, 128GB (up-to 90MBps write speed, up-to 170MBps read speed)
USB 3.0 (5Gbps), the card didn't show up in Windows Task Manager so I couldn't measure times or write speeds.

CMS_3974215_1583448025.207379_0
CMS_3974237_1583448025.327024_0
CMS_3922210_1583442611.886740_0
CMS_3922207_1583442611.871799_0
CMS_3971270_1583447724.078240_0
CMS_3922219_1583442611.906543_0
CMS_3910580_1583441410.016243_0
CMS_113714_1583481402.844904_0
The start-up sequence of the above tasks was linear: tasks started one after the other, with approx. 60 sec from one start to the next.

CMS - 8 VMs started at the same time, 8 non failed - BOINC/slots/ on SATA3 SSD
SSD: Samsung SSD 850 EVO (SATA3, 6Gbps), write speed 500MBps, around 60-75s 100% active time
CMS_100680_1583480201.538717_0
CMS_132622_1583483204.854043_0
CMS_121061_1583482003.629244_0
CMS_139006_1583483805.668606_0
CMS_130340_1583482904.579675_0
CMS_136085_1583483505.295626_0
CMS_117122_1583481703.003336_0
CMS_95126_1583479600.843608_0
tasks ran for around 13 min and then the VMs powered off
after around 18 min from start Computation Error in Boinc Manager
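
As a rough cross-check of the numbers above, the time needed just to write 8 CMS images can be estimated from the image size and the write speeds quoted in this post. This is a back-of-the-envelope sketch that assumes purely sequential writes at the quoted speed. It ignores the read of the source image and any seek overhead from 8 writers competing for the disk, so real times will be longer.

```python
def copy_time_s(n_vms, image_gb, write_mb_per_s):
    """Seconds needed to write n_vms images of image_gb (decimal GB) each."""
    return n_vms * image_gb * 1000 / write_mb_per_s


CMS_IMAGE_GB = 2.8  # image size quoted in the post

for label, speed in [("HDD at 60 MB/s", 60),
                     ("HDD at 80 MB/s", 80),
                     ("SSD at 500 MB/s", 500)]:
    t = copy_time_s(8, CMS_IMAGE_GB, speed)
    print(f"{label}: {t:.0f} s ({t / 60:.1f} min)")
```

The estimate gives roughly 4.7 to 6.2 minutes for the HDD and about 45 seconds for the SSD, which is in the same range as the 5:30 to 6:15 min and 60-75 s of 100% active time measured above.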

16) Message boards : ATLAS application : ATLAS problem - long running but not using any CPU (Message 41610)
Posted 17 Feb 2020 by broz69
Post:
Hi,

Another thing that I noticed - an ATLAS job needs more than 160 sec from the moment I resume it in BOINC Manager until it starts running. During this time the disk is 100% active, since BOINC reads the image from BOINC\projects\lhcathome.cern.ch_lhcathome and copies it to BOINC\slots. All the read and write operations are on one physical disk. The disk is a WDC WD3200BEVT-22ZCT0.
A solution for me would be if I could somehow tell VBox/BOINC that the repository of LHC images is on one disk and the working set of disks is somewhere else. I have three other hard disks that I could use to spread the disk load...

Best regards.
17) Message boards : ATLAS application : ATLAS problem - long running but not using any CPU (Message 41608)
Posted 16 Feb 2020 by broz69
Post:
Hello,

Thank you for your answer. This behaviour is exactly what I've seen in the last 4 hours, with the test machine on and no shutdown. BOINC Manager was starting some VBox VMs and stopping others, and in the meantime made a bit of a mess. I have 2 Theory jobs with status "Postponed"; in VBox Manager they are defined/created, but only partially - neither of them has a disk attached - under "Storage" the disk part is empty.

Then at a certain point BOINC Manager decided to switch jobs from Theory to ATLAS. So it paused all Theory jobs and started an ATLAS job. I now have one ATLAS job (d4HODmjyBNwn9Rq4apoT9bVoABFKDmABFKDmbQmVDmABFKDmypIWIn_1) that is defined in VBox, and when it's running I can see three different screens in the VM console in BOINC Manager using alt+F1, alt+F2 and alt+F3. The only problem is that alt+F2 (ATLAS Event Progress Monitoring) shows a progress screen where all the numbers are N/A. It seems that the VM started but somehow failed to trigger the start of the calculations. BOINC Manager shows the job as running.

I can't say that the BOINC Manager behaviour you describe is desirable. But at least I know I have to be careful when shutting down the computer.

Thank you for your effort and explaining this to me.


Correction - it seems the ATLAS job needed almost 20 min to get its data, and while I was writing the answer above it started crunching numbers. So it's not stalled...
18) Message boards : ATLAS application : ATLAS problem - long running but not using any CPU (Message 41607)
Posted 16 Feb 2020 by broz69
Post:
Hello,

Thank you for your answer. This behaviour is exactly what I've seen in the last 4 hours, with the test machine on and no shutdown. BOINC Manager was starting some VBox VMs and stopping others, and in the meantime made a bit of a mess. I have 2 Theory jobs with status "Postponed"; in VBox Manager they are defined/created, but only partially - neither of them has a disk attached - under "Storage" the disk part is empty.

Then at a certain point BOINC Manager decided to switch jobs from Theory to ATLAS. So it paused all Theory jobs and started an ATLAS job. I now have one ATLAS job (d4HODmjyBNwn9Rq4apoT9bVoABFKDmABFKDmbQmVDmABFKDmypIWIn_1) that is defined in VBox, and when it's running I can see three different screens in the VM console in BOINC Manager using alt+F1, alt+F2 and alt+F3. The only problem is that alt+F2 (ATLAS Event Progress Monitoring) shows a progress screen where all the numbers are N/A. It seems that the VM started but somehow failed to trigger the start of the calculations. BOINC Manager shows the job as running.

I can't say that the BOINC Manager behaviour you describe is desirable. But at least I know I have to be careful when shutting down the computer.

Thank you for your effort and explaining this to me.
19) Message boards : ATLAS application : ATLAS problem - long running but not using any CPU (Message 41605)
Posted 16 Feb 2020 by broz69
Post:
2020-02-16 03:04:18 (9184): Required extension pack not installed, remote desktop not enabled.
You need to install the Extension Pack in VirtualBox to see more.
Yeti's Checklist in the Atlas folder is also very useful for your first experience with Atlas.
When an Atlas task has been running for more than 4-6 hours, there is something wrong with the installation.


Hi again,

Last weekend I observed the following behaviour on computer ID 10570926. The computer was shut down and the next morning I turned it on. The shutdown procedure was nothing special (I didn't do anything special to the LHC jobs running through VBox). There were some Theory and CMS jobs running at the time when I initiated the shutdown. When the machine came up all the VMs started at the same time. I checked the VM console and all of them were in the emergency shell. I aborted the jobs (all of them at the same time). That's when the ATLAS jobs started, all at the same time. After a while I checked the VM console in BOINC and all of them were in the emergency shell. I aborted the jobs. This was Feb 9.

This weekend I activated my testing machine ID: 10616627. I installed a fresh Win10 1903 build 18362.657, BOINC 7.14.2 (x64) and VBox 6.1.2 r135662 (Qt5.6.2). When I pressed Allow new tasks, BOINC downloaded about 16 Theory jobs and 4 ATLAS jobs. It started 4 Theory jobs at the same time. I checked the VM console in BOINC and 4 jobs were in the emergency shell:
* Welcome to micro-Cern-VM
* Release 2018.10-1.cernvm.x86_64

[INF] Loading predefined modules... check
[INF] Starting networking... check
[INF] Getting time from pool.ntp.org... check
[INF] Mounting root filesystem...mount: mounting /dev/disk/by-label/UROOT on /root.rw failed: Input/output error
[ERR] Unable to mount root device /dev/disk/by-label/UROOT!
[INF] Entering rescue console
etc...

And this was exactly the same behaviour on both machines! Both machines have a SATA disk for LHC. One has a 750GB WD Black and the other a 320GB WD Blue (on a Standard SATA AHCI Controller, driver from Microsoft ver. 10.0.18362.1). My guess is that starting many VMs at the same time produces some kind of I/O errors and the VMs then simply stay in that state. BOINC doesn't know this and simply lets them run forever.

Is there any setting that I can use to delay starting the VMs? It seems that starting many VMs at the same time produces I/O errors. It would be interesting to know whether it's VBox, or whether the OS in the VM has some time-outs that are too low...

On my test machine Theory VM needs around 60-80 sec to copy the VDI image and then another 10-20 sec to start running. So in my case 120 sec time between starting different VMs would be OK. On the other hand ATLAS has bigger VM image so it takes a bit more time. The only thing is I don't know where to set it up - if it's even possible. I know it's possible to do it in Hyper-V but I don't know how to do it in BOINC/VBox combo...
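
To make the 120 sec figure concrete: taking the worst case from the timings above (80 sec copy plus 20 sec boot) and adding a safety margin gives the gap between VM starts. The 20 sec margin and the function name are assumptions for illustration only. Neither BOINC nor VBox is known to offer such a setting.

```python
def start_offsets(n_vms, copy_s, boot_s, margin_s=0):
    """Start offsets (seconds from t=0) so that VM starts never overlap."""
    gap = copy_s + boot_s + margin_s
    return [i * gap for i in range(n_vms)]


# Worst case for a Theory VM on the test machine: 80 s copy + 20 s boot.
# margin_s=20 is an assumed safety margin, giving the 120 s gap.
print(start_offsets(4, copy_s=80, boot_s=20, margin_s=20))
```

For the 4 Theory jobs that BOINC started together, this would spread the starts over 6 minutes instead of launching them all at once.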
20) Message boards : ATLAS application : ATLAS problem - long running but not using any CPU (Message 41500)
Posted 9 Feb 2020 by broz69
Post:
This logfile states that your VMs use only 2241MB although a 4-core setup requires 6600MB:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=262028302
2020-02-05 11:49:38 (6456): Setting Memory Size for VM. (2241MB)
2020-02-05 11:49:39 (6456): Setting CPU Count for VM. (4)

You may switch to ALT-F3 of a running VM to check what RAM size is reported by top.
A local app_config.xml might set this wrong value.


Even your linux computer reports weird logfile entries:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=262024433
2020-02-05 15:27:27 (23810): Detected: VirtualBox VboxManage Interface (Version: 5.2.34)
.
.
.
2020-02-05 15:27:32 (23810): Guest Log: BIOS: VirtualBox 5.2.33

You may upgrade VirtualBox or at least keep the version of your vbox additions in sync with your main VirtualBox version.

Hi,

I figured out what was wrong. It happened this morning with WU ID 132365918.
My computer is not on 24/7. Every evening, instead of shutting it down, I use the sleep function. This morning after I turned the computer on I checked the BOINC queue and all the machines that were running (there were 8 of them, all LHC jobs). The ATLAS job for WU ID 132365918 was in emergency mode. BOINC Manager was showing it as running but the VM console showed a lot of I/O errors on dev sda. The host disk (the physical disk in the host system) seems OK, no errors there. So it must have been something connected with how the VM reacts to waking from sleep. Maybe the host disk was not ready yet when BOINC or VBox were already waking up the VM with the ATLAS job? The funny part was that it happened only to the ATLAS job; the 7 Theory jobs seem to be OK.


Hi,

I have another 2 ATLAS jobs that seem to be stalled - running but using no CPU - WU IDs 132365983 and 132365606.
Last night I shut down the machine. This morning I switched it back on and checked the LHC jobs. At the moment there seem to be 5 Theory jobs and two ATLAS (2 CPUs each) jobs running. That would mean that BOINC is using 9 CPUs, which is a bit funny as I only allow 8 CPUs to be used and this hasn't changed since Friday.

current app_config (last changed 6-2-2020 12:13):
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2</cmdline>
  </app_version>
</app_config>


current ATLAS_vbox_2.00_job (last changed 7-2-2020 10:09):
<vbox_job>
  <os_name>Linux26_64</os_name>
  <memory_size_mb>5120</memory_size_mb>
  <enable_network/>
  <enable_remotedesktop/>
  <enable_shared_directory/>
  <copy_to_shared>init_data.xml</copy_to_shared>
  <completion_trigger_file>atlas_done</completion_trigger_file>
  <disable_automatic_checkpoints/> 
  <enable_vm_savestate_usage/>
  <minimum_checkpoint_interval>900</minimum_checkpoint_interval>
  <pf_guest_port>80</pf_guest_port>
</vbox_job>
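
The CPU arithmetic from this post can be reproduced from the app_config above. This is only a sketch: it assumes each Theory job counts as 1 CPU (which the "9 CPUs" figure implies) and it says nothing about how the BOINC scheduler actually makes its decision.

```python
import xml.etree.ElementTree as ET

# The app_config.xml quoted in the post.
APP_CONFIG = """\
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2</cmdline>
  </app_version>
</app_config>
"""


def atlas_limits(xml_text):
    """Return (max concurrent ATLAS tasks, CPUs per ATLAS task)."""
    root = ET.fromstring(xml_text)
    max_concurrent = int(root.findtext("app/max_concurrent"))
    ncpus = float(root.findtext("app_version/avg_ncpus"))
    return max_concurrent, ncpus


ALLOWED_CPUS = 8   # CPU limit set in BOINC, as stated in the post
THEORY_CPUS = 1    # assumption: one CPU per Theory job

max_atlas, atlas_ncpus = atlas_limits(APP_CONFIG)
observed = 5 * THEORY_CPUS + 2 * atlas_ncpus
print(f"ATLAS alone may use up to {max_atlas * atlas_ncpus:.0f} CPUs")
print(f"Observed load: {observed:.0f} CPUs, allowed: {ALLOWED_CPUS}")
```

With 5 Theory and 2 ATLAS jobs running, the sum is 9 CPUs against a limit of 8, matching what was observed.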




©2024 CERN