1)
Message boards :
Theory Application :
Theory Error fail to compile yoda2flat-split
(Message 49040)
Posted 15 Dec 2023 by broz69 Post: Hi, I see some errors in Theory jobs (Win, VBox) on my computer (hostid= 10834815).
g++: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by g++)
make: *** [yoda2flat-split.exe] Error 1
make: Leaving directory `/shared/rivetvm'
ERROR: fail to compile yoda2flat-split
Is anything happening with the VM? Best regards. |
2)
Questions and Answers :
Windows :
Theory and CMS jobs fail after resume (Win + VBox)
(Message 48971)
Posted 30 Nov 2023 by broz69 Post: Hi, Thank you for this answer. I can't say that I like it. I think that 8 is not a lot of tasks, compared to others that have even more cores... I was also checking other users with similar configurations, and all of them had the same issues at some time in the past. So I can't say that my problem is unique.
In the meantime I did the following:
- I created a RAM disk with ImDisk (RAM speed is 3200MHz)
- in Oracle VM VirtualBox Manager I created and installed 8 Linux machines (2GB RAM, 1 core) and saved these 8 virtual disks (dynamic size) to the RAM disk
- I created a group with all of them
- then I ran all 8 of them at the same time (as a group) - no problem
- I paused all 8 of them - no problem
- I resumed all 8 of them - no problem
- I saved all 8 of them - no problem
- I resumed all 8 of them - one machine was corrupted and did not start correctly
The problem after all these steps was that Oracle VM VirtualBox Manager was consuming 100% of the processor (all 8C/16T at 100%). I had to suspend all VMs and restart the manager.
Now back to my problem. If I suspend and then resume a single CMS job, it seems to continue running OK. I tried it and it's OK. As I see it, the problem lies either in VBox, which cannot handle suspend/resume of "large" numbers of VMs, or in Boinc, which does something that VBox can't handle.
Can't we ask Boinc not to suspend/resume VM jobs all at the same time, but in steps of 1 with some delay between them? Or why does Boinc even do this when we know that VBox can't handle suspend/resume of "large" numbers of VMs at the same time? Or why does Boinc even start Atlas jobs when we know that there'll be problems with the CMS and Theory jobs that are running? Or can we ask Oracle to fix VBox Manager? What's your view on this?
The problem I'd like to solve is that Theory and CMS jobs should resume without error. In case of an error, the time and energy spent on failed jobs are wasted, and on top of that we don't get any credits ;) Best regards. |
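The "steps of 1 with some delay" idea from the post above can be sketched roughly as follows. This is only an illustration: BOINC's wrapper controls the VMs itself and, as far as the post shows, offers no such option. The function names and the 60-second default delay are hypothetical. `VBoxManage controlvm <name> resume` is the VirtualBox command for resuming a paused VM; a VM restored from a saved state would need a different command.

```python
import time
from typing import Callable, Iterable, List


def vboxmanage_resume_command(vm_name: str) -> List[str]:
    """Build the VBoxManage command line that resumes one paused VM."""
    return ["VBoxManage", "controlvm", vm_name, "resume"]


def resume_staggered(
    vm_names: Iterable[str],
    resume_one: Callable[[str], None],
    delay_s: float = 60.0,  # hypothetical value, not taken from the post
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Resume VMs one at a time, waiting delay_s between consecutive VMs."""
    resumed: List[str] = []
    for index, name in enumerate(vm_names):
        if index > 0:
            sleep(delay_s)  # wait only between VMs, not before the first
        resume_one(name)
        resumed.append(name)
    return resumed


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    resume_staggered(
        ["vm-1", "vm-2", "vm-3"],
        lambda name: print(" ".join(vboxmanage_resume_command(name))),
        delay_s=0.0,
    )
```

Passing `resume_one` and `sleep` as parameters keeps the sketch testable without VirtualBox installed; a real run would replace the `print` with a `subprocess.run` call.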
3)
Questions and Answers :
Windows :
Theory and CMS jobs fail after resume (Win + VBox)
(Message 48969)
Posted 29 Nov 2023 by broz69 Post: Hi, I thought I could share this with you. I am seeing undesired behaviour on two of my LHC crunching machines, both running Windows 11 with VBox (hostid= 10834815 and hostid=10616627). The situation is as follows:
- some Theory and CMS jobs run
- BOINC fetches some new work and these tasks happen to be Atlas
- when the tasks have downloaded, BOINC decides (probably based on some algorithm connected with the deadline) that the Atlas jobs have priority
- Theory and CMS jobs get either paused or saved in VBox (which is already strange - why are some tasks saved and others paused?)
- Atlas jobs run and finish OK
- after the Atlas jobs finish, some old jobs continue OK and some don't and throw this error (on hostid=10834815 only two tasks finished OK, the other 6 threw the error):
BOINC setting "Leave non-GPU tasks in memory while suspended" = false. I have seen this behaviour for some time now and I'm following it more closely in the last few days and it is always the same (if you check failed jobs for these two hostids they are always because of this switching between different LHC jobs). I'm not using BOINC for any other computation. Linux native seems OK. Best regards. |
4)
Message boards :
Theory Application :
Theory Failure Ratio Explodes
(Message 48596)
Posted 20 Sep 2023 by broz69 Post: Hi, It's what computezrmle wrote: 2638 jobs fail and 2390 are OK. In the morning the jobs were not all 2390:
Theory_2638 - 35 failed
Theory_2637 - 26 failed
Theory_2636 - 24 failed
Theory_2390 - 1 OK
In the evening I got some Theory jobs, all of them 2390. Theory_2390-1109174-576, Theory_2390-1100306-576, Theory_2390-1140982-576 and Theory_2390-1099685-576 finished OK without proxy. But so did others with proxy. So it doesn't seem to be a problem with the proxy, VBox or Boinc. I'll just wait until the people at LHC find a solution. Thanks. |
5)
Message boards :
Theory Application :
Theory Failure Ratio Explodes
(Message 48592)
Posted 20 Sep 2023 by broz69 Post: At the moment both my Windows computers are set to "No new tasks". Only Linux is running (native). Linux only has Theory_2390 jobs. |
6)
Message boards :
Theory Application :
Theory Failure Ratio Explodes
(Message 48591)
Posted 20 Sep 2023 by broz69 Post: They're visible now. |
7)
Message boards :
Theory Application :
Theory Failure Ratio Explodes
(Message 48588)
Posted 20 Sep 2023 by broz69 Post: Hi, My Theory jobs are still failing (hostID=10834815). Different errors:
runRivet Setting environment...
grep: /etc/redhat-release: No such file or directory
./runRivet.sh: line 33: /cvmfs/sft.cern.ch/lcg/releases/LCG_102b_ATLAS_28/../gcc/11.3.0/x86_64-slc6/setup.sh: No such file or directory
ERROR: fail to set environment (gcc)
or
make: *** [yoda2flat-split.exe] Error 1
make: Leaving directory `/shared/rivetvm'
ERROR: fail to compile yoda2flat-split
or it just hangs at
Running job shoud appear here.
[INFO] Container 'runc' finished with status code 1.
When I shut down the VM it reports the job as finished (which is strange)... Is there any special procedure in place to recover from this? Best regards. |
8)
Questions and Answers :
Getting started :
CPU does not have hardware virtualization support?
(Message 48422)
Posted 11 Aug 2023 by broz69 Post: YES! I somehow missed this. Apparently the update changed that, and I overlooked it in the list... Thank you for your help! And have a nice weekend! |
9)
Questions and Answers :
Getting started :
CPU does not have hardware virtualization support?
(Message 48418)
Posted 10 Aug 2023 by broz69 Post: Hyper-V is off. AMD-V is enabled in BIOS. If I create a new virtual machine in Virtualbox and I start it - it starts without problems, no error. As I said - I was able to run jobs until 15:40 CEST when computer rebooted after Windows Update. After that all my Theory jobs failed and I didn't get any new jobs. I installed new windows 11 on another machine and I have the same issue. Unfortunately I wasn't fast enough and win 11 simply installed all the latest updates... https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10834815 This is from new machine's BOINC event log: 10/08/2023 22:38:26 | | Processor: 16 AuthenticAMD AMD Ryzen 7 5700G with Radeon Graphics [Family 25 Model 80 Stepping 0] 10/08/2023 22:38:26 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 sse4a osvw wdt topx page1gb rdtscp fsgsbase bmi1 smep bmi2 On the same machine with windows 10 installed (no BIOS change) it downloads and runs jobs OK https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10834818 |
10)
Questions and Answers :
Getting started :
CPU does not have hardware virtualization support?
(Message 48415)
Posted 10 Aug 2023 by broz69 Post: Hi, Something like this happened to me just about 3 hours ago when my computer rebooted after having installed kb5029263 https://support.microsoft.com/en-gb/topic/august-8-2023-kb5029263-os-build-22621-2134-f8d4d3de-47c1-40e1-a2e6-97c2770ee2e8
Today (10th Aug 2023) before 15:40 CEST everything was OK and I was running Theory jobs normally; now I can't get any more jobs. I tried Yeti's checklist https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161#29359 but no luck.
Host ID: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10616627 https://lhcathome.cern.ch/lhcathome/results.php?hostid=10616627
OS: Win 11 22H2 (OS Build 22621.2134)
Hardware: AMD Ryzen 5 3400G, 32GB RAM, approx. 140GB disk
BOINC: 7.22.2 (x64)
Virtualbox: 7.0.10 r158379 (Qt5.15.2) with corresponding Extensions installed
No Docker, no Hyper-V, no Virtual Machine Platform, no WSL or WSL2 or any other virtualisation technology
LeoMoon CPU-V shows two green checks: AMD-v Supported and AMD-v enabled
SVM is enabled in BIOS, NX is enabled in BIOS
client_state.xml has this: <p_vm_extensions_disabled>0</p_vm_extensions_disabled>
Antivirus is Microsoft Defender (I tried some things from here https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5797 but no success).
Any other suggestion? Best regards |
11)
Message boards :
Number crunching :
Peer certificate cannot be authenticated with given CA certificates
(Message 42790)
Posted 3 Jun 2020 by broz69 Post: found this on another thread - it worked for me. Hi, I did this and it solved the issue. There is no need to update the BOINC client (but I guess that updating is the preferred long-term solution). If you don't want to update, just delete the expired root CA from ca-bundle.crt and it'll be OK. |
12)
Message boards :
Number crunching :
Peer certificate cannot be authenticated with given CA certificates
(Message 42723)
Posted 31 May 2020 by broz69 Post: Thank you. I found it in the directory where boinc.exe is located. I replaced it with the one from GitHub, restarted the BOINC client and got the same result:
31/05/2020 17:43:20 | LHC@home | Sending scheduler request: Requested by user.
31/05/2020 17:43:20 | LHC@home | Reporting 73 completed tasks
31/05/2020 17:43:20 | LHC@home | Requesting new tasks for CPU and AMD/ATI GPU
31/05/2020 17:43:21 | LHC@home | Scheduler request failed: Peer certificate cannot be authenticated with given CA certificates
31/05/2020 17:43:23 | | Project communication failed: attempting access to reference site
31/05/2020 17:43:25 | | Internet access OK - project servers may be temporarily down.
31/05/2020 17:44:42 | LHC@home | Fetching scheduler list
31/05/2020 17:44:44 | | Project communication failed: attempting access to reference site
31/05/2020 17:44:45 | | Internet access OK - project servers may be temporarily down.
I compared the two ca-bundle.crt files and the content is exactly the same (apart from the modification date and time). |
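A comparison like the one described above (same content, different modification time) can be done by hashing both files. The sketch below is a minimal example using only the Python standard library; the file paths in the usage part are placeholders, not the poster's actual locations.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True if both files have byte-identical content (timestamps ignored)."""
    return sha256_of(path_a) == sha256_of(path_b)


if __name__ == "__main__":
    import sys

    # Usage (placeholder paths): python compare.py old/ca-bundle.crt new/ca-bundle.crt
    if len(sys.argv) == 3:
        print("identical" if same_content(sys.argv[1], sys.argv[2]) else "different")
```

If the two bundles are identical, as reported here, swapping one for the other cannot change the outcome, which matches the unchanged error in the log.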
13)
Message boards :
Number crunching :
Peer certificate cannot be authenticated with given CA certificates
(Message 42721)
Posted 31 May 2020 by broz69 Post: OK. I downloaded the file ca-bundle.crt from GitHub, put it in the BOINC directory, restarted the BOINC client and still get the same error "31/05/2020 13:47:03 | LHC@home | Scheduler request failed: Peer certificate cannot be authenticated with given CA certificates"
31/05/2020 13:46:50 | | Starting BOINC client version 7.16.5 for windows_x86_64
31/05/2020 13:46:50 | | log flags: file_xfer, sched_ops, task
31/05/2020 13:46:50 | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2s zlib/1.2.8
What else can I do? |
14)
Message boards :
Number crunching :
Peer certificate cannot be authenticated with given CA certificates
(Message 42719)
Posted 31 May 2020 by broz69 Post: Hi, I don't have ca-bundle.crt in the BOINC directory on my Windows 10 computer. So where do the root certificates come from in this case? What is weird is that some ATLAS jobs uploaded their results to LHC but the job in BOINC still shows "Ready to report"... So what else can I do? |
15)
Message boards :
CMS Application :
CMS&Atlas host disk problem
(Message 41834)
Posted 6 Mar 2020 by broz69 Post: Hello, I still have problems with LHC@Home, mainly with Atlas and CMS VBox tasks. The problem lies in how the combination of Boinc and LHC tasks works with different disks.
Computer ID: 10570926
8 processors allowed (meaning 8 simultaneous single-processor tasks)
Here's what I've found so far. I broke the whole process down into steps:
1. Boinc contacts the server and downloads tasks (in the case of LHC it downloads many tasks - like 8 or so - at the same time)
2. Boinc starts the task or tasks (depending on whether they are multi-threaded or not)
3. the LHC task first copies the disk image to the BOINC/slots/ directory
4. after the image is copied it registers a VM in VBox Manager and sets up parameters (base memory, processors, attached disks etc)
5. the VM starts the boot-up process
6. the VM starts and does its work
7. the VM finishes the work and shuts down
8. after the VM shuts down there is an extra 5-6 min during which I don't know exactly what's going on (there's very little CPU activity and no disk or ethernet activity... I think some kind of result preparation?)
9. then follows VM deregistration from VBox Manager, and a computational error comes up in Boinc Manager (this error is not so important right now)
10. the result is reported to the LHC server
In my case:
step 1 is not critical, as the internet connection is slower than the disk data speed
step 2 - after the jobs had downloaded, Boinc Manager started 8 CMS tasks at the same time (see below for detailed analysis)
The Atlas disk image is around 2,54 GB and the CMS disk image is around 2,8 GB. Starting eight Atlas or CMS jobs at the same time is not advisable in my case, as writing 8 VM disk images to the BOINC/slots/ directory completely overwhelms the disk for a long time. The disk cannot handle so many write requests. As is seen below, different disks have different write queues. SSDs and even SD cards can handle 8 write requests, but HDDs cannot. Is it possible to do one of the following:
|
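To put rough numbers on the disk load described above: with the image sizes from the post (about 2.54 GB for Atlas, 2.8 GB for CMS) and 8 tasks starting together, the total volume to be written can be computed as below. The throughput figure is a hypothetical value chosen only for illustration, and the time it yields is an idealised lower bound for purely sequential writing; eight concurrent copies on one HDD would be slower because of seeking.

```python
def total_copy_gb(image_gb: float, tasks: int) -> float:
    """Total data written when every task copies its own disk image."""
    return image_gb * tasks


def ideal_copy_seconds(image_gb: float, tasks: int, write_mb_per_s: float) -> float:
    """Lower bound on copy time, assuming purely sequential writes (1 GB = 1000 MB)."""
    if write_mb_per_s <= 0:
        raise ValueError("write_mb_per_s must be positive")
    return total_copy_gb(image_gb, tasks) * 1000.0 / write_mb_per_s


if __name__ == "__main__":
    ASSUMED_WRITE_MB_PER_S = 50.0  # hypothetical HDD throughput, not measured
    for app, size_gb in (("Atlas", 2.54), ("CMS", 2.8)):
        gb = total_copy_gb(size_gb, 8)
        secs = ideal_copy_seconds(size_gb, 8, ASSUMED_WRITE_MB_PER_S)
        print(f"{app}: {gb:.2f} GB to write, at least {secs:.0f} s")
```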
16)
Message boards :
ATLAS application :
ATLAS problem - long running but not using any CPU
(Message 41610)
Posted 17 Feb 2020 by broz69 Post: Hi, Another thing that I noticed - an ATLAS job needs more than 160 sec from the moment I resume it in BOINC Manager until it starts running. During this time the disk is active 100% of the time. BOINC reads the image from BOINC\projects\lhcathome.cern.ch_lhcathome and copies it to BOINC\slots, so all the read and write operations are on one physical disk. The disk is a WDC WD3200BEVT-22ZCT0. A solution for me would be if I could somehow tell VBox/BOINC that the repository of LHC images is on one disk and the working set of disks is somewhere else. I have three other hard disks that I could use to spread the disk load... Best regards. |
17)
Message boards :
ATLAS application :
ATLAS problem - long running but not using any CPU
(Message 41608)
Posted 16 Feb 2020 by broz69 Post: Hello, Correction - it seems like ATLAS job needed almost 20 min to get the data and while I was writing the answer above it started crunching numbers. So it's not stalled... |
18)
Message boards :
ATLAS application :
ATLAS problem - long running but not using any CPU
(Message 41607)
Posted 16 Feb 2020 by broz69 Post: Hello, Thank you for your answer. This behaviour is exactly what I've seen in the last 4 hours, with the test machine left on - no shutdown. BOINC Manager was starting some VBox VMs, stopping others and in the meantime made a bit of a mess.
I have 2 Theory jobs that have status "Postponed", and in the VBox manager they are defined/created but only partially - both of them have no disk attached - under "Storage" the disk part is empty.
Then at a certain point BOINC Manager decided to switch jobs from Theory to ATLAS. So it paused all Theory jobs and started an ATLAS job. I now have one ATLAS job (d4HODmjyBNwn9Rq4apoT9bVoABFKDmABFKDmbQmVDmABFKDmypIWIn_1) that is defined in VBox, and when it's running I can see three different screens in the VM console in BOINC Manager using alt+F1, alt+F2 and alt+F3. The only problem is that alt+F2 (ATLAS Event Progress Monitoring) is showing a progress screen where all the numbers are shown as N/A. It seems that the VM started but somehow failed to trigger the start of calculations. BOINC Manager shows the job as running.
I can't say that the BOINC Manager behaviour you describe is desirable. But at least I know I have to be careful when shutting down the computer. Thank you for your effort and for explaining this to me. |
19)
Message boards :
ATLAS application :
ATLAS problem - long running but not using any CPU
(Message 41605)
Posted 16 Feb 2020 by broz69 Post: 2020-02-16 03:04:18 (9184): Required extension pack not installed, remote desktop not enabled.
Hi again, Last weekend I observed the following behaviour on computer ID 10570926. The computer was shut down and the next morning I turned it on. The shutdown procedure was nothing special (I didn't do anything special to the LHC jobs running through VBox). There were some Theory and CMS jobs running at the time when I initiated the shutdown. When the machine came up all the VMs started at the same time. I checked the VM console and all of them were in the emergency shell. I aborted the jobs (all of them at the same time). That's when the ATLAS jobs started, all at the same time. After a while I checked the VM console in BOINC and all of them were in the emergency shell. I aborted the jobs. This was Feb 9.
This weekend I activated my testing machine, ID: 10616627. I installed a new Win10 1903 build 18362.657, BOINC 7.14.2 (x64) and VBox 6.1.2 r135662 (Qt5.6.2). When I pressed Allow new tasks, BOINC downloaded approx. 16 Theory jobs and 4 ATLAS jobs. It started 4 Theory jobs at the same time. I checked the VM console in BOINC and 4 jobs were in the emergency shell:
* Welcome to micro-Cern-VM
* Release 2018.10-1.cernvm.x86_64
[INF] Loading predefined modules... check
[INF] Starting networking... check
[INF] Getting time from pool.ntp.org... check
[INF] Mounting root filesystem...mount: mounting /dev/disk/by-label/UROOT on /root.rw failed: Input/output error
[ERR] Unable to mount root device /dev/disk/by-label/UROOT!
[INF] Entering rescue console
etc...
And this was exactly the same behaviour on both machines! Both machines have a SATA disk for LHC. One has a 750GB WD Black and the other a 320GB WD Blue (on a Standard SATA AHCI Controller, driver from Microsoft ver. 10.0.18362.1).
My guess is that starting many VMs at the same time produces some kind of IO errors and then the VMs simply stay in that state. BOINC doesn't know it and simply lets them run forever.
Is there any setting that I can use to delay starting the VMs? It would be interesting to know whether the cause is VBox itself or time-outs in the OS inside the VM that are too low... On my test machine a Theory VM needs around 60-80 sec to copy the VDI image and then another 10-20 sec to start running. So in my case 120 sec between starting different VMs would be OK. On the other hand, ATLAS has a bigger VM image, so it takes a bit more time. The only thing is that I don't know where to set it up - if it's even possible. I know it's possible to do it in Hyper-V but I don't know how to do it in the BOINC/VBox combo... |
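The timing in the post can be checked with a little arithmetic: a worst case of 80 s for the copy plus 20 s for boot gives 100 s, so a 120 s gap leaves a 20 s margin. The sketch below works through that and lists the start offsets for a batch of VMs. The function names are hypothetical, and nothing here is an existing BOINC or VirtualBox setting.

```python
from typing import List


def worst_case_startup_s(copy_max_s: float, boot_max_s: float) -> float:
    """Longest time one VM needs before it is running (copy + boot)."""
    return copy_max_s + boot_max_s


def start_offsets_s(vm_count: int, gap_s: float) -> List[float]:
    """Start time of each VM, in seconds after the first, with a fixed gap."""
    return [index * gap_s for index in range(vm_count)]


if __name__ == "__main__":
    worst = worst_case_startup_s(80, 20)  # upper ends of the ranges in the post
    gap = 120                             # gap suggested in the post
    print(f"worst-case startup: {worst} s, margin with {gap} s gap: {gap - worst} s")
    print("start offsets for 4 VMs:", start_offsets_s(4, gap))
```

With this gap each VM finishes its image copy before the next one begins, so the disk only ever serves one copy at a time; the cost is that the last of 4 VMs starts 360 s after the first.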
20)
Message boards :
ATLAS application :
ATLAS problem - long running but not using any CPU
(Message 41500)
Posted 9 Feb 2020 by broz69 Post: This logfile states that your VMs use only 2241MB although a 4-core setup requires 6600MB:
Hi, I have another 2 ATLAS jobs that seem to be stalled - running but no CPU used - WU IDs 132365983 and 132365606. Last night I shut down the machine. This morning I switched it back on and checked the LHC jobs. At the moment there seem to be 5 Theory jobs and two ATLAS (2 CPUs) jobs running. That would mean that BOINC is using 9 CPUs, which is a bit funny, as I only allow 8 CPUs to be used and this hasn't changed since Friday.
current app_config (last changed 6-2-2020 12:13):
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2</cmdline>
  </app_version>
</app_config>
current ATLAS_vbox_2.00_job (last changed 7-2-2020 10:09):
<vbox_job>
  <os_name>Linux26_64</os_name>
  <memory_size_mb>5120</memory_size_mb>
  <enable_network/>
  <enable_remotedesktop/>
  <enable_shared_directory/>
  <copy_to_shared>init_data.xml</copy_to_shared>
  <completion_trigger_file>atlas_done</completion_trigger_file>
  <disable_automatic_checkpoints/>
  <enable_vm_savestate_usage/>
  <minimum_checkpoint_interval>900</minimum_checkpoint_interval>
  <pf_guest_port>80</pf_guest_port>
</vbox_job> |
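The CPU count in the post above can be reproduced with a short tally. It assumes, as the post's total of 9 implies, that each Theory job uses 1 CPU and each ATLAS job uses the 2 CPUs set by avg_ncpus in the app_config. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def committed_cpus(running: Iterable[Tuple[str, int, float]]) -> float:
    """Sum CPUs over (app name, number of running jobs, CPUs per job) entries."""
    return sum(count * cpus_per_job for _app, count, cpus_per_job in running)


if __name__ == "__main__":
    running = [
        ("Theory", 5, 1.0),  # assumption: single-core Theory jobs
        ("ATLAS", 2, 2.0),   # avg_ncpus from the app_config in the post
    ]
    allowed = 8
    used = committed_cpus(running)
    print(f"CPUs in use: {used}, allowed: {allowed}, over by: {used - allowed}")
```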
©2024 CERN