Message boards :
ATLAS application :
Bad WUs?
Author | Message |
---|---|
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I haven't seen it yet on native ATLAS. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10697859&offset=0&show_names=0&state=4&appid= |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 200,206,925 RAC: 46,843 |
"I haven't seen it yet on native ATLAS." It seems to break VirtualBox. I have seen two different problems: A) VMs run endlessly with less than 1% CPU usage. B) VMs get suspended after 10/20/30/40 seconds and become "unmanageable". This is spread across all my systems and different VirtualBox versions. So far today I have had to abort 56 tasks. Supporting BOINC, a great concept ! |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"I have seen two different problems:" I see both of them on the Rosetta python work units, which use VirtualBox. There is something very wrong with it, and I am surprised that Oracle has not figured it out. |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,798,559 RAC: 18,443 |
PCs with one CPU (VirtualBox 6.1.12) have had no problems so far. All the ones with faulty tasks are using 2 CPUs (VirtualBox 6.1.30). |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"PCs with one CPU (VirtualBox 6.1.12) have had no problems so far." That is interesting. My Rosetta machines have 24 or 32 CPUs (virtual cores). Someone needs to look into it. |
Send message Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
So far I have only seen version A) here. |
Send message Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
maeax wrote: "PCs with one CPU (VirtualBox 6.1.12) have had no problems so far." So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used? |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 200,206,925 RAC: 46,843 |
maeax wrote: For me it happens on VBox 6.1.16 AND 6.1.30. Both previously ran fine for days (6.1.30) or months (6.1.16). And I used the same number of cores and the same number of simultaneously running WUs in the past. Supporting BOINC, a great concept ! |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used?" Good question. I used to be able to fix it by going back to VBox 5.2.44, but that no longer seems to work. Going back is easy in Win10 but harder in Ubuntu, since Ubuntu 20.04.3 is not compatible with 5.2.44, only with 6.1.x. So I went back to Ubuntu 18.04.6 and VBox 5.2.44, but that still did not fix it on the Rosetta pythons. I have noticed, however, that setting BOINC to use only 50% of the CPUs reduces the problem. That is almost like running on full cores. Next, I am going to turn off virtual cores (not virtualization!) in the BIOS and see if that fixes it. For my AMD motherboard, that means disabling simultaneous multithreading (SMT) in the BIOS. Of course, you need to leave SVM (Secure Virtual Machine, AMD's virtualization support) enabled. |
Send message Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
What I also notice: the current batch of ATLAS tasks uses about 10% more RAM than the previous ones. |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,915,653 RAC: 128,265 |
Could anybody running one of the affected Windows computers try out the vboxwrapper that comes with CMS? It's just to find out whether this would solve the problem or not. I recently posted a comment about vboxwrapper at the forum of another project. It's not exactly the same issue, but I think it's worth trying out. Volunteers frequently affected by the postponed issue may try a different vboxwrapper. |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
Very little CPU usage on my VBox 6.1.30. I get a message "remote desktop not available". Tullio |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"I recently posted a comment about vboxwrapper at the forum of another project." I tried it on the Rosetta pythons, though I had to use the vboxwrapper from LHC on my Ubuntu machine, since it appears that BOINC has it only for Windows. However, I got a "checksum" error, even though I had modified the cc_config.xml. So it seems that the wrapper must be compatible with the app. I didn't see a way to disable the checksum in cc_config.xml. |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,798,559 RAC: 18,443 |
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=177409989 2021-12-06 23:14:19 (17360): Guest Log: 00:00:10.010461 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 312 006 000ns (GuestNow=1 638 828 858 953 428 000 ns GuestLast=1 638 832 447 265 434 000 ns fSetTimeLastLoop=true ) 2021-12-07 00:53:55 (17360): Status Report: Elapsed Time: '6000.000000' 2021-12-07 00:53:55 (17360): Status Report: CPU Time: '65.593750' 2021-12-07 02:34:00 (17360): Status Report: Elapsed Time: '12000.000000' 2021-12-07 02:34:00 (17360): Status Report: CPU Time: '92.265625' 2021-12-07 04:14:05 (17360): Status Report: Elapsed Time: '18000.000000' 2021-12-07 04:14:05 (17360): Status Report: CPU Time: '120.968750' 2021-12-07 05:54:11 (17360): Status Report: Elapsed Time: '24000.000000' 2021-12-07 05:54:11 (17360): Status Report: CPU Time: '148.703125' 2021-12-07 07:34:16 (17360): Status Report: Elapsed Time: '30000.000000' 2021-12-07 07:34:16 (17360): Status Report: CPU Time: '176.593750' 2021-12-07 09:14:21 (17360): Status Report: Elapsed Time: '36000.000000' 2021-12-07 09:14:21 (17360): Status Report: CPU Time: '205.218750' 2021-12-07 10:54:27 (17360): Status Report: Elapsed Time: '42000.000000' 2021-12-07 10:54:27 (17360): Status Report: CPU Time: '231.375000' 2021-12-07 12:23:16 (17360): Powering off VM. The same task finished successfully on CentOS from PRAGUELG2 with one CPU: 19:05:50 (64): wrapper (7.7.26015): starting 19:05:50 (64): wrapper: running run_atlas (--nthreads 1) [2021-12-07 19:05:50] Arguments: --nthreads 1 [2021-12-07 19:05:50] Threads: 1 [2021-12-08 04:34:32] -rw------- 1 boinc boinc 152504166 Dec 8 04:33 HITS.27537003._017275.pool.root.1 It seems to be a problem with using more than ONE core and NOT the vboxwrapper!! |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"It seems to be a problem with using more than ONE core and NOT the vboxwrapper!!" Good. I am glad there is a fix for it. But I would prefer that Oracle make their stuff compatible with virtual cores, so that we don't lose performance. Maybe it is not possible? |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,915,653 RAC: 128,265 |
I usually run ATLAS native singlecore but to test what happens I started an ATLAS native 4-core. Result: Something deeper in the ATLAS multicore scripts is broken! The task should write all task data to \slots\6 but is writing a couple of files to \slots\. This is a major error and needs urgent investigation by the developers! |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
I am checking but nothing changed as far as I can see in the last few days in the set up of ATLAS tasks. My own native tasks seem to run ok. Could there be some Windows/Vbox update causing the problems? I can update the vboxwrapper version used by ATLAS if someone confirms that this fixes the problems. |
Send message Joined: 28 Sep 04 Posts: 722 Credit: 48,342,058 RAC: 29,814 |
I also see the problems on Win10 with VBox 5.2.44. The tasks are set up as 4-core tasks on the web site, but I run them on just a single core (set up via app_config.xml). |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
My 4-core task is behaving normally. I think I got the wrapper changed, as the log shows "2021-12-08 16:22:04 (11704): Detected: vboxwrapper 26202". It's the new 26203 misreporting the number, as usual. About 25 min into the work unit, I have all 4 athena.py processes running. Virtual consoles 2 and 3 look normal. |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
Task just completed. I was able to run two 4-core tasks at once, so 8 processor cores were in use. I left SMT on, so that is 8 of 16 threads. I only have 16 GB of RAM and it was almost all in use, since each task takes about 6600 MB of memory. The second task should finish in about 45 more minutes, but I don't see any problems. I only had LHC / ATLAS running, no other projects or work. I will just let the machine continue and see if it gets any troublesome work units. |
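
The "problem A" symptom reported earlier in this thread (elapsed time keeps climbing while CPU time barely moves) can be checked mechanically from the vboxwrapper "Status Report" lines quoted above. The following is only a rough Python sketch, not part of BOINC or vboxwrapper: the function names and the 2% threshold are made up for illustration, and it assumes that elapsed and CPU time are reported in matching pairs and both start at zero.

```python
import re

# Matches vboxwrapper lines such as:
#   2021-12-07 00:53:55 (17360): Status Report: Elapsed Time: '6000.000000'
#   2021-12-07 00:53:55 (17360): Status Report: CPU Time: '65.593750'
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def cpu_usage_per_interval(log_text):
    """Return CPU time / elapsed time for each interval between reports."""
    elapsed, cpu = [], []
    for kind, value in STATUS_RE.findall(log_text):
        (elapsed if kind == "Elapsed Time" else cpu).append(float(value))

    usage = []
    prev_elapsed, prev_cpu = 0.0, 0.0  # assumption: counters start at zero
    for e, c in zip(elapsed, cpu):
        if e > prev_elapsed:
            usage.append((c - prev_cpu) / (e - prev_elapsed))
        prev_elapsed, prev_cpu = e, c
    return usage


def looks_stuck(log_text, threshold=0.02):
    """True if every reported interval used less CPU than the threshold.

    The 2% default is an arbitrary illustrative value, not a BOINC setting.
    """
    usage = cpu_usage_per_interval(log_text)
    return bool(usage) and all(u < threshold for u in usage)


if __name__ == "__main__":
    # First two status reports from the task log quoted in this thread.
    sample = """
    2021-12-07 00:53:55 (17360): Status Report: Elapsed Time: '6000.000000'
    2021-12-07 00:53:55 (17360): Status Report: CPU Time: '65.593750'
    2021-12-07 02:34:00 (17360): Status Report: Elapsed Time: '12000.000000'
    2021-12-07 02:34:00 (17360): Status Report: CPU Time: '92.265625'
    """
    for fraction in cpu_usage_per_interval(sample):
        print(f"{fraction:.1%}")
    print("stuck:", looks_stuck(sample))
```

For the two reports in the sample, the per-interval usage works out to roughly 1.1% and 0.4%, which is consistent with the "less than 1% CPU usage" description; a healthy multi-core task would be expected to report a value near or above 100% of one core.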
©2024 CERN