Bad WUs?

Author	Message
Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Message 45813 - Posted: 8 Dec 2021, 16:09:22 UTC
I haven't seen it yet on native ATLAS.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10697859&offset=0&show_names=0&state=4&appid=

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 214,623,697 RAC: 48,316
Message 45814 - Posted: 8 Dec 2021, 16:44:45 UTC - in response to Message 45813.
Jim1348 wrote: I haven't seen it yet on native ATLAS. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10697859&offset=0&show_names=0&state=4&appid=
It seems to damage VirtualBox. I have seen two different problems:
A) VMs run endlessly with less than 1% CPU usage.
B) VMs get suspended after 10/20/30/40 seconds and are "unmanageable".
This has spread across all my systems and different VirtualBox versions. So far today I have had to abort 56 tasks.
Supporting BOINC, a great concept !

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Message 45815 - Posted: 8 Dec 2021, 16:48:26 UTC - in response to Message 45814.
Yeti wrote: I have seen two different problems: A) VMs run endlessly with less than 1% CPU usage. B) VMs get suspended after 10/20/30/40 seconds and are "unmanageable". This has spread across all my systems and different VirtualBox versions.
I see both of them on the Rosetta python work units, which use VirtualBox. There is something very wrong with it, and I am surprised that Oracle has not figured it out.

maeax Joined: 2 May 07 Posts: 2276 Credit: 177,809,061 RAC: 95,800
Message 45816 - Posted: 8 Dec 2021, 16:50:00 UTC - in response to Message 45814. Last modified: 8 Dec 2021, 16:56:38 UTC
The PC with one CPU (VirtualBox 6.1.12) has had no problems so far. All the PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Message 45817 - Posted: 8 Dec 2021, 17:03:37 UTC - in response to Message 45816.
maeax wrote: The PC with one CPU (VirtualBox 6.1.12) has had no problems so far. All the PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).
That is interesting. My Rosetta machines have 24 or 32 CPUs (virtual cores). Someone needs to look into it.

Erich56 Joined: 18 Dec 15 Posts: 1906 Credit: 144,230,269 RAC: 72,965
Message 45818 - Posted: 8 Dec 2021, 17:19:04 UTC - in response to Message 45814.
Yeti wrote: I have seen two different problems: A) VMs run endlessly with less than 1% CPU usage. B) VMs get suspended after 10/20/30/40 seconds and are "unmanageable". This has spread across all my systems and different VirtualBox versions.
So far, I have only seen version A) here.

Erich56 Joined: 18 Dec 15 Posts: 1906 Credit: 144,230,269 RAC: 72,965
Message 45819 - Posted: 8 Dec 2021, 17:40:50 UTC - in response to Message 45816.
maeax wrote: The PC with one CPU (VirtualBox 6.1.12) has had no problems so far. All the PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).
So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used?

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 214,623,697 RAC: 48,316
Message 45820 - Posted: 8 Dec 2021, 17:52:13 UTC - in response to Message 45819.
maeax wrote: The PC with one CPU (VirtualBox 6.1.12) has had no problems so far. All the PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).
Erich56 wrote: So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used?
For me it happens on VBox 6.1.16 AND 6.1.30. Both previously ran fine for days (6.1.30) or months (6.1.16). And I used the same number of cores in the past and the same number of simultaneously running WUs.
Supporting BOINC, a great concept !

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Message 45821 - Posted: 8 Dec 2021, 18:02:48 UTC - in response to Message 45819.
Erich56 wrote: So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used?
Good question. I used to be able to fix it by going back to VBox 5.2.44, but that no longer seems to work. Downgrading is easy in Win10 but harder in Ubuntu, since Ubuntu 20.04.3 is not compatible with 5.2.44, only with 6.1.x. So I went back to Ubuntu 18.04.6 and VBox 5.2.44, but that still did not fix it on the Rosetta pythons.
I have noticed, however, that setting BOINC to use only 50% of the CPUs reduces the problem. That is almost like operating on full cores. Next, I am going to turn off virtual cores (not virtualization!) in the BIOS and see if that fixes it. On my AMD motherboard, that means disabling simultaneous multithreading (SMT) in the BIOS. Of course, you need to leave Secure Virtual Machine (SVM) enabled.

Erich56 Joined: 18 Dec 15 Posts: 1906 Credit: 144,230,269 RAC: 72,965
Message 45822 - Posted: 8 Dec 2021, 18:23:58 UTC
What I also notice: the current batch of ATLAS tasks uses about 10% more RAM than the previous ones.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2679 Credit: 286,693,798 RAC: 92,420
Message 45823 - Posted: 8 Dec 2021, 18:28:47 UTC
Could anybody running one of the affected Windows computers try out the vboxwrapper that comes with CMS? It's just to find out whether this would solve the problem or not.
I recently posted a comment about vboxwrapper at the forum of another project. It's not exactly the same issue, but I think it's worth trying out:
Volunteers frequently affected by the postponed issue may try a different vboxwrapper. BOINC's wiki pages mention communication problems between vboxwrapper and VirtualBox 6.x, especially on Windows. They offer premade executables that may solve the problems:
https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables
It would be the job of the project developers to test those vboxwrappers and distribute them to the clients. As long as this is not done, volunteers could use the following steps as a workaround:
1. Download an alternative vboxwrapper from the page mentioned above (or use one you got from another project, e.g. LHC@home).
2. Start the BOINC client but suspend computing.
3. Change to the project directory, e.g. projects/www.cosmologyathome.org, and replace the vboxwrapper there with the test version; the filename must be the name of the old vboxwrapper.
4. Resume computing, then check the logfiles of tasks started after the patch.
Each restart of the BOINC client will replace the patch with the original vboxwrapper from the project server. This can be avoided by setting <dont_check_file_sizes>1</dont_check_file_sizes> in cc_config.xml, but then all other automatic updates will also stop working.
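The workaround steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names `swap_vboxwrapper` and `set_dont_check_file_sizes` are hypothetical, the paths are whatever your own installation uses, and computing is assumed to be suspended while the file is swapped. From the post it takes only two facts: the replacement must carry the old wrapper's filename, and `<dont_check_file_sizes>` goes into cc_config.xml. Placing the flag under `<options>` follows the usual cc_config.xml layout.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def swap_vboxwrapper(project_dir: Path, old_name: str, replacement: Path) -> Path:
    """Copy `replacement` over project_dir/old_name, keeping the old filename.

    Returns the path of a backup copy of the project's original wrapper.
    """
    target = project_dir / old_name
    if not target.is_file():
        raise FileNotFoundError(f"no wrapper named {old_name} in {project_dir}")
    backup = target.with_name(target.name + ".orig")  # hypothetical backup name
    shutil.copy2(target, backup)       # keep the original so it can be restored
    shutil.copy2(replacement, target)  # the test version now has the old name
    return backup


def set_dont_check_file_sizes(cc_config: Path) -> None:
    """Set <dont_check_file_sizes>1</dont_check_file_sizes> under <options>."""
    if cc_config.exists():
        tree = ET.parse(str(cc_config))
        root = tree.getroot()
    else:
        root = ET.Element("cc_config")
        tree = ET.ElementTree(root)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    tree.write(str(cc_config))
```

The backup makes it easy to put the project's own wrapper back once testing is done, which matters because the flag also disables other automatic updates.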

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0
Message 45824 - Posted: 8 Dec 2021, 18:34:26 UTC
Very little CPU usage on my VBox 6.1.30. I get a message "remote desktop not available".
Tullio

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Message 45825 - Posted: 8 Dec 2021, 18:42:09 UTC - in response to Message 45823.
computezrmle wrote: I recently posted a comment about vboxwrapper at the forum of another project. It's not exactly the same issue, but I think it's worth trying out.
I tried it on the Rosetta pythons, though I had to use the vboxwrapper from LHC on my Ubuntu machine, since it appears that BOINC has it only for Windows. However, I got a "checksum" error, even though I had modified the cc_config.xml. So it seems that the wrapper must be compatible with the app. I didn't see a way to disable the checksum in cc_config.xml.

maeax Joined: 2 May 07 Posts: 2276 Credit: 177,809,061 RAC: 95,800
Message 45826 - Posted: 8 Dec 2021, 19:03:33 UTC - in response to Message 45825. Last modified: 8 Dec 2021, 19:13:24 UTC
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=177409989
2021-12-06 23:14:19 (17360): Guest Log: 00:00:10.010461 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 312 006 000ns (GuestNow=1 638 828 858 953 428 000 ns GuestLast=1 638 832 447 265 434 000 ns fSetTimeLastLoop=true )
2021-12-07 00:53:55 (17360): Status Report: Elapsed Time: '6000.000000'
2021-12-07 00:53:55 (17360): Status Report: CPU Time: '65.593750'
2021-12-07 02:34:00 (17360): Status Report: Elapsed Time: '12000.000000'
2021-12-07 02:34:00 (17360): Status Report: CPU Time: '92.265625'
2021-12-07 04:14:05 (17360): Status Report: Elapsed Time: '18000.000000'
2021-12-07 04:14:05 (17360): Status Report: CPU Time: '120.968750'
2021-12-07 05:54:11 (17360): Status Report: Elapsed Time: '24000.000000'
2021-12-07 05:54:11 (17360): Status Report: CPU Time: '148.703125'
2021-12-07 07:34:16 (17360): Status Report: Elapsed Time: '30000.000000'
2021-12-07 07:34:16 (17360): Status Report: CPU Time: '176.593750'
2021-12-07 09:14:21 (17360): Status Report: Elapsed Time: '36000.000000'
2021-12-07 09:14:21 (17360): Status Report: CPU Time: '205.218750'
2021-12-07 10:54:27 (17360): Status Report: Elapsed Time: '42000.000000'
2021-12-07 10:54:27 (17360): Status Report: CPU Time: '231.375000'
2021-12-07 12:23:16 (17360): Powering off VM.
The same task finished successfully with CentOS on PRAGUELG2 with one CPU:
19:05:50 (64): wrapper (7.7.26015): starting
19:05:50 (64): wrapper: running run_atlas (--nthreads 1)
[2021-12-07 19:05:50] Arguments: --nthreads 1
[2021-12-07 19:05:50] Threads: 1
[2021-12-08 04:34:32] -rw------- 1 boinc boinc 152504166 Dec 8 04:33 HITS.27537003._017275.pool.root.1
It seems to be a problem with using more than ONE core and NOT with the vboxwrapper!!
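To put a number on "VMs running with very little CPU usage", the Status Report lines in a log like the one above can be turned into a CPU-time-to-elapsed-time ratio. The following is a minimal sketch, assuming the log keeps the exact `Status Report: Elapsed Time: '…'` / `Status Report: CPU Time: '…'` wording shown in this post; the function name `cpu_utilisation` is hypothetical. For the first pair above the ratio is 65.59 / 6000, roughly 1.1%, and for the last pair 231.375 / 42000, roughly 0.55%.

```python
import re

# Matches both kinds of Status Report line and captures the quoted number.
_REPORT = re.compile(r"Status Report: (Elapsed|CPU) Time: '([0-9.]+)'")


def cpu_utilisation(log_text: str) -> list[tuple[float, float, float]]:
    """Return (elapsed_s, cpu_s, cpu_s / elapsed_s) for each Elapsed/CPU pair."""
    rows = []
    elapsed = None
    for kind, value in _REPORT.findall(log_text):
        if kind == "Elapsed":
            elapsed = float(value)
        elif elapsed is not None and elapsed > 0:
            # A CPU line directly following an Elapsed line completes a pair.
            cpu = float(value)
            rows.append((elapsed, cpu, cpu / elapsed))
            elapsed = None
    return rows


sample = """\
2021-12-07 00:53:55 (17360): Status Report: Elapsed Time: '6000.000000'
2021-12-07 00:53:55 (17360): Status Report: CPU Time: '65.593750'
2021-12-07 10:54:27 (17360): Status Report: Elapsed Time: '42000.000000'
2021-12-07 10:54:27 (17360): Status Report: CPU Time: '231.375000'
"""

for elapsed, cpu, ratio in cpu_utilisation(sample):
    print(f"{elapsed:>8.0f} s elapsed  {cpu:>8.2f} s CPU  {ratio:.2%}")
```

A healthy multi-core task would be expected to show a ratio near or above 100%, so values around 1% point to a VM that is idle rather than merely slow.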

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Message 45827 - Posted: 8 Dec 2021, 19:13:49 UTC - in response to Message 45826.
maeax wrote: It seems to be a problem with using more than ONE core and NOT with the vboxwrapper!!
Good. I am glad there is a fix for it. But I would prefer that Oracle make their stuff compatible with virtual cores, so that we don't lose performance. Maybe it is not possible?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2679 Credit: 286,693,798 RAC: 92,420
Message 45828 - Posted: 8 Dec 2021, 20:13:21 UTC
I usually run ATLAS native singlecore, but to test what happens I started an ATLAS native 4-core task.
Result: Something deeper in the ATLAS multicore scripts is broken! The task should write all task data to \slots\6 but is writing a couple of files to \slots\. This is a major error and needs urgent investigation by the developers!
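One way to check a host for the misplaced files described above is to list anything that sits directly in the slots directory. This is a rough sketch, assuming that a BOINC slots directory normally contains only numbered subdirectories (such as the slot 6 mentioned in the post); the function name `stray_slot_files` is hypothetical and the location of the slots directory depends on the installation.

```python
from pathlib import Path


def stray_slot_files(slots_dir: Path) -> list[Path]:
    """Return files located directly in slots/ instead of in a slot subdirectory."""
    # Subdirectories (the numbered slots) are skipped; only loose files count.
    return sorted(p for p in slots_dir.iterdir() if p.is_file())
```

An empty result would suggest the task kept its data inside its own slot; any entries are candidates for the stray files reported here.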

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0
Message 45829 - Posted: 8 Dec 2021, 21:37:46 UTC
I am checking, but as far as I can see nothing has changed in the setup of ATLAS tasks in the last few days. My own native tasks seem to run OK. Could there be some Windows/VBox update causing the problems?
I can update the vboxwrapper version used by ATLAS if someone confirms that this fixes the problems.

Harri Liljeroos Joined: 28 Sep 04 Posts: 780 Credit: 59,601,305 RAC: 44,607
Message 45830 - Posted: 8 Dec 2021, 22:02:05 UTC
I also see the problems on Win10 with VBox 5.2.44. The tasks are set up as 4-core tasks on the web site, but I run them using just a single core (set up via app_config.xml).

Jonathan Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0
Message 45832 - Posted: 8 Dec 2021, 22:43:14 UTC Last modified: 8 Dec 2021, 22:48:28 UTC
My 4-core task is behaving normally. I think I got the wrapper changed, as the log shows "2021-12-08 16:22:04 (11704): Detected: vboxwrapper 26202". It's the new 26203 misreporting the number, as usual.
About 25 minutes into the work unit, I have all 4 athena.py processes running. Virtual consoles 2 and 3 look normal.

Jonathan Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0
Message 45833 - Posted: 9 Dec 2021, 1:05:00 UTC
The task just completed. I was able to run two 4-core tasks at once, so 8 processor cores were in use. I left SMT on, so 8 of 16 threads were in use. I only have 16 GB of RAM and almost all of it was in use, with each task taking 6600 MB of memory. The second task should finish in about 45 more minutes, and I don't see any problems.
I only had LHC / ATLAS running, no other projects or work. I will just let the machine continue and see if it gets any troublesome work units.

LHC@home