Message boards : ATLAS application : Bad WUs?

Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45813 - Posted: 8 Dec 2021, 16:09:22 UTC

ID: 45813
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 45814 - Posted: 8 Dec 2021, 16:44:45 UTC - in response to Message 45813.  

I haven't seen it yet on native ATLAS.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10697859&offset=0&show_names=0&state=4&appid=

It seems to damage VirtualBox.

I have seen two different problems:

A) VMs running endlessly with less than 1% CPU usage
B) VMs get suspended after 10/20/30/40 seconds and become "unmanageable". This occurs on all my systems and across different VirtualBox versions.

So far today I have had to abort 56 tasks.


Supporting BOINC, a great concept !
ID: 45814
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45815 - Posted: 8 Dec 2021, 16:48:26 UTC - in response to Message 45814.  

Yeti wrote:
I have seen two different problems:

A) VMs running endlessly with less than 1% CPU usage
B) VMs get suspended after 10/20/30/40 seconds and become "unmanageable". This occurs on all my systems and across different VirtualBox versions.

I see both of them on the Rosetta python work units, which use VirtualBox.

There is something very wrong with it, and I am surprised that Oracle has not figured it out.
ID: 45815
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,089,092
RAC: 103,852
Message 45816 - Posted: 8 Dec 2021, 16:50:00 UTC - in response to Message 45814.  
Last modified: 8 Dec 2021, 16:56:38 UTC

PCs with one CPU (VirtualBox 6.1.12) have had no problems so far.
All PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).
ID: 45816
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45817 - Posted: 8 Dec 2021, 17:03:37 UTC - in response to Message 45816.  

maeax wrote:
PCs with one CPU (VirtualBox 6.1.12) have had no problems so far.
All PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).

That is interesting. My Rosetta machines have 24 or 32 CPUs (virtual cores). Someone needs to look into it.
ID: 45817
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,350,846
RAC: 101,546
Message 45818 - Posted: 8 Dec 2021, 17:19:04 UTC - in response to Message 45814.  


Yeti wrote:
I have seen two different problems:

A) VMs running endlessly with less than 1% CPU usage
B) VMs get suspended after 10/20/30/40 seconds and become "unmanageable". This occurs on all my systems and across different VirtualBox versions.

So far, I have only seen variant A) here.
ID: 45818
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,350,846
RAC: 101,546
Message 45819 - Posted: 8 Dec 2021, 17:40:50 UTC - in response to Message 45816.  

maeax wrote:
PCs with one CPU (VirtualBox 6.1.12) have had no problems so far.
All PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).

So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used?
ID: 45819
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 45820 - Posted: 8 Dec 2021, 17:52:13 UTC - in response to Message 45819.  

maeax wrote:
PCs with one CPU (VirtualBox 6.1.12) have had no problems so far.
All PCs with faulty tasks are using 2 CPUs (VirtualBox 6.1.30).

Erich56 wrote:
So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used?

For me it happens on VBox 6.1.16 and 6.1.30; both previously ran fine for days (6.1.30) or months (6.1.16).

I also used the same number of cores and the same number of simultaneously running WUs in the past.


Supporting BOINC, a great concept !
ID: 45820
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45821 - Posted: 8 Dec 2021, 18:02:48 UTC - in response to Message 45819.  

Erich56 wrote:
So the question seems to be: is the problem connected to the VBox version or to the number of CPUs used?

Good question. I used to be able to fix it by going back to VBox 5.2.44, but that no longer seems to work.
Going back is easy in Win10 but harder in Ubuntu, since Ubuntu 20.04.3 is not compatible with 5.2.44, only with 6.1.x.
So I went back to Ubuntu 18.04.6 and VBox 5.2.44, but that still did not fix it on the Rosetta pythons.

I have noticed, however, that setting BOINC to use only 50% of the CPUs reduces the problem. That is almost like running on full cores.
Next, I am going to turn off virtual cores (not virtualization!) in the BIOS and see if that fixes it.
On my AMD motherboard, that means disabling simultaneous multithreading (SMT) in the BIOS.
Of course, you need to leave Secure Virtual Machine (SVM) mode enabled.
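
For anyone repeating this test on Linux, one way to confirm that SMT is really off after the BIOS change is to read the kernel's sysfs flag. This is only a minimal sketch, assuming a reasonably recent Linux kernel that exposes /sys/devices/system/cpu/smt/active; the function name smt_active is made up for illustration, and the check does not apply to Windows hosts.

```python
from pathlib import Path
from typing import Optional

# Assumption: the running kernel exposes this sysfs file (recent Linux only).
SMT_ACTIVE = Path("/sys/devices/system/cpu/smt/active")


def smt_active(path: Path = SMT_ACTIVE) -> Optional[bool]:
    """Return True/False if the kernel reports the SMT state, None if unknown."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # File missing or unreadable (older kernel, non-Linux system, ...).
        return None


print("SMT active:", smt_active())
```

A result of None only means the state could not be determined this way, not that SMT is off.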
ID: 45821
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,350,846
RAC: 101,546
Message 45822 - Posted: 8 Dec 2021, 18:23:58 UTC

What I also notice: the current batch of ATLAS tasks uses about 10% more RAM than the previous ones.
ID: 45822
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,915,254
RAC: 138,113
Message 45823 - Posted: 8 Dec 2021, 18:28:47 UTC

Could anybody running one of the affected Windows computers try out the vboxwrapper that comes with CMS?
It's just to find out whether this would solve the problem or not.


I recently posted a comment about vboxwrapper on the forum of another project.
It's not exactly the same issue, but I think it's worth trying out.

Volunteers frequently affected by the postponed issue may try a different vboxwrapper.

BOINC's wiki pages mention communication problems between vboxwrapper and VirtualBox 6.x, especially on Windows.
They offer premade executables that may solve the problems:
https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables

It would be the job of the project developers to test those vboxwrappers and distribute them to the clients.
As long as this has not been done, volunteers could use the following steps as a workaround:

1. Download an alternative vboxwrapper from the page mentioned above (or use one you got from another project, e.g. LHC@home)
2. Start the BOINC client but suspend computing
3. Change to the project directory, e.g. projects/www.cosmologyathome.org, and replace the vboxwrapper there with the test version; the filename must be the name of the old vboxwrapper
4. Resume computing -> check the logfiles of tasks started after the patch


Each restart of the BOINC client will replace the patch with the original vboxwrapper from the project server.
This can be avoided by setting <dont_check_file_sizes>1</dont_check_file_sizes> in cc_config.xml, but then all other automatic updates will also stop working.
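
Step 3 of the workaround above (replacing the wrapper while keeping the old filename) could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested tool: the function name swap_vboxwrapper, the backup location and all paths in the usage comment are hypothetical, and it assumes computing is already suspended as in step 2.

```python
import shutil
from pathlib import Path


def swap_vboxwrapper(project_dir: Path, old_name: str,
                     replacement: Path, backup_dir: Path) -> Path:
    """Overwrite project_dir/old_name with `replacement`, keeping a backup.

    Returns the path of the backup copy of the original wrapper.
    """
    target = project_dir / old_name
    if not target.is_file():
        raise FileNotFoundError(f"no existing wrapper at {target}")
    if not replacement.is_file():
        raise FileNotFoundError(f"replacement not found: {replacement}")

    backup = backup_dir / (old_name + ".orig")
    if not backup.exists():
        # Keep the project's original so it can be restored by hand.
        shutil.copy2(target, backup)

    # The test version must carry the old wrapper's filename (step 3).
    shutil.copy2(replacement, target)
    return backup


# Hypothetical usage; every path and filename here is a placeholder:
# swap_vboxwrapper(Path("projects/www.cosmologyathome.org"),
#                  "<name of the old vboxwrapper>",
#                  Path("downloads/<alternative vboxwrapper>"),
#                  Path("backups"))
```

Keeping the backup outside the project directory is just a cautious choice here; restoring the original is a matter of copying it back under the same name.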
ID: 45823
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 45824 - Posted: 8 Dec 2021, 18:34:26 UTC

Very little CPU usage on my VBox 6.1.30. I get a message "remote desktop not available".
Tullio
ID: 45824
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45825 - Posted: 8 Dec 2021, 18:42:09 UTC - in response to Message 45823.  

computezrmle wrote:
I recently posted a comment about vboxwrapper on the forum of another project.
It's not exactly the same issue, but I think it's worth trying out.

I tried it on the Rosetta pythons, though I had to use the vboxwrapper from LHC on my Ubuntu machine, since it appears that BOINC has it only for Windows.

However, I got a "checksum" error, even though I had modified the cc_config.xml. So it seems that the wrapper must be compatible with the app.
I didn't see a way to disable the checksum in cc_config.xml.
ID: 45825
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,089,092
RAC: 103,852
Message 45826 - Posted: 8 Dec 2021, 19:03:33 UTC - in response to Message 45825.  
Last modified: 8 Dec 2021, 19:13:24 UTC

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=177409989
2021-12-06 23:14:19 (17360): Guest Log: 00:00:10.010461 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 312 006 000ns (GuestNow=1 638 828 858 953 428 000 ns GuestLast=1 638 832 447 265 434 000 ns fSetTimeLastLoop=true )

2021-12-07 00:53:55 (17360): Status Report: Elapsed Time: '6000.000000'
2021-12-07 00:53:55 (17360): Status Report: CPU Time: '65.593750'
2021-12-07 02:34:00 (17360): Status Report: Elapsed Time: '12000.000000'
2021-12-07 02:34:00 (17360): Status Report: CPU Time: '92.265625'
2021-12-07 04:14:05 (17360): Status Report: Elapsed Time: '18000.000000'
2021-12-07 04:14:05 (17360): Status Report: CPU Time: '120.968750'
2021-12-07 05:54:11 (17360): Status Report: Elapsed Time: '24000.000000'
2021-12-07 05:54:11 (17360): Status Report: CPU Time: '148.703125'
2021-12-07 07:34:16 (17360): Status Report: Elapsed Time: '30000.000000'
2021-12-07 07:34:16 (17360): Status Report: CPU Time: '176.593750'
2021-12-07 09:14:21 (17360): Status Report: Elapsed Time: '36000.000000'
2021-12-07 09:14:21 (17360): Status Report: CPU Time: '205.218750'
2021-12-07 10:54:27 (17360): Status Report: Elapsed Time: '42000.000000'
2021-12-07 10:54:27 (17360): Status Report: CPU Time: '231.375000'
2021-12-07 12:23:16 (17360): Powering off VM.

The same task was finished successfully with CentOS from PRAGUELG2 with one CPU.
19:05:50 (64): wrapper (7.7.26015): starting
19:05:50 (64): wrapper: running run_atlas (--nthreads 1)
[2021-12-07 19:05:50] Arguments: --nthreads 1

[2021-12-07 19:05:50] Threads: 1
[2021-12-08 04:34:32] -rw------- 1 boinc boinc 152504166 Dec 8 04:33 HITS.27537003._017275.pool.root.1

This seems to be a problem with using more than ONE core, NOT with the vboxwrapper!
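
The "Status Report" lines quoted above make the stall easy to quantify: CPU time divided by elapsed time gives the share of wall-clock time the VM actually spent computing. Below is a minimal sketch in Python, assuming the log format shown in this post; the function name cpu_efficiency and the two-pair sample are just for illustration.

```python
import re

# Matches the vboxwrapper "Status Report" lines quoted in this post.
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def cpu_efficiency(log_text: str) -> list[tuple[float, float, float]]:
    """Pair each Elapsed Time report with the CPU Time report that follows it.

    Returns (elapsed_s, cpu_s, cpu_s / elapsed_s) tuples in log order.
    """
    rows = []
    elapsed = None
    for kind, value in STATUS_RE.findall(log_text):
        if kind == "Elapsed Time":
            elapsed = float(value)
        elif elapsed:  # a CPU Time line following a non-zero Elapsed Time
            cpu = float(value)
            rows.append((elapsed, cpu, cpu / elapsed))
            elapsed = None
    return rows


SAMPLE = """\
2021-12-07 00:53:55 (17360): Status Report: Elapsed Time: '6000.000000'
2021-12-07 00:53:55 (17360): Status Report: CPU Time: '65.593750'
2021-12-07 10:54:27 (17360): Status Report: Elapsed Time: '42000.000000'
2021-12-07 10:54:27 (17360): Status Report: CPU Time: '231.375000'
"""

for elapsed, cpu, ratio in cpu_efficiency(SAMPLE):
    print(f"{elapsed:>8.0f} s elapsed, {cpu:>8.2f} s CPU, {ratio:.2%}")
```

For the first and last pairs in the log this works out to roughly 1.1% and 0.55% CPU usage, which matches the "running endlessly with almost no CPU" symptom described earlier in the thread.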
ID: 45826
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45827 - Posted: 8 Dec 2021, 19:13:49 UTC - in response to Message 45826.  

maeax wrote:
This seems to be a problem with using more than ONE core, NOT with the vboxwrapper!

Good. I am glad there is a fix for it.
But I would prefer that Oracle make their stuff compatible with virtual cores, so that we don't lose performance.
Maybe that is not possible?
ID: 45827
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,915,254
RAC: 138,113
Message 45828 - Posted: 8 Dec 2021, 20:13:21 UTC

I usually run ATLAS native single-core, but to test what happens I started an ATLAS native 4-core task.

Result:
Something deeper in the ATLAS multicore scripts is broken!
The task should write all task data to \slots\6 but is writing a couple of files to \slots\.
This is a major error and needs urgent investigation by the developers!
ID: 45828
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 45829 - Posted: 8 Dec 2021, 21:37:46 UTC

I am checking, but as far as I can see nothing has changed in the setup of ATLAS tasks in the last few days. My own native tasks seem to run OK.

Could there be some Windows/Vbox update causing the problems? I can update the vboxwrapper version used by ATLAS if someone confirms that this fixes the problems.
ID: 45829
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 45830 - Posted: 8 Dec 2021, 22:02:05 UTC

I also see the problems on Win10 with VBox 5.2.44. The tasks are set up as 4-core tasks on the website, but I run them using just a single core (set up via app_config.xml).
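
For readers who have not used app_config.xml before, a single-core override of this kind could be generated along the following lines. This is a hedged sketch, not the poster's actual file: the app name "ATLAS" and the plan class "vbox64_mt_mcore_atlas" are assumptions (the real values should be checked in the client's client_state.xml), the --nthreads argument is borrowed from the native log lines quoted earlier in the thread, and build_app_config is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int) -> str:
    """Build the text of an app_config.xml that limits one app version to `ncpus` cores."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Tell the BOINC scheduler how many cores to budget for the task ...
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # ... and pass the same number to the application itself.
    ET.SubElement(version, "cmdline").text = f"--nthreads {ncpus}"
    ET.indent(root)  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


# App name and plan class below are assumptions, see the note above.
print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas", 1))
```

The output would go into a file named app_config.xml in the project's directory, after which the client has to re-read its config files (or be restarted) for the change to apply.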
ID: 45830
Jonathan
Joined: 25 Sep 17
Posts: 93
Credit: 3,068,664
RAC: 1,808
Message 45832 - Posted: 8 Dec 2021, 22:43:14 UTC
Last modified: 8 Dec 2021, 22:48:28 UTC

My 4 core task is behaving normally.
I think I got the wrapper changed, as the log shows "2021-12-08 16:22:04 (11704): Detected: vboxwrapper 26202". It's the new 26203 misreporting the number, as usual.

About 25 min into the work unit, all 4 athena.py processes are running. Virtual consoles 2 and 3 look normal.
ID: 45832
Jonathan
Joined: 25 Sep 17
Posts: 93
Credit: 3,068,664
RAC: 1,808
Message 45833 - Posted: 9 Dec 2021, 1:05:00 UTC

Task just completed.
I was able to run two 4-core tasks at once, so 8 processor cores were in use. I left SMT on, so that is 8 of 16 threads. I only have 16 GB of RAM and almost all of it was in use, since each task takes about 6600 MB of memory.
The second task should finish in about 45 more minutes, and I don't see any problems. I only had LHC / ATLAS running, with no other projects or work.

I will just let the machine continue and see if it gets any troublesome work units.
ID: 45833


©2024 CERN