Message boards : ATLAS application : ATLAS task at 100% for 10h


Der Martinator

Joined: 21 Dec 07
Posts: 3
Credit: 1,092,719
RAC: 0
Message 43549 - Posted: 2 Nov 2020, 7:37:37 UTC

Dear all,

My PC is running an ATLAS task with 8 cores. At the beginning it said it would need about 1.5h; now it has already been running for 34h, the percentage of completion crept closer and closer to 100%, and it has been at 100% for about 10h. Will it finish by itself, or is something wrong?
Meanwhile I also updated BOINC to 7.16.11 (with the corresponding VirtualBox)...

I am running an Intel i9 and Win10.

And somehow my CPU load is currently at nearly 0%, even though some tasks are running (Theory simulation).

Anybody with hints?

Der Martinator
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,517,991
RAC: 124,188
Message 43550 - Posted: 2 Nov 2020, 9:14:12 UTC - in response to Message 43549.  

... at the beginning it said it needs about 1,5h and now it is already running 34h and the percentage of completion approached 100% more and more and now it has been at 100% for about 10h...

These are fake numbers - as mentioned many times in this forum.

I guess it's this computer:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10659409

Since an 8-core setup is very inefficient (also explained many times), I would suggest running a 4-core setup and limiting it to 3 concurrently running tasks.


To-dos:
Set "max #CPUs" to 4 on your preferences page https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project

Limit concurrency with an app_config.xml like this:
<app_config>
 <app>
  <name>ATLAS</name>
  <max_concurrent>3</max_concurrent>
 </app>
</app_config>
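If you would rather generate the file than type it, here is a minimal Python sketch using only the standard library. The helper name `build_app_config` is hypothetical, and it emits only the elements from the example above (`app`, `name`, `max_concurrent`); where the file has to be placed is not covered here.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text limiting how many tasks of one app run at once.

    Hypothetical helper: it emits only the elements shown in the example above.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root, space=" ")  # Python 3.9+; one-space indent like the example
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config("ATLAS", 3))
```

Printing the result should reproduce the app_config.xml shown above; the check on `max_concurrent` just guards against a value that would block the app entirely.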


You may also consider installing the VirtualBox Extension Pack, as it would allow you to access the ATLAS progress monitoring at ALT-F2.
Der Martinator

Joined: 21 Dec 07
Posts: 3
Credit: 1,092,719
RAC: 0
Message 43551 - Posted: 2 Nov 2020, 10:51:44 UTC - in response to Message 43550.  

Dear computezrmle,

Thanks for your info!
Yes, you are right, that computer is the PC we are talking about...
Well, my CPU should have 16 cores, of which I am now using 8, and it would be kind of a waste of resources if I limited it to only 3 cores...

I am currently using BOINC (also with BOINCstats). Should I then set "max #CPUs" on the LHC site? I would be sorry if my PC could only use 4 cores for LHC...

Another question: what should I do with the 8-core tasks that are already assigned to my PC and seem to run almost indefinitely? Will they finish by themselves, or should I abort them?

Thanks!
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,517,991
RAC: 124,188
Message 43552 - Posted: 2 Nov 2020, 11:12:44 UTC - in response to Message 43551.  

ATLAS is a multicore app.
Using a 4-core setup means that each task will use 4 (virtual) cores.
3 ATLAS tasks will then use 12 cores and leave 4 cores available for the OS and other stuff.
A 4th 4-core ATLAS task would then allocate all remaining cores.

Be aware that each ATLAS task uses 1 core during setup and phase-out and leaves all other cores allocated but idle.
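The core arithmetic above can be checked with a small sketch. `core_usage` is a hypothetical helper; it assumes each task keeps all of its cores allocated for its whole runtime, and `idle_during_setup` is the worst case where every task is in setup or phase-out at the same moment.

```python
def core_usage(total_cores: int, cores_per_task: int, tasks: int) -> dict:
    """Cores allocated by concurrently running multicore tasks, and cores left over.

    Assumes every task holds all of its cores for its whole runtime.
    """
    allocated = cores_per_task * tasks
    if allocated > total_cores:
        raise ValueError("tasks would need more cores than the machine has")
    return {
        "allocated": allocated,
        "free": total_cores - allocated,
        # During setup and phase-out only 1 core per task does work, so the
        # other cores_per_task - 1 sit allocated but idle (worst case: all
        # tasks are in setup or phase-out at once).
        "idle_during_setup": tasks * (cores_per_task - 1),
    }


print(core_usage(16, 4, 3))  # {'allocated': 12, 'free': 4, 'idle_during_setup': 9}
print(core_usage(16, 8, 2))  # {'allocated': 16, 'free': 0, 'idle_during_setup': 14}
```

With smaller tasks, fewer cores are stranded while a task is in its single-core setup or phase-out.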

In addition, there is the question of RAM allocation.
Two 8-core tasks will allocate 20400 MB of RAM, while three 4-core tasks will allocate 19800 MB.
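The two RAM figures fit a simple linear rule. The rule used below (3000 MB base plus 900 MB per core) is an assumption inferred from the numbers in this thread (10200 MB for one 8-core VM, 19800 MB / 3 = 6600 MB for one 4-core VM), not a formula taken from project documentation.

```python
def atlas_vm_ram_mb(cores: int) -> int:
    # ASSUMPTION: 3000 MB base + 900 MB per core. Inferred from the figures
    # in this thread (10200 MB per 8-core VM, 6600 MB per 4-core VM), not
    # taken from project documentation.
    return 3000 + 900 * cores


two_8core = 2 * atlas_vm_ram_mb(8)
three_4core = 3 * atlas_vm_ram_mb(4)
print(two_8core, three_4core)  # 20400 19800
```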


Regarding the running 8-core task, it's up to you whether to abort it or not.
Der Martinator

Joined: 21 Dec 07
Posts: 3
Credit: 1,092,719
RAC: 0
Message 43559 - Posted: 3 Nov 2020, 4:52:17 UTC - in response to Message 43552.  

Thanks for the explanation!
I set up BOINC to use at most 50% of the cores, which results in 8. The other 8 cores should always remain free.
There was a problem during the BOINC update where VirtualBox was not properly installed, so I installed it manually, and now at least the other LHC tasks such as Theory simulation are doing fine.
I also installed the Extension Pack afterwards, but when I look at the console of this ATLAS (8-core) task, there is just "CentOS and a Linux login", nothing more.
And my whole CPU is idle: only 1-3% load, used by Win10.

I presume something is wrong with these tasks... All the others are running well, loading the CPU, producing console output and finishing in a reasonable time...

Any idea?
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,874,648
RAC: 126,110
Message 43560 - Posted: 3 Nov 2020, 6:37:08 UTC
Last modified: 3 Nov 2020, 6:38:37 UTC

This checklist from Yeti helps with troubleshooting:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4302#30658
Also, I see neither a country nor "International" in your country setting.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,517,991
RAC: 124,188
Message 43561 - Posted: 3 Nov 2020, 13:21:18 UTC - in response to Message 43559.  

What you see is console 1 (the default).
Switching through the consoles can be done using the ALT key (press and hold) plus a function key.
ALT-F1 -> console 1
ALT-F2 -> console 2 (ATLAS monitoring if it's an ATLAS task; be patient as it will take a while until the setup completes and data becomes available)
ALT-F3 -> console 3 (output of the Linux top command)
...



If you interrupt/suspend a VM task, the command is issued by your BOINC client and passed to the VM via vboxwrapper.
VirtualBox then writes a snapshot of the complete VM RAM to disk (10200 MB in the case of an 8-core ATLAS task).
The reverse happens when the task resumes: the snapshot is read back from disk.
In case of a shutdown, all tasks must be written to disk within 1 minute, or the BOINC client treats them as hanging and kills them regardless of their state. A reboot might use shorter timeouts defined by your OS.
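To put the 1-minute limit into numbers: writing one 8-core ATLAS snapshot in time needs an average disk write rate of about 170 MB/s. The sketch below assumes all snapshots go to the same disk and must finish within the same minute; `vms` is a hypothetical parameter for several VMs being suspended at once.

```python
def required_write_rate_mb_s(snapshot_mb: float, seconds: float, vms: int = 1) -> float:
    """Average disk write rate needed to save `vms` snapshots within a deadline.

    Assumes all snapshots are written to the same disk in the same window.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return snapshot_mb * vms / seconds


# One 8-core ATLAS VM: 10200 MB of RAM written within the 1-minute limit
print(required_write_rate_mb_s(10200, 60))          # 170.0
# Two 8-core ATLAS VMs shutting down at the same time
print(required_write_rate_mb_s(10200, 60, vms=2))   # 340.0
```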

This might have happened to your long-running task.
Your logfile shows typical error messages:
2020-11-02 07:59:22 (11028): Stopping VM.
2020-11-02 07:59:22 (11028): Error in stop VM for VM: -2147221164
Command:
VBoxManage -q controlvm "boinc_761f0d6aaa869b4f" savestate
Output:
VBoxManage.exe: error: Failed to create the VirtualBox object!
VBoxManage.exe: error: Code REGDB_E_CLASSNOTREG (0x80040154) - Class not registered (extended info not available)
VBoxManage.exe: error: Most likely, the VirtualBox COM server is not running or failed to start.

2020-11-02 07:59:22 (11028): VM did not stop when requested.
2020-11-02 07:59:22 (11028): VM was successfully terminated.
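The decimal code in the vboxwrapper line and the hex code in the VBoxManage output are the same error: -2147221164 is 0x80040154 (REGDB_E_CLASSNOTREG) read as a signed 32-bit integer. A one-function Python sketch for this conversion:

```python
def as_hresult(code: int) -> str:
    """Show a signed 32-bit error code as the unsigned hex HRESULT Windows prints."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(as_hresult(-2147221164))  # 0x80040154 (REGDB_E_CLASSNOTREG)
```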



Next restart just 4 min later:
2020-11-02 08:03:15 (12936): Detected: vboxwrapper 26197
2020-11-02 08:03:15 (12936): Detected: BOINC client v7.7
2020-11-02 08:03:15 (12936): CreateProcess failed! (2).
2020-11-02 08:03:16 (12936): CreateProcess failed! (2).
2020-11-02 08:03:17 (12936): CreateProcess failed! (2).
2020-11-02 08:03:18 (12936): CreateProcess failed! (2).
2020-11-02 08:03:19 (12936): CreateProcess failed! (2).
2020-11-02 08:03:20 (12936): CreateProcess failed! (2).
2020-11-02 08:03:20 (12936): Error in version check for VM: -108
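To check a longer log for repeated failures, a minimal Python sketch can count matching lines. The line pattern is inferred from the excerpts in this post, and the number in parentheses is assumed to be the process ID of the vboxwrapper instance (it changes between restarts: 11028, 12936, 11524). `count_failures` is a hypothetical helper, not part of BOINC.

```python
import re
from collections import Counter

# Matches lines such as "2020-11-02 08:03:16 (12936): CreateProcess failed! (2)."
LINE = re.compile(r"^(\S+ \S+) \((\d+)\): (.*)$")


def count_failures(log_text: str, needle: str = "CreateProcess failed") -> Counter:
    """Count log lines whose message contains `needle`, grouped by wrapper process ID."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if m and needle in m.group(3):
            counts[m.group(2)] += 1
    return counts


sample = """\
2020-11-02 08:03:15 (12936): Detected: vboxwrapper 26197
2020-11-02 08:03:15 (12936): Detected: BOINC client v7.7
2020-11-02 08:03:15 (12936): CreateProcess failed! (2).
2020-11-02 08:03:16 (12936): CreateProcess failed! (2).
2020-11-02 08:03:17 (12936): CreateProcess failed! (2).
2020-11-02 08:03:18 (12936): CreateProcess failed! (2).
2020-11-02 08:03:19 (12936): CreateProcess failed! (2).
2020-11-02 08:03:20 (12936): CreateProcess failed! (2).
2020-11-02 08:03:20 (12936): Error in version check for VM: -108
"""
print(count_failures(sample))  # Counter({'12936': 6})
```

Passing a different `needle` (for example "Error in stop VM") reuses the same scan for the other messages quoted here.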



Same logfile:
2020-11-02 21:30:32 (11524): VM state change detected. (old = 'Running', new = 'Paused')
2020-11-02 21:30:33 (11524): VM state change detected. (old = 'Paused', new = 'Running')

I doubt your disk is fast enough to write 10 GB and reread the same amount of data within 1 second.
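The 1-second gap can be turned into an implied disk rate. The log timestamps have whole-second resolution, so the real gap could be anything below 2 seconds; even then, writing and rereading 10200 MB would need more than 10 GB/s. A sketch under those assumptions:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
paused = datetime.strptime("2020-11-02 21:30:32", FMT)   # 'Running' -> 'Paused'
resumed = datetime.strptime("2020-11-02 21:30:33", FMT)  # 'Paused' -> 'Running'
gap_s = (resumed - paused).total_seconds()

snapshot_mb = 10200  # RAM of an 8-core ATLAS VM, as stated above
# A full suspend/resume writes the snapshot and then reads it back.
implied_mb_s = 2 * snapshot_mb / gap_s
# Timestamps are whole seconds, so the true gap is below gap_s + 1.
lower_bound_mb_s = 2 * snapshot_mb / (gap_s + 1)
print(gap_s, implied_mb_s, lower_bound_mb_s)  # 1.0 20400.0 10200.0
```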

What to do?
It depends on what else is running on that computer.
General hints:
- try to find a setup that allows ATLAS to run with as few suspends as possible
- extend the period between task switches (BOINC client)
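For the second hint, the task-switch period can be set in BOINC Manager's computing preferences. If you prefer a local override file, a sketch follows; it assumes the element name `cpu_scheduling_period_minutes` and the file name `global_prefs_override.xml`, so please check both against the BOINC documentation. `build_prefs_override` is a hypothetical helper, the value 600 is only illustrative, and writing this as the whole file would replace any other local overrides you already have.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(switch_minutes: int) -> str:
    """Return global_prefs_override.xml text with a longer task-switch period.

    ASSUMPTION: `cpu_scheduling_period_minutes` is the element behind the
    "switch between tasks every X minutes" setting; verify in the BOINC docs.
    """
    if switch_minutes <= 0:
        raise ValueError("switch_minutes must be positive")
    root = ET.Element("global_preferences")
    ET.SubElement(root, "cpu_scheduling_period_minutes").text = str(switch_minutes)
    ET.indent(root, space=" ")  # Python 3.9+
    return ET.tostring(root, encoding="unicode")


print(build_prefs_override(600))
```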



©2024 CERN