Message boards : Number crunching : Issues Controlling Number of Threads Used

Profile RueiKe

Joined: 28 Mar 16
Posts: 6
Credit: 26,244,624
RAC: 0
Message 42140 - Posted: 12 Apr 2020, 11:11:54 UTC

I am having trouble controlling the number of tasks running on my machines. In the past, I only had LHC download a limited number of tasks and run them whenever available, so I had not experienced problems in the past few years. Now I am ramping up LHC on all of my systems as the primary project. My experience with loading has varied by system type:

    1) Intel 10-core/20-thread, running Windows and the latest client; it loads as the options->computing preferences->usage limits setting specifies. No issues.
    2) Two Threadripper 1950Xs (32 threads) on Windows; these seem to max out loading (more than the number of threads available), ignoring options->computing preferences->usage limits. These are running 17.4.2 boincmgr.
    3) Two Threadripper 2990WXs (32 threads) on Linux; these only run 16 threads each, no matter what I set options->computing preferences->usage limits to. I am using TBar's 7.8.3 build of boincmgr from previous SETI work.
    4) An EPYC 7702P on Linux seems to follow options->computing preferences->usage limits. It is also running TBar's build, the same as my other two Linux systems.



Am I missing a configuration setting? Is there a known issue with older versions of boincmgr?

I am also working on other issues: CMS tasks erroring out and Theory tasks erroring out after several days. For now I am trying to focus on just getting ATLAS running and will troubleshoot the others later.

ID: 42140
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 42141 - Posted: 12 Apr 2020, 12:26:10 UTC - in response to Message 42140.  
Last modified: 12 Apr 2020, 12:36:27 UTC

I don't have any experience with TBar's build, but I would suggest using the recommended client from https://boinc.berkeley.edu/download.php even if your current one works right now. Changes made in the BOINC Manager create local override settings that apply only to that host and ignore later changes from the web preferences. Also check the amount of disk space and RAM that is needed, as these LHC tasks work in a VM environment using VirtualBox, or use CVMFS. It may not be a problem with a low number of tasks running, but when you scale up to run LHC only, a host with low disk space or low RAM will hit a limit.


Each task can use from 1 to 8 cores according to what is set for "Max # CPUs" in the LHC@home project preferences. By default this is 1, so if you want to run multi-core tasks you should change this setting. You can also change the number of cores to use and other settings with an app_config.xml file (recommended for experienced volunteers only); a sketch is shown below.
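
As an illustration only, a minimal app_config.xml for the ATLAS vbox application might look like the sketch below. The app name "ATLAS" and the plan class are assumptions on my part; check client_state.xml in your BOINC data directory for the exact strings your client reports before using them.

    <app_config>
      <app>
        <!-- app name as reported in client_state.xml; "ATLAS" is an assumption -->
        <name>ATLAS</name>
        <!-- run at most 4 ATLAS tasks at the same time -->
        <max_concurrent>4</max_concurrent>
      </app>
      <app_version>
        <app_name>ATLAS</app_name>
        <!-- plan class of the vbox multi-core version; verify in client_state.xml -->
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <!-- threads per task -->
        <avg_ncpus>4</avg_ncpus>
      </app_version>
    </app_config>

This would run at most 4 ATLAS tasks at a time with 4 threads each. The file goes into the project directory (typically projects/lhcathome.cern.ch_lhcathome/) and is read after "Options -> Read config files" in the BOINC Manager or after a client restart.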

The events to process are split among the available cores, so normally each core processes (number of events in the task) / (number of cores). The processes share memory, which means a multi-core task uses less total memory than running the same number of single-core tasks. The memory allocated to the virtual machine is calculated from the number of cores with the formula: 3 GB + 0.9 GB * ncores.
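For example, a 4-core ATLAS task is allocated 3 GB + 0.9 GB * 4 = 6.6 GB of RAM, whereas four single-core tasks would need 4 * (3 GB + 0.9 GB) = 15.6 GB in total.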

It is recommended to have more than 4GB memory available to run ATLAS tasks. Even single-core tasks are not practical to run with 4GB or less.

Console 3 shows the processes currently running. A healthy WU (after the initialisation phase) should have N athena.py processes using close to 100% CPU, where N is the number of cores. You will sometimes see an extra athena.py process, but this is the "master" process which controls the "child" processes doing the actual simulation.

Read here for ATLAS: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4178#29560
Theory is more lightweight and requires only 630 MB, but ATLAS and CMS are more demanding.

So looking at your hosts: the EPYC would be perfect and the 1950X (64 GB) would do great, but the 2990WX (32 GB) will suffer on RAM; that host cannot run enough tasks to reach full CPU usage.

The issue you experience with Theory is normal. These tasks can run from a few minutes up to several days and can end with an error because the VM aborts jobs that cannot be finished in a reasonable time. For a higher success rate the native application with CVMFS is suggested.
CMS is complex and very heavy on the system; failures are mostly related to network issues, but there can be other causes as well. Follow the forum daily to get information from the project and minimize failures.
Check the error code in the stderr log and search the forum to see whether other users have a solution. In some cases you can get more information about an error if you add the VirtualBox extension pack and open a console session while the task is running.
You open the session from the BOINC Manager and a terminal pops up. Use Alt+F2 to reach the tty of the session that does the job; you reach top with Alt+F3.

You would be able to install CVMFS on the systems that run Ubuntu 18.04 and lower the disk space and RAM needed on those systems.

Downloads https://cernvm.cern.ch/portal/filesystem/downloads
Setup https://cernvm.cern.ch/portal/filesystem/quickstart
Setup config https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4758 (Recommended. ATLAS works by default with the config from cern.ch, but it works better with the info from this thread; a sketch of a local client config follows below.)
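
Purely as an illustration of what such a local client configuration looks like, /etc/cvmfs/default.local could contain something like the lines below; the repository list, proxy and cache size are assumptions for a standalone home machine, so take the real values from the thread linked above.

    # repositories to mount (assumed list for ATLAS; see the thread above)
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch
    # no local squid proxy, talk to the servers directly
    CVMFS_HTTP_PROXY=DIRECT
    # local cache size in MB (adjust to available disk)
    CVMFS_QUOTA_LIMIT=4096

After editing the file, run cvmfs_config setup and then cvmfs_config probe; probe should report OK for each repository.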

Edit: The system also needs squashfs-tools and Python. Singularity (the container for ATLAS) and runc (the container for Theory) do not need to be installed; Singularity is provided as an option now and it is recommended not to install it on the system.
ID: 42141
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,071,893
RAC: 126,759
Message 42142 - Posted: 12 Apr 2020, 12:54:38 UTC - in response to Message 42140.  

Some additions to what Gunde wrote.

1) Intel 10core 20 thread, running windows and latest client loads as the options->computing-preferences-usage_limits specifies. No issues

There is a major issue:
VBoxManage.exe: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)

Looks like another hypervisor is active alongside VirtualBox.
You may go through Yeti's checklist to see what should be done.


2) Two Threadripper 1950Xs (32 threads) in Windows, seems to max out loading (> number of threads available), ignoring options->computing-preferences-usage_limits. These are running 17.4.2 boincmgr.

Both are running ATLAS tasks that succeed:
2020-04-12 18:28:11 (3080): Guest Log: HITS file was successfully produced

Theory tasks also succeed.
2020-04-12 19:34:10 (11976): Guest Log: 19:34:08 CST +08:00 2020-04-12: cranky: [INFO] Container 'runc' finished with status code 0.



3) Two Threadripper 2990WXs (32 threads) in Linux, only run 16 threads each

Both are running ATLAS tasks that succeed but do not have enough RAM to run more than 4 ATLAS tasks (vbox!) concurrently:
4 tasks, each configured to run on 4 cores (= 6600 MB RAM each), require 16 cores and 26400 MB.

CMS tasks erroring out

See:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5339&postid=41865
and also the posts following that one.
ID: 42142
Profile RueiKe

Joined: 28 Mar 16
Posts: 6
Credit: 26,244,624
RAC: 0
Message 42144 - Posted: 12 Apr 2020, 14:59:08 UTC

Thanks @computezrmle and @Gunde for the helpful feedback.

I was expecting to see a waiting-for-memory error on the 2990WX systems, so I had assumed they were not attempting to run more than 16 threads for some other reason. It makes sense that it limits the number of threads based on memory. I wonder if the cost of memory has come down; I will get a quote.

For the Intel system, I had fixed the virtualization setting in the BIOS by the time of my post. The system had been running LHC for a while, so I wrongly assumed it was already enabled. Once I enabled it, I observed no issues.

Not sure why I saw one of the 1950X systems running more threads than the system has. Actually, the sum of the MT allocation for the ATLAS tasks plus the other tasks was greater than the number of threads, but maybe the number actually in use was lower. When in this state, it would sporadically stop tasks and indicate it was waiting for memory. It seems OK now; perhaps an app_config.xml is needed here.

I did have a VM become corrupt on the EPYC system. Not sure why, but perhaps an upgrade to VirtualBox is needed. What is the best version people are using on Ubuntu?

It seems tricky to make sure systems don't download CMS. I thought I had set all of my systems to only download ATLAS and SixTrack, but one system is still downloading Theory.

One more question, if I set MT to 4 CPUs, will it select cores based on NUMA nodes?
ID: 42144
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 42146 - Posted: 12 Apr 2020, 16:10:42 UTC - in response to Message 42144.  

Not sure why I saw one of the 1950X systems running more threads than the system has. Actually, the sum of the MT allocation for the ATLAS tasks plus the other tasks was greater than the number of threads, but maybe the number actually in use was lower. When in this state, it would sporadically stop tasks and indicate it was waiting for memory. It seems OK now; perhaps an app_config.xml is needed here.


I have experienced this too with VirtualBox tasks that are MT. It mostly corrects itself after a few minutes once the VMs have started up and are running jobs. In this stage a task could break, but only because of the fast switching on and off.
ONLY use an app_config.xml if you are sure you need to change how BOINC handles the application. It used to be common to limit RAM or threads with it, but that is not needed today; the defaults are mostly the best and effective enough. I use 4 threads for ATLAS, but the setting in the project preferences would do the same. The only thing I still need it for is to set the maximum number of tasks running concurrently when I mix contribution with other projects; see the sketch below.
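
For illustration, limiting only the number of concurrent ATLAS tasks (without overriding CPUs or RAM) could be done with a minimal app_config.xml like the one below; the app name is again an assumption to verify against client_state.xml.

    <app_config>
      <app>
        <name>ATLAS</name>
        <!-- no more than 2 ATLAS tasks at the same time -->
        <max_concurrent>2</max_concurrent>
      </app>
    </app_config>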

I did have a VM become corrupt on the EPYC system. Not sure why, but perhaps an upgrade to VirtualBox is needed. What is the best version people are using on Ubuntu?


Your EPYC system looks great and VirtualBox 5.2.34 is solid enough for success; it would not do better than it does now (Valid (804) · Invalid (0) · Error (1)). If you update you may run into other issues.
My experience on big systems is that later versions do not necessarily handle many VMs running concurrently any better, so stay safe and don't update if it works.
Both Linux and Windows have suffered from this.

The BOINC client handles normal operation well as long as it does not get a bulk of tasks starting and stopping at once. This could happen in your case when it corrects itself while allocating threads, or when BOINC reaches the high-priority stage to hold a deadline.
The more you mix applications/projects with different deadlines and resource shares, the more likely this causes issues at some point, though it is not always a problem. When I used an older version of VirtualBox on Ubuntu it could handle around 30 VMs starting/stopping concurrently, but above that the system crashed, the boinc-client panicked and all tasks got corrupted.

One more question, if I set MT to 4 CPUs, will it select cores based on NUMA nodes?


I have little knowledge of this; I only tested on Windows Server 2016 to try to learn about NUMA and its effect on BOINC, and failed to get any real understanding. I have not seen any way to select cores by NUMA node or to divide the load between NUMA or UMA, only experimental settings for this. As I understand it, the BOINC client only sees the system info as a total number of cores or HT/SMT threads available and a total amount of RAM.
This is probably more specific to the system, to how the kernel schedules the workload, and to how the microcode handles it under a normal load.
ID: 42146
