61) Message boards : Number crunching : Max # jobs and Max # CPUs (Message 40709)
Posted 27 Nov 2019 by Profile Laurence
Post:
I have updated the scheduler. I hope it is an improvement over what we have now, even though there is still a small issue. For those who are interested, I have opened an issue on GitHub.
62) Message boards : Number crunching : Max # jobs and Max # CPUs (Message 40682)
Posted 26 Nov 2019 by Profile Laurence
Post:
As far as I understand the scheduler code, the issue is that the project preferences setting limits ncpus, and this value is then used to set the number of threads. We probably don't want to touch ncpus and should just set the number of threads. I will switch over to the dev project to test out some changes. Those who want to join can find me there.
63) Message boards : Number crunching : Max # jobs and Max # CPUs (Message 40681)
Posted 26 Nov 2019 by Profile Laurence
Post:
Run native if available?
is now in the ATLAS prefs. Do we need to activate it now?
Is the Beta pref obsolete?

I haven't enabled this for ATLAS yet; I am waiting for the green light.

It is now enabled.
64) Message boards : Number crunching : Max # jobs and Max # CPUs (Message 40653)
Posted 25 Nov 2019 by Profile Laurence
Post:
Run native if available?
is now in the ATLAS prefs. Do we need to activate it now?
Is the Beta pref obsolete?

I haven't enabled this for ATLAS yet; I am waiting for the green light.
65) Message boards : Number crunching : Max # jobs and Max # CPUs (Message 40651)
Posted 25 Nov 2019 by Profile Laurence
Post:
The Max # jobs and Max # CPUs settings in the project preferences were initially added so that we could limit the number of jobs and CPUs used by a new volunteer, by providing default values which could be changed later. The aim was to stop a machine maxing out on VM tasks and rendering the host unusable for anything else. The current behavior for a multi-threaded application is as follows (threads are CPUs for the VM apps):
Max 1 CPU, Max 1 Job => 1 single threaded job
Max 2 CPU, Max 1 Job => 2 threaded job 
Max 1 CPU, Max 2 Job => 1 single threaded job
Max 2 CPU, Max 2 Job => 2 x 2 threaded jobs

In practice, Max CPUs is used to set the number of threads, and hence CPUs, used by a VM, so the case Max 1 CPU, Max 2 Jobs => 1 single threaded job does not function as expected; it should run two single-CPU jobs. As far as I understand, we could remove the Max CPU setting and nthreads could instead be set for the app in app_config.xml. The two reasons given for not doing this are:

  • As nthreads is set after the job has been sent out via the scheduler, this value is not taken into consideration when assigning credit
  • Prefer to set it via the Web page rather than editing XML


Since the recent changes to Theory, this mainly affects the ATLAS application, as setting Max # CPUs is only relevant there.

Further comments welcome.

66) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40650)
Posted 25 Nov 2019 by Profile Laurence
Post:
I am going to move this discussion to the number crunching topic as it affects all apps.
67) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40604)
Posted 22 Nov 2019 by Profile Laurence
Post:
One of the reasons why Max#CPUs was introduced was to simplify the ATLAS multicore configuration for users who didn't want to deal with an app_config.xml.

This leads to a wider discussion, but essentially there are policies and the implementation of those policies. We should first understand the policy that we need, then how it should ideally be implemented, and finally how it can be implemented within the limitations of the existing code base.

When a workunit is sent to the client via the scheduler reply, it includes an <app_version> section, and within this section <avg_ncpus> is set.
The client copies this <avg_ncpus> value to its client_state.xml but overwrites it with the value from app_config.xml.
From then on the client, as well as the vboxwrapper, uses whichever value was set last.

Unfortunately <avg_ncpus> is never reported back to the server via the scheduler request.
It is the user's responsibility to keep the values in sync.

If server and client are not in sync this affects credit calculation as well as the amount of work the server will send in future requests (at least until the FLOPS value is adjusted).

Thanks, I didn't appreciate this subtlety.
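For anyone following along, a rough sketch of where that value ends up on the client, i.e. the <app_version> block in client_state.xml (fields abridged and the numbers purely illustrative, assuming an app_config.xml that overrides avg_ncpus to 4):

<app_version>
   <app_name>ATLAS</app_name>
   <plan_class>vbox64_mt_mcore_atlas</plan_class>
   <!-- originally taken from the scheduler reply, then overwritten from app_config.xml;
        the server never learns about the override -->
   <avg_ncpus>4.000000</avg_ncpus>
</app_version>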


If Max#CPUs is deactivated, every workunit will be treated as single-core on the server, even ATLAS.
That is not a problem for the other apps, since all of them are now single-core, but ATLAS may require a reworked default policy, e.g. a single-core default combined with a new method to identify whether the client runs tasks with n threads.

I think the current implementation is wrong. This sets max_cpus to be effective_ncpus, but you are talking about avg_ncpus. I will dig a little more into the code.
68) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40603)
Posted 22 Nov 2019 by Profile Laurence
Post:
My suggestion would be therefore to disable the Max # CPU functionality and control it with the app_config.xml. Comments?
There is one big BUT.
When the server only sees a single core, BOINC will calculate and grant credit based on the elapsed time multiplied by the reported GFLOPS.
When a user sets up VMs as dual-core, quad-core or whatever via app_config.xml, the elapsed time will drop and their credit will be significantly lower.
A lot of crunchers will not appreciate that.
We know that only ATLAS really benefits from multi-core. As long as ATLAS tasks are fairly equal, one could change to fixed credit per task for ATLAS.

Thanks for pointing this out. I didn't appreciate the effect on the credit.
69) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40602)
Posted 22 Nov 2019 by Profile Laurence
Post:
Native Theory (300.02) and native ATLAS (2.73) get only ONE task for me.
Native Theory (1.01) got two tasks, with always ONE CPU in use.
I had changed nothing in prefs or app_config.

This agrees with the configuration. Both ATLAS and Theory set a limit of one task per CPU, and TheoryN had two.
70) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40592)
Posted 22 Nov 2019 by Profile Laurence
Post:
This will affect all VBox apps. I will investigate how to disable Max # of CPUs for single threaded apps.

I believe the relevant line is here. I don't think we can just disable this, as it is used for ATLAS and CMS to select the number of CPUs to use for a VM, so there needs to be an AND condition with something, where that something is essentially !Theory.


So, thinking about this a bit more, I think the use of Max # CPUs is a mistake. For a start, from the BOINC scheduling perspective this value is threads, while in the vboxwrapper it is interpreted as CPUs. Also, in the current implementation it affects the whole project, whereas you may want to define the VM size by host and project. The best way to do this is in the app_config.xml on the client, for example:

<app_config>
   <app_version>
       <app_name>ATLAS</app_name>
       <plan_class>vbox64_mt_mcore_atlas</plan_class>
       <avg_ncpus>x</avg_ncpus>
   </app_version>
</app_config>

or
<app_config>
   <app_version>
       <app_name>ATLAS</app_name>
       <plan_class>vbox64_mt_mcore_atlas</plan_class>
       <cmdline>--nthreads 7</cmdline>
   </app_version>
</app_config>


My suggestion would be therefore to disable the Max # CPU functionality and control it with the app_config.xml. Comments?
71) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40588)
Posted 22 Nov 2019 by Profile Laurence
Post:
This will affect all VBox apps. I will investigate how to disable Max # of CPUs for single threaded apps.

I believe the relevant line is here. I don't think we can just disable this, as it is used for ATLAS and CMS to select the number of CPUs to use for a VM, so there needs to be an AND condition with something, where that something is essentially !Theory.
72) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40578)
Posted 21 Nov 2019 by Profile Laurence
Post:
I set Max jobs = 2 and Max CPUs = 2 and ended up with two jobs, each using 2 CPUs. I am not sure this is what we want. The plan class currently contains:
    <min_ncpus>1</min_ncpus>
    <max_threads>2</max_threads>

As far as I understand, the Theory app does use two threads, but is there any advantage in giving it two CPUs?

This is how the limits now work when requesting Theory tasks:

Max 1 task / thread
Max # of CPUs
Max # of jobs

Since the tasks will run single-core, it's best to set 'No limit' for Max # of CPUs to avoid getting fewer tasks than you expect.
If you want fewer tasks than the number of threads, set that lower value as Max # of tasks.

This will affect all VBox apps. I will investigate how to disable Max # of CPUs for single threaded apps.
73) Message boards : Theory Application : Estimated Remaining Time Well Past Scheduled Due Date (Message 40574)
Posted 21 Nov 2019 by Profile Laurence
Post:
Sherpa jobs have a reputation for being long runners, but from the log it looks like it might finish in 2 days. I have one too at the moment, which is a bit annoying as I am testing things, so I might have to abort it. I will leave it to others with more experience of watching them to comment.
74) Message boards : Theory Application : Estimated Remaining Time Well Past Scheduled Due Date (Message 40571)
Posted 21 Nov 2019 by Profile Laurence
Post:
The first rig had 75 folders (most empty) in /slots/, and the first runRivet.log I found has 6,022 lines in it. I scrolled through them and saw nothing that tells me anything.

Run tail -f on that file and check that lines are still being written, that they are different, and that it looks like the program is moving forward. The first line of that file will say what job it is. Post the first line and the last 10 lines here.
75) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40570)
Posted 21 Nov 2019 by Profile Laurence
Post:
I have removed the multi-threading values from the plan class. It should now always run as a single-CPU task.
76) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40564)
Posted 21 Nov 2019 by Profile Laurence
Post:
The Max # CPUs setting is limiting the Max # of jobs.
Max CPUs must now be equal to or higher than Max jobs, otherwise you don't get the number of tasks you want or that your buffer can hold.

E.g. Max # jobs 3
Max # CPUs 2

I only get 2 tasks. When I set Max # CPUs to 3 or higher, I get 3 tasks.

Max # CPUs should have no influence on the number of tasks.

Btw: for Theory I would remove the Max # of CPUs and run only single-core Theory tasks.
At the moment a higher number of CPUs is only useful for the ATLAS application.

I set Max jobs = 2 and Max CPUs = 2 and ended up with two jobs, each using 2 CPUs. I am not sure this is what we want. The plan class currently contains:
    <min_ncpus>1</min_ncpus>
    <max_threads>2</max_threads>

As far as I understand, the Theory app does use two threads, but is there any advantage in giving it two CPUs?
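(For context, those two values come from the project's plan class definition. A minimal sketch of what such an entry in plan_class_spec.xml might look like; the class name and the other fields are illustrative, not the project's actual configuration:)

<plan_class>
   <name>vbox64_theory</name>   <!-- illustrative name -->
   <virtualbox/>
   <is64bit/>
   <min_ncpus>1</min_ncpus>
   <max_threads>2</max_threads>
</plan_class>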
77) Message boards : Theory Application : Estimated Remaining Time Well Past Scheduled Due Date (Message 40563)
Posted 21 Nov 2019 by Profile Laurence
Post:
I have a dozen or so nT 1.01 WUs that have been running for over 2 days. The CPU usage is jumping around in the 40-60% range. Will these ever converge on a solution, or should I abort them???


You can take a look at the runRivet.log in the slot directory to see what the job is doing.
78) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40546)
Posted 20 Nov 2019 by Profile Laurence
Post:
Found this thread, is it this pref:
<max_wus_in_progress> N </max_wus_in_progress>
https://boinc.berkeley.edu/forum_thread.php?id=12588


This is in the config.xml and is for the whole project. It is currently set to 50.
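For reference, that option sits at the top level of the project's config.xml. A minimal sketch, with the surrounding options omitted:

<boinc>
   <config>
      ...
      <max_wus_in_progress>50</max_wus_in_progress>
      ...
   </config>
</boinc>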
79) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40543)
Posted 19 Nov 2019 by Profile Laurence
Post:
Need to find out why njobs isn't as expected. It could be something trivial, such as total_limit not being defined and so defaulting to 1.

I think njobs is the number of tasks being returned, so probably not what we are looking for. total_limit is now set to 10. Let's see how far we get and whether we can understand what is going on.
80) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40538)
Posted 19 Nov 2019 by Profile Laurence
Post:

It looks like <per_proc/> is only used as a self-closing tag (when present) and not with a number in between like yours: <per_proc>1</per_proc>.

The parse bool function suggests both will result in true. I tried the self-closing tag first before making it more explicit. This assignment also suggests it is working, judging from the log output I was getting:
[quota] Limits for Theory:
[quota] CPU: base 1 scaled 7 njobs 0

The base and scaled values seem correct for my host with 4 ncpus. Need to find out why njobs isn't as expected. It could be something trivial, such as total_limit not being defined and so defaulting to 1.
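For anyone trying to picture the configuration being discussed, here is a hedged sketch of the kind of <max_jobs_in_progress> block in config.xml that these limits live in; the app name and numbers are illustrative only:

<max_jobs_in_progress>
   <project>
      <total_limit>
         <jobs>10</jobs>   <!-- the total_limit mentioned above -->
      </total_limit>
   </project>
   <app>
      <app_name>Theory</app_name>   <!-- illustrative app name -->
      <cpu_limit>
         <jobs>1</jobs>
         <per_proc/>   <!-- scales the base limit by the host's processor count -->
      </cpu_limit>
   </app>
</max_jobs_in_progress>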



