61)
Message boards :
Number crunching :
Max # jobs and Max # CPUs
(Message 40709)
Posted 27 Nov 2019 by Laurence Post: I have updated the scheduler. I hope it is an improvement over what we have now, even though there is still a small issue. For those who are interested, I have opened an issue on GitHub. |
62)
Message boards :
Number crunching :
Max # jobs and Max # CPUs
(Message 40682)
Posted 26 Nov 2019 by Laurence Post: As far as I understand the scheduler code, the issue is that the project preferences setting is limiting ncpus, and this value is then used to set the number of threads. We probably don't want to touch ncpus and should just set the number of threads. I will switch over to the dev project to test out some changes. Those who want to join can see me there. |
63)
Message boards :
Number crunching :
Max # jobs and Max # CPUs
(Message 40681)
Posted 26 Nov 2019 by Laurence Post: Run native if available? It is now enabled. |
64)
Message boards :
Number crunching :
Max # jobs and Max # CPUs
(Message 40653)
Posted 25 Nov 2019 by Laurence Post: Run native if available? I haven't enabled this for ATLAS yet, am waiting for the green light. |
65)
Message boards :
Number crunching :
Max # jobs and Max # CPUs
(Message 40651)
Posted 25 Nov 2019 by Laurence Post: The Max # jobs and Max # CPUs settings in the project preferences were initially added so that we could limit the number of jobs and CPUs used by a new volunteer by providing default values which could be changed later. The aim was to stop a machine maxing out on VM tasks and rendering the host unusable for anything else. The current behavior for a multi-threaded application is as follows (threads are CPUs for the VM apps):
Max 1 CPU, Max 1 Job => 1 single threaded job
Max 2 CPU, Max 1 Job => 2 threaded job
Max 1 CPU, Max 2 Job => 1 single threaded job
Max 2 CPU, Max 2 Job => 2 x 2 threaded jobs
In practice Max CPUs is used to set the number of threads, and hence CPUs, to be used by a VM. As a result, the case Max 1 CPU, Max 2 Job => 1 single threaded job does not function as expected: it should run two single CPU jobs. As far as I understand, we could remove the Max CPU setting, and nthreads could be set for the app in the app_config.xml. The two reasons given for not doing this are:
|
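The four cases listed in the message above can be reproduced with a small model. This is only a sketch: the rule `jobs = min(max_jobs, max_cpus)` is an assumption chosen because it matches the four observed cases, not something taken from the scheduler source, and both function names are hypothetical. The "expected" rule generalises the single case the post describes (Max 1 CPU, Max 2 Job should give two single CPU jobs).

```python
from __future__ import annotations


def current_behaviour(max_cpus: int, max_jobs: int) -> tuple[int, int]:
    """Return (jobs, threads_per_job) matching the four observed cases.

    Assumption: Max # CPUs sets the thread count and also caps the job count.
    """
    threads = max_cpus
    jobs = min(max_jobs, max_cpus)
    return jobs, threads


def expected_behaviour(max_cpus: int, max_jobs: int) -> tuple[int, int]:
    """Return (jobs, threads_per_job) if the two settings were independent.

    Assumption: Max # jobs alone limits the job count.
    """
    return max_jobs, max_cpus


if __name__ == "__main__":
    for cpus, jobs in [(1, 1), (2, 1), (1, 2), (2, 2)]:
        print(
            f"Max {cpus} CPU, Max {jobs} Job:",
            "current", current_behaviour(cpus, jobs),
            "expected", expected_behaviour(cpus, jobs),
        )
```

Only the Max 1 CPU, Max 2 Job case differs between the two functions, which is the case the post flags as not working as expected.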
66)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40650)
Posted 25 Nov 2019 by Laurence Post: I am going to move this discussion to the number crunching topic as it affects all apps. |
67)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40604)
Posted 22 Nov 2019 by Laurence Post: One of the reasons why Max#CPUs has been introduced was to simplify the ATLAS multicore configuration for users who didn't want to deal with an app_config.xml. This leads to a wider discussion, but essentially there are policies and the implementation of those policies. We should first understand the policy that we need, how it should be implemented, then how it can be implemented within the limitations of the existing code base.
Thanks, I didn't appreciate this subtlety.
I think the current implementation is wrong. This sets max_cpus to be effective_ncpus but you are talking about avg_ncpus. Will dig a little more into the code. |
68)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40603)
Posted 22 Nov 2019 by Laurence Post: My suggestion would be therefore to disable the Max # CPU functionality and control it with the app_config.xml. Comments? There is one big BUT. Thanks for pointing this out. I didn't appreciate the effect on the credit. |
69)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40602)
Posted 22 Nov 2019 by Laurence Post: native Theory(300.02) and native-Atlas(2.73) get only ONE Task for me. This agrees with the configuration. Both ATLAS and Theory set a limit of one task per cpu and TheoryN had two. |
70)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40592)
Posted 22 Nov 2019 by Laurence Post: This will affect all VBox apps. I will investigate how to disable Max # of CPUs for single threaded apps. So thinking about this a bit more, I think the use of Max # CPUs is a mistake. For a start, from the BOINC scheduling perspective this is threads, whereas in the vboxwrapper this parameter is interpreted as CPUs. Also, in the current implementation this affects the whole project, where you may want to define the VM size by host and project. The best way to do this is in the app_config.xml on the client, for example:
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>x</avg_ncpus>
  </app_version>
</app_config>
or
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--nthreads 7</cmdline>
  </app_version>
</app_config>
My suggestion would be therefore to disable the Max # CPU functionality and control it with the app_config.xml. Comments? |
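For anyone who prefers to generate the file rather than edit it by hand, the first variant above could be produced with a few lines of Python. This is a minimal sketch, not a project tool: `build_app_config` is a hypothetical helper, the element names are the ones from the example in the post, and `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, avg_ncpus: int) -> str:
    """Build an app_config.xml document with one app_version entry."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(avg_ncpus)
    ET.indent(root)  # pretty-print in place (Python 3.9+)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The CPU count of 4 is an arbitrary example value.
    print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas", 4))
```

Building the document through an XML library also guards against stray characters that would make a hand-edited file malformed.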
71)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40588)
Posted 22 Nov 2019 by Laurence Post: This will affect all VBox apps. I will investigate how to disable Max # of CPUs for single threaded apps. I believe the relevant line is here. I don't think we can just disable this, as it is used for ATLAS and CMS to select the number of CPUs to use for a VM. There needs to be an AND condition with something, where that something is essentially !Theory. |
72)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40578)
Posted 21 Nov 2019 by Laurence Post: I set Max jobs = 2 and Max CPUs 2 and ended up with two jobs each using 2 CPUs. Not sure this is what we want. The plan class currently contains: This will affect all VBox apps. I will investigate how to disable Max # of CPUs for single threaded apps. |
73)
Message boards :
Theory Application :
Estimated Remaining Time Well Past Scheduled Due Date
(Message 40574)
Posted 21 Nov 2019 by Laurence Post: Sherpa jobs have a reputation for being long runners, but it looks from the log that it might be finished in 2 days. I have one too at the moment, which is a bit annoying as I am testing things, so I might have to abort it. Will leave others to comment who have more experience with watching them. |
74)
Message boards :
Theory Application :
Estimated Remaining Time Well Past Scheduled Due Date
(Message 40571)
Posted 21 Nov 2019 by Laurence Post: The first rig had 75 folders (most empty) in /slots/ and the first runRivet.log I found has 6,022 lines in it. I scrolled through them and see nothing that tells me anything. Run tail -f on that file and check that lines are being written, that they are different, and that the program looks like it is moving forward. The first line of that file will say what job it is. Post the first line and the last 10 lines here. |
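If pulling the first line and the last 10 lines out of a long log by hand is awkward, something like the following would do it. This is a minimal sketch: `head_and_tail` is a hypothetical helper, and the path in the usage section assumes the log sits in the current directory, so substitute the real slot directory. It does not follow the file the way tail -f does.

```python
from __future__ import annotations

from collections import deque


def head_and_tail(path: str, tail_lines: int = 10) -> tuple[str, list[str]]:
    """Return the first line and the last `tail_lines` lines of a text file."""
    first = ""
    last: deque[str] = deque(maxlen=tail_lines)  # keeps only the newest lines
    with open(path, "r", errors="replace") as log:
        for index, line in enumerate(log):
            text = line.rstrip("\n")
            if index == 0:
                first = text
            last.append(text)
    return first, list(last)


if __name__ == "__main__":
    # Hypothetical location: replace with the runRivet.log in your slot directory.
    first_line, last_lines = head_and_tail("runRivet.log")
    print(first_line)
    print("\n".join(last_lines))
```

Running it twice a few minutes apart and comparing the last lines gives a rough check that the job is still writing new output.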
75)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40570)
Posted 21 Nov 2019 by Laurence Post: I have removed the multi-threading values from the plan class. It should now always run as a single CPU job. |
76)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40564)
Posted 21 Nov 2019 by Laurence Post: The Max # CPUs is limiting the Max # of jobs. I set Max jobs = 2 and Max CPUs 2 and ended up with two jobs each using 2 CPUs. Not sure this is what we want. The plan class currently contains: <min_ncpus>1</min_ncpus> <max_threads>2</max_threads> As far as I understand the Theory app does use two threads but is there any advantage of giving two CPUs? |
77)
Message boards :
Theory Application :
Estimated Remaining Time Well Past Scheduled Due Date
(Message 40563)
Posted 21 Nov 2019 by Laurence Post: I have a dozen or so nT 1.01 WUs that are running over 2 days. The CPU usage is jumping around in the 40-60%. Will these ever converge on a solution or should I Abort them??? You can take a look at the runRivet.log in the slot directory to see what the job is doing. |
78)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40546)
Posted 20 Nov 2019 by Laurence Post: Found this thread, is it this pref: This is in the config.xml and is for the whole project. It is currently set to 50. |
79)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40543)
Posted 19 Nov 2019 by Laurence Post: Need to find out why njobs isn't as expected. It could be something trivial such as total_limit not being defined so defaulting to 1. I think njobs is the number of tasks being returned so probably not what we are looking for. total_limit is now set to 10. Let's see how far we get and if we can understand what is going on. |
80)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40538)
Posted 19 Nov 2019 by Laurence Post:
The parse bool function suggests both will result in true. I tried the self-closing tag first before making it more explicit. This assignment also suggests it is working, judging from the log output I was getting:
[quota] Limits for Theory:
[quota] CPU: base 1 scaled 7 njobs 0
The base and scaled values seem correct for my host with 4 ncpus. Need to find out why njobs isn't as expected. It could be something trivial such as total_limit not being defined so defaulting to 1. |
©2024 CERN