Message boards : ATLAS application : issues with app config/running multiple tasks
BRG

Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30066 - Posted: 26 Apr 2017, 20:24:41 UTC

Evening all,

For the last few days BOINC has only let me run a couple of ATLAS tasks at once, rather than the maximum of 6 that I have set (normally 4, due to RAM).

I have everything set to use 100% (CPU and RAM) within BOINC, and I have checked the settings on the LHC side too: that's all at max, with jobs set to no limit. For CPUs I have tried everything from no limit to 24; it is now at 24 and it's only allowing one task.

The host has 24 cores and 32 GB RAM.

app config:

<?xml version="1.0"?>

-<app_config>


-<app>

<name>ATLAS</name>

<max_concurrent>6</max_concurrent>

</app>


-<app_version>

<app_name>ATLAS</app_name>

<avg_ncpus>2.000000</avg_ncpus>

<plan_class>vbox64_mt_mcore_atlas</plan_class>

<cmdline>--memory_size_mb 4800</cmdline>

</app_version>

</app_config>
ID: 30066
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30067 - Posted: 26 Apr 2017, 20:47:47 UTC - in response to Message 30066.  

In your preferences, also set the # of CPUs to 2 when you have 2 in your app_config.xml.
ID: 30067
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,893,032
RAC: 138,165
Message 30068 - Posted: 26 Apr 2017, 21:20:51 UTC - in response to Message 30066.  

Your hosts are hidden.
Expert users can't check your logs.
You may change your preferences and make your hosts visible.

Your app_config.xml looks strange.
Is it due to the copy/paste or are there really lines like:
<?xml version="1.0"?>

-<app_config>


-<app>



Your setting
<avg_ncpus>2.000000</avg_ncpus>

overrules the website preference "Max # of CPUs = 24" on your host, except in the server's working set size calculation, which still follows the website preference and is now 9000MB per WU.
Reduce the website preference to no more than the value that you use in your app_config.xml.

A 24-core host would be able to run 3 8-core WUs (3 x 9000MB = 27000MB).
If you configure 4-core WUs, 5800MB would be required per WU.
Running 5 of those would use 20 CPUs.
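
The arithmetic above can be sketched in Python. This is only an illustration: the function name `max_concurrent_wus` is made up for this example, the per-WU RAM figures are the ones quoted in this post, and 32 GB is assumed to mean 32768 MB with no allowance for the operating system or other projects.

```python
def max_concurrent_wus(total_cores, total_ram_mb, cores_per_wu, ram_per_wu_mb):
    """Return how many WUs fit at once, limited by both cores and RAM."""
    by_cores = total_cores // cores_per_wu
    by_ram = total_ram_mb // ram_per_wu_mb
    return min(by_cores, by_ram)


# Host from this thread: 24 cores, 32 GB RAM (assumed to be 32768 MB).
HOST_CORES = 24
HOST_RAM_MB = 32 * 1024

# (cores per WU, RAM per WU in MB) pairs quoted in the post above.
for cores, ram in [(8, 9000), (4, 5800)]:
    n = max_concurrent_wus(HOST_CORES, HOST_RAM_MB, cores, ram)
    print(f"{cores}-core WUs: {n} at once, {n * cores} CPUs, {n * ram} MB RAM")
```

In practice some RAM has to stay free for the host itself, so the real limit may be lower than what this sketch reports.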
ID: 30068
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30070 - Posted: 26 Apr 2017, 21:58:33 UTC - in response to Message 30068.  
Last modified: 26 Apr 2017, 21:58:53 UTC

In your preferences, also set the # of CPUs to 2 when you have 2 in your app_config.xml.


Now done :) thanks


I will make my computers visible now :)

It may be due to the copy and paste... hmmm, this is what it looks like via edit:

<app_config>
<app>
<name>ATLAS</name>
<max_concurrent>6</max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>2.000000</avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 4800</cmdline>
</app_version>
</app_config>

What do you advise I do? I was told to use 2 cores per work unit, with the config settings above.
ID: 30070
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30071 - Posted: 26 Apr 2017, 22:11:08 UTC

Deleted the app data file and still no change... closed and reopened BOINC etc.
ID: 30071
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30073 - Posted: 27 Apr 2017, 6:07:18 UTC - in response to Message 30071.  

Deleted the app data file and still no change... closed and reopened BOINC etc.

I suppose you still have tasks in your queue that you received before your changes.
ID: 30073
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30077 - Posted: 27 Apr 2017, 9:29:38 UTC - in response to Message 30073.  

Deleted the app data file and still no change... closed and reopened BOINC etc.

I suppose you still have tasks in your queue that you received before your changes.


I did delete them, but I left the ones that were running to finish. Back at work this morning, there is still only one running and the others say "waiting for memory".
ID: 30077
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30078 - Posted: 27 Apr 2017, 9:51:54 UTC - in response to Message 30077.  

So, I just finished 1 task and deleted 4, removed the app config file and downloaded 2 new WUs. Both have now been running for 1 minute; before, they would run for seconds and then stop. Without jumping to conclusions, it must be an app data file error?!
ID: 30078
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,893,032
RAC: 138,165
Message 30079 - Posted: 27 Apr 2017, 10:32:38 UTC

Have you ever worked through Yeti's checklist?
Fine.

Besides that, you may restart the project with conservative settings.

1. Let your local WU cache get empty
2. Reset the project in BOINC
3. Update your VirtualBox software to the most recent version
4. Reboot your host
5. Set "Max # jobs = 1" and "Max # CPUs = 1" on the LHC website
6. Create the following app_config.xml

<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config> 


7. Request a new WU from the project
8. Reload your configuration (must be done after you got the first WU and before the WU starts)
9. Check the result before you change your settings and request a new WU
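
For later steps, when the numbers change, a file like the one in step 6 could also be generated rather than edited by hand. The following Python sketch only illustrates that idea: the helper names `build_app_config` and `to_xml_text` are made up for this example, and the tag names and plan class are copied from the file above.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent, ncpus, memory_mb):
    """Build an app_config.xml tree with the same layout as in step 6."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = "ATLAS"
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.SubElement(app, "fraction_done_exact")

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = "ATLAS"
    ET.SubElement(version, "plan_class").text = "vbox64_mt_mcore_atlas"
    ET.SubElement(version, "avg_ncpus").text = f"{ncpus:.1f}"
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"

    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    return root


def to_xml_text(root):
    """Serialize the tree; pretty-print it where ET.indent is available."""
    if hasattr(ET, "indent"):  # ET.indent exists from Python 3.9 on
        ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


# Conservative values from step 6: 1 task, 1 CPU, 5000 MB.
print(to_xml_text(build_app_config(max_concurrent=1, ncpus=1, memory_mb=5000)))
```

The printed text would still have to be saved as app_config.xml in the project directory and loaded with "read config files", as described in step 8.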
ID: 30079
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30085 - Posted: 27 Apr 2017, 14:00:16 UTC - in response to Message 30079.  

I did go through his checklist last night; it was his checklist that made me check the LHC computing preferences :)

I have set it to not allow more tasks. I will complete these 2 tasks, follow your list and then post back. Thanks :)
ID: 30085
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30088 - Posted: 27 Apr 2017, 15:29:47 UTC

LHC@home: Notice from BOINC
Your app_config.xml file refers to an unknown application 'ATLAS'. Known applications: None
27/04/2017 3:59:09 PM


I had this come up, however it's gone now...

Will run this task through, pause everything and post again.
ID: 30088
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,656
RAC: 101,934
Message 30089 - Posted: 27 Apr 2017, 16:14:42 UTC - in response to Message 30088.  

Your app_config.xml file refers to an unknown application 'ATLAS'. ...

I had this come up, however it's gone now...

Well, BOINC shows this notice only once, when you go to "Options" - "read config files".

If you do this again and the notice shows up again, then something is going wrong.
ID: 30089
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,893,032
RAC: 138,165
Message 30090 - Posted: 27 Apr 2017, 16:20:36 UTC

This happens after every project reset until ATLAS (in this case) is known to your host through the first server response.
Nothing to worry about if you managed to load the app_config.xml before BOINC started the WU.
See number 8 of my list.

You may check the stderr.txt in the slots dir of the running WU.
If "Setting Memory Size for VM. (xxxxMB)" corresponds to your app_config.xml everything is fine.
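
This check can also be scripted. A minimal Python sketch, assuming the line in stderr.txt has exactly the wording quoted above; the function name `vm_memory_mb` is made up for this example.

```python
import re

# Pattern taken from the message quoted above; the exact layout of the rest
# of each log line is not assumed, only this fragment.
MEMORY_LINE = re.compile(r"Setting Memory Size for VM\. \((\d+)MB\)")


def vm_memory_mb(stderr_text):
    """Return the VM memory size in MB reported in stderr.txt, or None if absent."""
    match = MEMORY_LINE.search(stderr_text)
    return int(match.group(1)) if match else None


# Illustrative line, using the 5000 MB from the app_config.xml in my list.
sample = "Setting Memory Size for VM. (5000MB)"
print(vm_memory_mb(sample))
```

To use it on a real task, read the stderr.txt from the slot directory into a string and compare the returned value with the number after --memory_size_mb.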
ID: 30090
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30091 - Posted: 27 Apr 2017, 16:42:13 UTC - in response to Message 30090.  
Last modified: 27 Apr 2017, 16:46:56 UTC

This happens after every project reset until ATLAS (in this case) is known to your host through the first server response.
Nothing to worry about if you managed to load the app_config.xml before BOINC started the WU.
See number 8 of my list.

You may check the stderr.txt in the slots dir of the running WU.
If "Setting Memory Size for VM. (xxxxMB)" corresponds to your app_config.xml everything is fine.


All sorted now :)

EDIT: found the stderr file-

2017-04-27 16:11:08 (16424): Setting Memory Size for VM. (5000MB)

The WU is 49% complete. If I can sort out which app config file to run from now on I will try it, change whichever settings you guys recommend within LHC and see what happens :)
ID: 30091
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,893,032
RAC: 138,165
Message 30093 - Posted: 27 Apr 2017, 19:24:30 UTC

Some suggestions for possible next steps.


1. Check the logfile

After your WU is reported check the result on the LHC webserver (it includes a copy of your stderr.txt).
- The WU should be marked as "successful"
- the logfile should include lines like
Guest Log: <metadata att_name="fsize" att_value="54070367"/>
Guest Log: -rw------- 1 root root 54070367 Apr 27 14:01 HITS.10995533._009865.pool.root.1


If this is successful, go to step 2
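
The logfile check in step 1 can be sketched in Python. This assumes the HITS line always looks like the "ls -l" style example above, with the size in the fifth field and a file name starting with "HITS."; the function name `find_hits_file` is made up for this example.

```python
import re

# Matches an ls-style "Guest Log" line for a HITS file, as in the example above.
# Assumption: permissions, link count, owner, group, size, date, file name.
HITS_LINE = re.compile(
    r"Guest Log: \S+\s+\d+\s+\S+\s+\S+\s+(\d+)\s+.*\s(HITS\.\S+)"
)


def find_hits_file(log_text):
    """Return (file_name, size_in_bytes) for the first HITS file listed, else None."""
    for line in log_text.splitlines():
        match = HITS_LINE.search(line)
        if match:
            return match.group(2), int(match.group(1))
    return None


log = (
    'Guest Log: <metadata att_name="fsize" att_value="54070367"/>\n'
    "Guest Log: -rw------- 1 root root 54070367 Apr 27 14:01 "
    "HITS.10995533._009865.pool.root.1\n"
)
print(find_hits_file(log))
```

A result of None, or a size of 0, would suggest the WU did not produce a usable HITS file even if it was marked as finished.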



2. Try 1 multicore WU

Leave "Max # jobs = 1", set "Max # CPUs = 2", set <avg_ncpus>2.0</avg_ncpus> and "read config files" in your client

OR

Leave "Max # jobs = 1", set "Max # CPUs = 4", set <avg_ncpus>4.0</avg_ncpus>, set <cmdline>--memory_size_mb 6000</cmdline> and "read config files" in your client

If this is successful, go to step 3



3. Try several multicore WUs concurrently

Increase "Max # jobs" step by step either with "Max # CPUs = 2" or "Max # CPUs = 4" and set your app_config.xml accordingly.
<max_concurrent>x
<avg_ncpus>y
<cmdline>--memory_size_mb zzzz
<project_max_concurrent>x

Don't forget the "read config files" before the next WU download.
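
Before each increase it may help to total up what a given app_config.xml would ask of the host. A rough Python sketch, assuming the file has the same layout as the one posted earlier in this thread; the function name `summarize_app_config` and the example values (x = 2, y = 4, zzzz = 6000) are only illustrative.

```python
import re
import xml.etree.ElementTree as ET


def summarize_app_config(xml_text):
    """Return the worst-case CPU and RAM demand implied by an app_config.xml."""
    root = ET.fromstring(xml_text)
    max_concurrent = int(root.findtext("app/max_concurrent"))
    ncpus = float(root.findtext("app_version/avg_ncpus"))
    cmdline = root.findtext("app_version/cmdline")
    memory_mb = int(re.search(r"--memory_size_mb\s+(\d+)", cmdline).group(1))
    return {
        "tasks": max_concurrent,
        "cpus": max_concurrent * ncpus,
        "ram_mb": max_concurrent * memory_mb,
    }


example = """
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4.0</avg_ncpus>
    <cmdline>--memory_size_mb 6000</cmdline>
  </app_version>
  <project_max_concurrent>2</project_max_concurrent>
</app_config>
"""
print(summarize_app_config(example))
```

If the reported RAM total comes close to what is installed, that is the point to stop increasing "Max # jobs".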



Always check the logfiles before you go from one step to the next.


At a certain point your host will start to produce errors because of
- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- a saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)
ID: 30093
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30094 - Posted: 27 Apr 2017, 19:35:35 UTC - in response to Message 30093.  

Thanks very much :) It's got around an hour to go, so it will be a job for tomorrow, I would guess.

I will edit the app config file with the changes you described and then go from there via the steps :)

Thanks
ID: 30094
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30108 - Posted: 29 Apr 2017, 12:54:13 UTC - in response to Message 30094.  

Quick update: once this WU has finished I will start step 3 and report back, but so far, so good :)
ID: 30108
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30111 - Posted: 29 Apr 2017, 17:06:04 UTC

OK, so far, so good!

I changed <max_concurrent> and <project_max_concurrent> to 4, as with 5400 MB of RAM and 2 cores per task that's the most I can do, and it gives me a little room too!

Is there a tried and tested combination of x cores and x RAM? I was always told 2 cores and 4800 MB of RAM...

In answer to your questions computezrmle:

- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- a saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)

1. I have been; there are a few around. On the 27th I, like many, had problem WUs.
2. At the moment I only have one GPU slot running, so no risks there!
3. My connection isn't great: 1.5mb down and around 0.1mb up.
4. I'm not sure exactly what that is, so I will google it. The drive isn't very old: Samsung EVO 850, 500 GB.
5. RAM is my issue. I can go to 48 GB in total, I think; currently 32 GB is fitted.
6. Cores are not an issue currently; I do, however, need more RAM to support those cores :(
ID: 30111
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 644,680,014
RAC: 235,402
Message 30115 - Posted: 29 Apr 2017, 20:13:58 UTC

On my PC with 12 cores / 24 threads, I can max out 64GB if there are too many ATLAS tasks.

I've seen it very high on my 10-core / 20-thread machine too.

On my other PCs with more RAM I haven't seen so many concurrent ATLAS tasks.

I have set the number of concurrent tasks to 10 on the 64GB machine to see if that is a bit better, as 12 made the maxed-out one slow.
ID: 30115
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30116 - Posted: 29 Apr 2017, 21:16:29 UTC

So back to problems again... I cannot run more than 2 ATLAS tasks now, and only 3 SixTrack tasks are running... plenty of cores are free, and SixTrack isn't bothered about RAM...
ID: 30116

©2024 CERN