issues with app config/running multiple tasks



Message boards : ATLAS application : issues with app config/running multiple tasks

Author Message
BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30066 - Posted: 26 Apr 2017, 20:24:41 UTC

Evening all,

Over the last few days BOINC has only allowed me to run a couple of ATLAS tasks at once, rather than the maximum set of 6 (normally 4 in practice, due to RAM)...

I have everything set to use 100% (CPU and RAM) within BOINC, and I've checked the settings on the LHC side too; that's all at max, jobs set to no limit. For CPUs I've tried everything from no limit down to 24; it's now at 24 and it's only allowing one task.

24 cores and 32 GB RAM.

app config:

<?xml version="1.0"?>

-<app_config>


-<app>

<name>ATLAS</name>

<max_concurrent>6</max_concurrent>

</app>


-<app_version>

<app_name>ATLAS</app_name>

<avg_ncpus>2.000000</avg_ncpus>

<plan_class>vbox64_mt_mcore_atlas</plan_class>

<cmdline>--memory_size_mb 4800</cmdline>

</app_version>

</app_config>

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 30067 - Posted: 26 Apr 2017, 20:47:47 UTC - in response to Message 30066.

In your preferences, also set the # of CPUs to 2, since you have 2 in your app_config.xml.

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30068 - Posted: 26 Apr 2017, 21:20:51 UTC - in response to Message 30066.

Your hosts are hidden, so expert users can't check your logs.
You may change your preferences and make your hosts visible.

Your app_config.xml looks strange.
Is that due to the copy/paste, or are there really lines like:

<?xml version="1.0"?>

-<app_config>


-<app>



Your setting
<avg_ncpus>2.000000</avg_ncpus>

overrules the website preference "Max # of CPUs = 24", except for the server's working-set size calculation, which is now 9000 MB per WU.
Reduce the website preference to no more than the value you use in your app_config.xml.

A 24-core host would be able to run 3 8-core WUs (3 × 9000 MB = 27000 MB).
If you configure 4-core WUs, 5800 MB would be required per WU.
This would use 20 CPUs.
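As a side note, the two working-set figures quoted above (8-core → 9000 MB, 4-core → 5800 MB) are consistent with a simple linear per-core estimate. The sketch below is just the line through those two data points, not an official LHC@home formula:

```python
# Hedged sketch: a linear working-set estimate fitted to the two figures
# quoted above (8 cores -> 9000 MB, 4 cores -> 5800 MB). This is NOT an
# official server rule, just the straight line through those two points.

def atlas_working_set_mb(ncores: int) -> int:
    """Estimated per-WU working set: 2600 MB base + 800 MB per core."""
    base_mb = 2600       # intercept fitted to the quoted data points
    per_core_mb = 800    # slope fitted to the same points
    return base_mb + per_core_mb * ncores

if __name__ == "__main__":
    print(atlas_working_set_mb(8))  # 9000, matching the 8-core figure
    print(atlas_working_set_mb(4))  # 5800, matching the 4-core figure
```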

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30070 - Posted: 26 Apr 2017, 21:58:33 UTC - in response to Message 30068.
Last modified: 26 Apr 2017, 21:58:53 UTC

In your preferences, also set the # of CPUs to 2, since you have 2 in your app_config.xml.


Now done :) thanks


Your hosts are hidden, so expert users can't check your logs.
You may change your preferences and make your hosts visible.

Your app_config.xml looks strange.
Is that due to the copy/paste, or are there really lines like:
<?xml version="1.0"?>

-<app_config>


-<app>



Your setting
<avg_ncpus>2.000000</avg_ncpus>

overrules the website preference "Max # of CPUs = 24", except for the server's working-set size calculation, which is now 9000 MB per WU.
Reduce the website preference to no more than the value you use in your app_config.xml.

A 24-core host would be able to run 3 8-core WUs (3 × 9000 MB = 27000 MB).
If you configure 4-core WUs, 5800 MB would be required per WU.
This would use 20 CPUs.


I will allow computers to show now :)

It may be due to the copy and paste... hmm, this is via the editor:

<app_config>
<app>
<name>ATLAS</name>
<max_concurrent>6</max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>2.000000</avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 4800</cmdline>
</app_version>
</app_config>

What do you advise I do? I was told to use 2 cores per work unit, with the config settings above.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30071 - Posted: 26 Apr 2017, 22:11:08 UTC

Deleted the app data file and still no change... closed and reopened BOINC, etc.

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 30073 - Posted: 27 Apr 2017, 6:07:18 UTC - in response to Message 30071.

Deleted the app data file and still no change... closed and reopened BOINC, etc.

I suppose you still have tasks in your queue that you got before your changes.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30077 - Posted: 27 Apr 2017, 9:29:38 UTC - in response to Message 30073.

Deleted the app data file and still no change... closed and reopened BOINC, etc.

I suppose you still have tasks in your queue that you got before your changes.


I did delete them, but I let the ones that were running finish. Back at work this morning, still only one running and the others saying "waiting for memory".

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30078 - Posted: 27 Apr 2017, 9:51:54 UTC - in response to Message 30077.

So, I just finished 1 task and deleted 4... removed the app_config file and just downloaded 2 WUs, both now running for 1 minute; before, they would run for seconds and then stop... without jumping to conclusions, it must be an app_config file error?!

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30079 - Posted: 27 Apr 2017, 10:32:38 UTC

Have you ever worked through Yeti's checklist?
Fine.

Besides that, you may restart the project with conservative settings.

1. Let your local WU cache get empty
2. Reset the project in BOINC
3. Update your VirtualBox software to the most recent version
4. Reboot your host
5. Set "Max # jobs = 1" and "Max # CPUs = 1" on the LHC website
6. Create the following app_config.xml

<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>


7. Request a new WU from the project
8. Reload your configuration (must be done after you got the first WU and before the WU starts)
9. Check the result before you change your settings and request a new WU

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30085 - Posted: 27 Apr 2017, 14:00:16 UTC - in response to Message 30079.

Have you ever worked through Yeti's checklist?
Fine.

Besides that, you may restart the project with conservative settings.

1. Let your local WU cache get empty
2. Reset the project in BOINC
3. Update your VirtualBox software to the most recent version
4. Reboot your host
5. Set "Max # jobs = 1" and "Max # CPUs = 1" on the LHC website
6. Create the following app_config.xml

<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>


7. Request a new WU from the project
8. Reload your configuration (must be done after you got the first WU and before the WU starts)
9. Check the result before you change your settings and request a new WU


I did go through his checklist last night; it was his checklist that made me check the preferences within LHC computing preferences :)

I have set it to not allow more tasks and will complete these 2 tasks... then follow your list and post back. Thanks :)

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30088 - Posted: 27 Apr 2017, 15:29:47 UTC

LHC@home: Notice from BOINC
Your app_config.xml file refers to an unknown application 'ATLAS'. Known applications: None
27/04/2017 3:59:09 PM


Had this come up, however it's gone now...

Will run this task through, pause everything and post again.

Erich56
Send message
Joined: 18 Dec 15
Posts: 383
Credit: 3,873,774
RAC: 7,567
Message 30089 - Posted: 27 Apr 2017, 16:14:42 UTC - in response to Message 30088.

Your app_config.xml file refers to an unknown application 'ATLAS'. ...

Had this come up, however its gone now...

Well, BOINC shows this notice only once, when you go to "Options" - "Read config files".

If you repeat this and the notice shows up again, then something is going wrong.

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30090 - Posted: 27 Apr 2017, 16:20:36 UTC

This happens after every project reset until ATLAS (in this case) is known to your host through the first server response.
Nothing to worry about if you managed to load the app_config.xml before BOINC started the WU.
See number 8 of my list.

You may check the stderr.txt in the slots dir of the running WU.
If "Setting Memory Size for VM. (xxxxMB)" corresponds to your app_config.xml everything is fine.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30091 - Posted: 27 Apr 2017, 16:42:13 UTC - in response to Message 30090.
Last modified: 27 Apr 2017, 16:46:56 UTC

This happens after every project reset until ATLAS (in this case) is known to your host through the first server response.
Nothing to worry about if you managed to load the app_config.xml before BOINC started the WU.
See number 8 of my list.

You may check the stderr.txt in the slots dir of the running WU.
If "Setting Memory Size for VM. (xxxxMB)" corresponds to your app_config.xml everything is fine.


All sorted now :)

EDIT: found the stderr file-

2017-04-27 16:11:08 (16424): Setting Memory Size for VM. (5000MB)

The WU is 49% complete. If I can sort out which app_config file to run from now on I will try it, change whichever settings you guys recommended within LHC, and see what happens :)

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30093 - Posted: 27 Apr 2017, 19:24:30 UTC

Some suggestions for possible next steps.


1. Check the logfile

After your WU is reported, check the result on the LHC web server (it includes a copy of your stderr.txt).
- The WU should be marked as "successful"
- the logfile should include lines like

Guest Log: <metadata att_name="fsize" att_value="54070367"/>
Guest Log: -rw------- 1 root root 54070367 Apr 27 14:01 HITS.10995533._009865.pool.root.1


If this is successful, go to step 2



2. Try 1 multicore WU

Leave "Max # jobs = 1", set "Max # CPUs = 2", set <avg_ncpus>2.0</avg_ncpus> and "read config files" in your client

OR

Leave "Max # jobs = 1", set "Max # CPUs = 4", set <avg_ncpus>4.0</avg_ncpus>, set <cmdline>--memory_size_mb 6000</cmdline> and "read config files" in your client

If this is successful, go to step 3



3. Try several multicore WUs concurrently

Increase "Max # jobs" step by step either with "Max # CPUs = 2" or "Max # CPUs = 4" and set your app_config.xml accordingly.
<max_concurrent>x</max_concurrent>
<avg_ncpus>y</avg_ncpus>
<cmdline>--memory_size_mb zzzz</cmdline>
<project_max_concurrent>x</project_max_concurrent>

Don't forget the "read config files" before the next WU download.



Always check the logfiles before you go from one step to the next.


At a certain point your host will start to produce errors because of
- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)
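The step-by-step tuning described above boils down to a small calculation: given a host's RAM and core count, how many n-core WUs fit? The sketch below uses a working-set estimate fitted to the figures quoted earlier in the thread (8-core → 9000 MB, 4-core → 5800 MB); it is an inference, not an official rule:

```python
# Hedged sketch of the tuning logic in the steps above: given a host's RAM
# and core count, how many n-core ATLAS WUs fit at once? The working-set
# estimate (2600 MB + 800 MB per core) is fitted to the figures quoted
# earlier in the thread; it is not an official LHC@home formula.

def max_concurrent_wus(total_ram_mb: int, total_cores: int,
                       cores_per_wu: int) -> int:
    """Concurrent WUs limited by whichever runs out first: RAM or cores."""
    working_set_mb = 2600 + 800 * cores_per_wu  # estimated RAM per WU
    by_ram = total_ram_mb // working_set_mb
    by_cpu = total_cores // cores_per_wu
    return min(by_ram, by_cpu)

if __name__ == "__main__":
    # The 24-core / 32 GB host discussed in this thread:
    print(max_concurrent_wus(32000, 24, 8))  # 3 WUs, RAM-limited (3 x 9000 MB)
    print(max_concurrent_wus(32000, 24, 4))  # 5 WUs (5 x 5800 MB = 29000 MB)
```

This reproduces the earlier worked example: three 8-core WUs, or five 4-core WUs using 20 of the 24 CPUs.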

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30094 - Posted: 27 Apr 2017, 19:35:35 UTC - in response to Message 30093.

Some suggestions for possible next steps.


1. Check the logfile

After your WU is reported, check the result on the LHC web server (it includes a copy of your stderr.txt).
- The WU should be marked as "successful"
- the logfile should include lines like
Guest Log: <metadata att_name="fsize" att_value="54070367"/>
Guest Log: -rw------- 1 root root 54070367 Apr 27 14:01 HITS.10995533._009865.pool.root.1


If this is successful, go to step 2



2. Try 1 multicore WU

Leave "Max # jobs = 1", set "Max # CPUs = 2", set <avg_ncpus>2.0</avg_ncpus> and "read config files" in your client

OR

Leave "Max # jobs = 1", set "Max # CPUs = 4", set <avg_ncpus>4.0</avg_ncpus>, set <cmdline>--memory_size_mb 6000</cmdline> and "read config files" in your client

If this is successful, go to step 3



3. Try several multicore WUs concurrently

Increase "Max # jobs" step by step either with "Max # CPUs = 2" or "Max # CPUs = 4" and set your app_config.xml accordingly.
<max_concurrent>x</max_concurrent>
<avg_ncpus>y</avg_ncpus>
<cmdline>--memory_size_mb zzzz</cmdline>
<project_max_concurrent>x</project_max_concurrent>

Don't forget the "read config files" before the next WU download.



Always check the logfiles before you go from one step to the next.


At a certain point your host will start to produce errors because of
- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)


Thanks very much :) It's got around an hour to go, so it will be a tomorrow job, I'd guess.

Will edit the app_config file with the changes you suggested and then go from there via the steps :)

Thanks

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30108 - Posted: 29 Apr 2017, 12:54:13 UTC - in response to Message 30094.

Quick update: once this WU has finished I will start step 3 and report back, but so far, so good :)

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30111 - Posted: 29 Apr 2017, 17:06:04 UTC

Ok, so far, so good!

Changed <max_concurrent> and <project_max_concurrent> to 4, as with 5400 MB of RAM and 2 cores per WU that's the most I can do, and it gives me a little headroom too!

Is there a tried and tested combination of x cores and x RAM? I was always told 2 cores and 4800 MB of RAM...

In answer to your questions computezrmle:

- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)

1. I have been; there are a few around. On the 27th I, like many, had problem WUs.
2. At the moment I only have one GPU slot running, so no risks there!
3. My connection isn't great: 1.5 Mb down and around 0.1 Mb up.
4. I'm not sure exactly what that is, so I will Google it; the drive isn't very old, a Samsung Evo 850 500 GB.
5. RAM is my issue; I can go to 48 GB in total, I think... currently 32 GB fitted.
6. Cores are not an issue currently; I do, however, need more RAM to support those cores :(

Toby Broom
Volunteer moderator
Send message
Joined: 27 Sep 08
Posts: 376
Credit: 88,662,866
RAC: 174,150
Message 30115 - Posted: 29 Apr 2017, 20:13:58 UTC

On my PC with 12 cores / 24 threads, I can max out 64 GB if there are too many ATLAS tasks.

I've seen RAM usage very high on my 10-core / 20-thread machine too.

On my other PCs with more RAM I haven't seen so many concurrent ATLAS tasks.

I have the number of tasks set to 10 concurrent on the 64 GB machine to see if that is a bit better, as 12 made the maxed-out one slow.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30116 - Posted: 29 Apr 2017, 21:16:29 UTC

So, back to problems again... I cannot run more than 2 ATLAS tasks now, and only 3 SixTrack tasks are running... plenty of cores free, and SixTrack isn't bothered about RAM...

