issues with app config/running multiple tasks



Message boards : ATLAS application : issues with app config/running multiple tasks

Author Message
BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30066 - Posted: 26 Apr 2017, 20:24:41 UTC

Evening all,

Over the last few days BOINC has only allowed me to run a couple of ATLAS tasks at once, rather than the maximum set of 6 (normally 4 in practice, due to RAM)...

I have everything set to use 100% (CPU and RAM) within BOINC, and I've checked the settings on the LHC side too; that's all at max, jobs set to no limit. For CPUs I've tried everything from no limit down to 24; it's now at 24 and it's only allowing one task.

24 cores and 32 GB RAM.

app config:

<?xml version="1.0"?>

-<app_config>


-<app>

<name>ATLAS</name>

<max_concurrent>6</max_concurrent>

</app>


-<app_version>

<app_name>ATLAS</app_name>

<avg_ncpus>2.000000</avg_ncpus>

<plan_class>vbox64_mt_mcore_atlas</plan_class>

<cmdline>--memory_size_mb 4800</cmdline>

</app_version>

</app_config>

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 30067 - Posted: 26 Apr 2017, 20:47:47 UTC - in response to Message 30066.

In your preferences, also set the # of CPUs to 2, since you have 2 in your app_config.xml.

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30068 - Posted: 26 Apr 2017, 21:20:51 UTC - in response to Message 30066.

Your hosts are hidden, so expert users can't check your logs.
You may change your preferences and make your hosts visible.

Your app_config.xml looks strange.
Is that due to the copy/paste, or are there really lines like:

<?xml version="1.0"?>

-<app_config>


-<app>



Your setting
<avg_ncpus>2.000000</avg_ncpus>

overrules the website preference "Max # of CPUs = 24", except for the server's working-set size calculation, which is now 9000 MB per WU.
Reduce the website preference to no more than the value you use in your app_config.xml.

A 24-core host would be able to run 3 8-core WUs (3 × 9000 MB = 27000 MB).
If you configure 4-core WUs, 5800 MB would be required per WU.
This would use 20 CPUs.
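As a side note, the two working-set figures quoted above (8-core → 9000 MB, 4-core → 5800 MB) are consistent with a simple linear per-core estimate. The sketch below is just the line through those two data points, not an official LHC@home formula:

```python
# Hedged sketch: a linear working-set estimate fitted to the two figures
# quoted above (8 cores -> 9000 MB, 4 cores -> 5800 MB). This is NOT an
# official server rule, just the straight line through those two points.

def atlas_working_set_mb(ncores: int) -> int:
    """Estimated per-WU working set: 2600 MB base + 800 MB per core."""
    base_mb = 2600       # intercept fitted to the quoted data points
    per_core_mb = 800    # slope fitted to the same points
    return base_mb + per_core_mb * ncores

if __name__ == "__main__":
    print(atlas_working_set_mb(8))  # 9000, matching the 8-core figure
    print(atlas_working_set_mb(4))  # 5800, matching the 4-core figure
```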

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30070 - Posted: 26 Apr 2017, 21:58:33 UTC - in response to Message 30068.
Last modified: 26 Apr 2017, 21:58:53 UTC

In your preferences, also set the # of CPUs to 2, since you have 2 in your app_config.xml.


Now done :) thanks


Your hosts are hidden, so expert users can't check your logs.
You may change your preferences and make your hosts visible.

Your app_config.xml looks strange.
Is that due to the copy/paste, or are there really lines like:
<?xml version="1.0"?>

-<app_config>


-<app>



Your setting
<avg_ncpus>2.000000</avg_ncpus>

overrules the website preference "Max # of CPUs = 24", except for the server's working-set size calculation, which is now 9000 MB per WU.
Reduce the website preference to no more than the value you use in your app_config.xml.

A 24-core host would be able to run 3 8-core WUs (3 × 9000 MB = 27000 MB).
If you configure 4-core WUs, 5800 MB would be required per WU.
This would use 20 CPUs.


I will allow computers to show now :)

It may be due to the copy and paste... hmm, this is via the editor:

<app_config>
<app>
<name>ATLAS</name>
<max_concurrent>6</max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>2.000000</avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 4800</cmdline>
</app_version>
</app_config>

What do you advise I do? I was told to use 2 cores per work unit, with the config settings above.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30071 - Posted: 26 Apr 2017, 22:11:08 UTC

Deleted the app data file and still no change... closed and reopened BOINC, etc.

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 30073 - Posted: 27 Apr 2017, 6:07:18 UTC - in response to Message 30071.

Deleted the app data file and still no change... closed and reopened BOINC, etc.

I suppose you still have tasks in your queue that you got before your changes.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30077 - Posted: 27 Apr 2017, 9:29:38 UTC - in response to Message 30073.

Deleted the app data file and still no change... closed and reopened BOINC, etc.

I suppose you still have tasks in your queue that you got before your changes.


I did delete them, but I let the ones that were running finish. Back at work this morning, still only one running and the others saying "waiting for memory".

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30078 - Posted: 27 Apr 2017, 9:51:54 UTC - in response to Message 30077.

So, I just finished 1 task and deleted 4... removed the app_config file and just downloaded 2 WUs, both now running for 1 minute; before, they would run for seconds and then stop... without jumping to conclusions, it must be an app_config file error?!

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30079 - Posted: 27 Apr 2017, 10:32:38 UTC

Have you ever worked through Yeti's checklist?
Fine.

Besides that, you may restart the project with conservative settings.

1. Let your local WU cache get empty
2. Reset the project in BOINC
3. Update your VirtualBox software to the most recent version
4. Reboot your host
5. Set "Max # jobs = 1" and "Max # CPUs = 1" on the LHC website
6. Create the following app_config.xml

<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>


7. Request a new WU from the project
8. Reload your configuration (must be done after you got the first WU and before the WU starts)
9. Check the result before you change your settings and request a new WU

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30085 - Posted: 27 Apr 2017, 14:00:16 UTC - in response to Message 30079.

Have you ever worked through Yeti's checklist?
Fine.

Besides that, you may restart the project with conservative settings.

1. Let your local WU cache get empty
2. Reset the project in BOINC
3. Update your VirtualBox software to the most recent version
4. Reboot your host
5. Set "Max # jobs = 1" and "Max # CPUs = 1" on the LHC website
6. Create the following app_config.xml

<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>


7. Request a new WU from the project
8. Reload your configuration (must be done after you got the first WU and before the WU starts)
9. Check the result before you change your settings and request a new WU


I did go through his checklist last night; it was his checklist that made me check the preferences within LHC computing preferences :)

I have set it to not allow more tasks and will complete these 2 tasks... then follow your list and post back. Thanks :)

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30088 - Posted: 27 Apr 2017, 15:29:47 UTC

LHC@home: Notice from BOINC
Your app_config.xml file refers to an unknown application 'ATLAS'. Known applications: None
27/04/2017 3:59:09 PM


Had this come up, however it's gone now...

Will run this task through, pause everything and post again.

Erich56
Send message
Joined: 18 Dec 15
Posts: 383
Credit: 3,873,774
RAC: 7,567
Message 30089 - Posted: 27 Apr 2017, 16:14:42 UTC - in response to Message 30088.

Your app_config.xml file refers to an unknown application 'ATLAS'. ...

Had this come up, however its gone now...

Well, BOINC shows this notice only once, when you go to "Options" - "Read config files".

If you repeat this and the notice shows up again, then something is going wrong.

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30090 - Posted: 27 Apr 2017, 16:20:36 UTC

This happens after every project reset until ATLAS (in this case) is known to your host through the first server response.
Nothing to worry about if you managed to load the app_config.xml before BOINC started the WU.
See number 8 of my list.

You may check the stderr.txt in the slots dir of the running WU.
If "Setting Memory Size for VM. (xxxxMB)" corresponds to your app_config.xml everything is fine.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30091 - Posted: 27 Apr 2017, 16:42:13 UTC - in response to Message 30090.
Last modified: 27 Apr 2017, 16:46:56 UTC

This happens after every project reset until ATLAS (in this case) is known to your host through the first server response.
Nothing to worry about if you managed to load the app_config.xml before BOINC started the WU.
See number 8 of my list.

You may check the stderr.txt in the slots dir of the running WU.
If "Setting Memory Size for VM. (xxxxMB)" corresponds to your app_config.xml everything is fine.


All sorted now :)

EDIT: found the stderr file-

2017-04-27 16:11:08 (16424): Setting Memory Size for VM. (5000MB)

The WU is 49% complete. If I can sort out which app_config file to run from now on I will try it, change whichever settings you guys recommended within LHC, and see what happens :)

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30093 - Posted: 27 Apr 2017, 19:24:30 UTC

Some suggestions for possible next steps.


1. Check the logfile

After your WU is reported, check the result on the LHC web server (it includes a copy of your stderr.txt).
- The WU should be marked as "successful"
- the logfile should include lines like

Guest Log: <metadata att_name="fsize" att_value="54070367"/>
Guest Log: -rw------- 1 root root 54070367 Apr 27 14:01 HITS.10995533._009865.pool.root.1


If this is successful, go to step 2



2. Try 1 multicore WU

Leave "Max # jobs = 1", set "Max # CPUs = 2", set <avg_ncpus>2.0</avg_ncpus> and "read config files" in your client

OR

Leave "Max # jobs = 1", set "Max # CPUs = 4", set <avg_ncpus>4.0</avg_ncpus>, set <cmdline>--memory_size_mb 6000</cmdline> and "read config files" in your client

If this is successful, go to step 3



3. Try several multicore WUs concurrently

Increase "Max # jobs" step by step either with "Max # CPUs = 2" or "Max # CPUs = 4" and set your app_config.xml accordingly.
<max_concurrent>x</max_concurrent>
<avg_ncpus>y</avg_ncpus>
<cmdline>--memory_size_mb zzzz</cmdline>
<project_max_concurrent>x</project_max_concurrent>

Don't forget the "read config files" before the next WU download.



Always check the logfiles before you go from one step to the next.


At a certain point your host will start to produce errors because of
- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)
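The step-by-step tuning described above boils down to a small calculation: given a host's RAM and core count, how many n-core WUs fit? The sketch below uses a working-set estimate fitted to the figures quoted earlier in the thread (8-core → 9000 MB, 4-core → 5800 MB); it is an inference, not an official rule:

```python
# Hedged sketch of the tuning logic in the steps above: given a host's RAM
# and core count, how many n-core ATLAS WUs fit at once? The working-set
# estimate (2600 MB + 800 MB per core) is fitted to the figures quoted
# earlier in the thread; it is not an official LHC@home formula.

def max_concurrent_wus(total_ram_mb: int, total_cores: int,
                       cores_per_wu: int) -> int:
    """Concurrent WUs limited by whichever runs out first: RAM or cores."""
    working_set_mb = 2600 + 800 * cores_per_wu  # estimated RAM per WU
    by_ram = total_ram_mb // working_set_mb
    by_cpu = total_cores // cores_per_wu
    return min(by_ram, by_cpu)

if __name__ == "__main__":
    # The 24-core / 32 GB host discussed in this thread:
    print(max_concurrent_wus(32000, 24, 8))  # 3 WUs, RAM-limited (3 x 9000 MB)
    print(max_concurrent_wus(32000, 24, 4))  # 5 WUs (5 x 5800 MB = 29000 MB)
```

This reproduces the earlier worked example: three 8-core WUs, or five 4-core WUs using 20 of the 24 CPUs.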

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30094 - Posted: 27 Apr 2017, 19:35:35 UTC - in response to Message 30093.

Some suggestions for possible next steps.


1. Check the logfile

After your WU is reported, check the result on the LHC web server (it includes a copy of your stderr.txt).
- The WU should be marked as "successful"
- the logfile should include lines like
Guest Log: <metadata att_name="fsize" att_value="54070367"/>
Guest Log: -rw------- 1 root root 54070367 Apr 27 14:01 HITS.10995533._009865.pool.root.1


If this is successful, go to step 2



2. Try 1 multicore WU

Leave "Max # jobs = 1", set "Max # CPUs = 2", set <avg_ncpus>2.0</avg_ncpus> and "read config files" in your client

OR

Leave "Max # jobs = 1", set "Max # CPUs = 4", set <avg_ncpus>4.0</avg_ncpus>, set <cmdline>--memory_size_mb 6000</cmdline> and "read config files" in your client

If this is successful, go to step 3



3. Try several multicore WUs concurrently

Increase "Max # jobs" step by step either with "Max # CPUs = 2" or "Max # CPUs = 4" and set your app_config.xml accordingly.
<max_concurrent>x</max_concurrent>
<avg_ncpus>y</avg_ncpus>
<cmdline>--memory_size_mb zzzz</cmdline>
<project_max_concurrent>x</project_max_concurrent>

Don't forget the "read config files" before the next WU download.



Always check the logfiles before you go from one step to the next.


At a certain point your host will start to produce errors because of
- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)


Thanks very much :) It's got around an hour to go, so it will be a tomorrow job, I'd guess.

Will edit the app_config file with the changes you suggested and then go from there via the steps :)

Thanks

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30108 - Posted: 29 Apr 2017, 12:54:13 UTC - in response to Message 30094.

Quick update: once this WU has finished I will start step 3 and report back, but so far, so good :)

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30111 - Posted: 29 Apr 2017, 17:06:04 UTC

Ok, so far, so good!

Changed <max_concurrent> and <project_max_concurrent> to 4, as with 5400 MB of RAM and 2 cores per WU that's the most I can do, and it gives me a little headroom too!

Is there a tried and tested combination of x cores and x RAM? I was always told 2 cores and 4800 MB of RAM...

In answer to your questions computezrmle:

- faulty WUs -> check the message boards
- other projects also need resources
- a saturated internet connection -> how fast is it?
- saturated disk IO -> a lot of users don't check/believe this point
- not enough RAM -> test another combination of #WUs / cores per WU / RAM per WU
- not enough CPUs -> unlikely in your case :-)

1. I have been; there are a few around. On the 27th I, like many, had problem WUs.
2. At the moment I only have one GPU slot running, so no risks there!
3. My connection isn't great: 1.5 Mb down and around 0.1 Mb up.
4. I'm not sure exactly what that is, so I will Google it; the drive isn't very old, a Samsung Evo 850 500 GB.
5. RAM is my issue; I can go to 48 GB in total, I think... currently 32 GB fitted.
6. Cores are not an issue currently; I do, however, need more RAM to support those cores :(

Toby Broom
Volunteer moderator
Send message
Joined: 27 Sep 08
Posts: 376
Credit: 88,662,866
RAC: 174,150
Message 30115 - Posted: 29 Apr 2017, 20:13:58 UTC

On my PC with 12 cores / 24 threads, I can max out 64 GB if there are too many ATLAS tasks.

I've seen RAM usage very high on my 10-core / 20-thread machine too.

On my other PCs with more RAM I haven't seen so many concurrent ATLAS tasks.

I have the number of tasks set to 10 concurrent on the 64 GB machine to see if that is a bit better, as 12 made the maxed-out one slow.

BRG
Send message
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 30116 - Posted: 29 Apr 2017, 21:16:29 UTC

So, back to problems again... I cannot run more than 2 ATLAS tasks now, and only 3 SixTrack tasks are running... plenty of cores free, and SixTrack isn't bothered about RAM...

