Message boards :
Number crunching :
best practices/how to get most efficiency for higher core machines
Joined: 17 Feb 17 · Posts: 42 · Credit: 2,589,736 · RAC: 0
Hello, I've been out of the loop on this project for a while and decided to give it another go. I have nearly 200 cores to add, so I just want to make sure I'm not going to bungle this with a lot of errors or stuck VMs.

I've got several 24 to 48 thread Xeons with anywhere from 32-64 GB RAM, as well as a few 8-thread i7s with 16-32 GB, all running Linux. Squid will be used, but my biggest concern is RAM and stuck VMs. I know Theory takes the least RAM per WU, but for best efficiency, what would folks recommend for managing ATLAS and CMS work units? Can BOINC be trusted to manage RAM on its own? I'm still trying to figure out number of work units vs. number of CPUs, and I believe the latter only applies to ATLAS.

Most of these machines, apart from a few, are running cheap SSDs, since I figure a lot of disk activity will be going on, and with a lot of WUs crunching at the same time that might be a factor in how fast they start/stop, especially with CMS. I'm not sure what the process is for each one to start and finish.

Right now the goal is to add machines very slowly, making sure each one can crunch ATLAS, Theory, and CMS with one WU of each before moving on to the next. Examples of processor and RAM configurations: E5-2670 v3 with 64 GB, E5-2680 with 32 GB. If I remember right, CMS and ATLAS are the biggest users of bandwidth and disk?

Thanks, and any help is appreciated!
Joined: 27 Sep 08 · Posts: 859 · Credit: 703,792,323 · RAC: 161,849
I don't get many stuck VMs anymore. ATLAS is the trickiest to run: if you allow unlimited WUs, it tries to use 10 GB of memory per WU. You can of course tweak this, but then it's hard to keep the memory usage on track manually. CMS runs smoothly.

I don't think there is much disk activity in general: peak total transfers are 38%, peak writes are about 40, and 100 for reads. A Squid proxy will reduce the load on the CERN servers and your internet usage.

I use 90% of total threads to give some breathing room for OS overhead. Running 42 CMS at once on a 48-thread system, this works out to 96 GB of RAM usage and 100% CPU load. E.g. on one computer right now I have 14 ATLAS, 17 CMS and 5 Theory, using 98% CPU and 156 GB of memory.
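Those figures can be sanity-checked with a quick sketch. The thread count, task count, and RAM total are taken from the post above; the 10% headroom is the rule of thumb mentioned there:

```python
# Rough memory/thread budgeting for a 48-thread host (figures from the post above)
threads = 48
usable = int(threads * 0.9)    # leave ~10% headroom for OS overhead
cms_tasks = 42                 # CMS tasks actually running
total_ram_gb = 96              # observed RAM usage for those 42 tasks
per_task_gb = total_ram_gb / cms_tasks

print(usable)                  # 43 threads usable at the 90% setting
print(round(per_task_gb, 2))   # ~2.29 GB per CMS task (2 GB VM plus overhead)
```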
Joined: 17 Feb 17 · Posts: 42 · Credit: 2,589,736 · RAC: 0
> I don't get many stuck VMs anymore.

Great to hear!

> ATLAS is the most tricky to run, if you want to allow unlimited WUs then it tries to use 10GB of memory per WU ... CMS runs smoothly.

10 GB per WU? Is this for a single-core work unit? What if I only select a certain number of them? Or select 1 WU to use 24 cores?

> I don't think there is much disk activity in general, peak total transfers are 38% peak writes are about 40 and 100 for reads.

Thank you for those figures. How is bandwidth for CMS and ATLAS? I could probably get away with running a lot more native Theory tasks, I'm guessing.
Joined: 15 Jun 08 · Posts: 2607 · Credit: 262,565,847 · RAC: 138,862
The ATLAS RAM setting is calculated server side as: 3000 + 900 * n_cores

"Unlimited" means: without local tweaking, ATLAS uses up to 8 cores for vbox tasks and up to 12 cores for native tasks. As a result, the RAM calculation limit is 10200 MB for vbox and 13800 MB for native. (I'm not 100% sure the native limit is still active.)

Modern internet connections usually don't suffer from low bandwidth. The limiting factors are latency and not enough RAM on the router(s) to handle the large number of concurrently open connections, especially for this project, since it transfers thousands of very small files. A local Squid keeps those connections inside your LAN, hence offloads the routers (including the local one) and the target servers. In the case of CMS's Frontier requests the reuse factor can be greater than 95%.
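The server-side formula above can be written as a quick check; the 8-core (vbox) and 12-core (native) defaults are the ones stated in this post:

```python
def atlas_ram_mb(n_cores: int) -> int:
    """Server-side ATLAS RAM limit: 3000 MB base plus 900 MB per core."""
    return 3000 + 900 * n_cores

# Untweaked defaults mentioned above:
print(atlas_ram_mb(8))    # vbox tasks, up to 8 cores  -> 10200 MB
print(atlas_ram_mb(12))   # native tasks, up to 12 cores -> 13800 MB
```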
Joined: 27 Sep 08 · Posts: 859 · Credit: 703,792,323 · RAC: 161,849
> 10 GB per WU? Is this for a single-core work unit? What if I only select a certain number of them? Or select 1 WU to use 24 cores?

It's something to do with the unlimited selection: BOINC will get more than 8 WUs, and then I force them back to a single core with app_config. I can't remember what happens if you pick 1 core and unlimited jobs; then the RAM is (what cp said) GB, same as CMS, but then there is a limit on the number of WUs. Not sure what the limit was, though.

> How is bandwidth for CMS and ATLAS? I could probably get away with running a lot more native Theory tasks, I'm guessing.

Network? My Squid proxy has a peak of 120 MB/s up and 92 down, with averages of 1.9 and 0.23 (feeding 240 threads). The disk load is about the same on the one running a mix of WUs as on one running all CMS. The Squid one is a bit more intense on the disk, but it's like 90% cached files in RAM, so again not so bad.
Joined: 17 Feb 17 · Posts: 42 · Credit: 2,589,736 · RAC: 0
Thank you for all of that. I'll have to play with settings and maybe app_config to figure out the best solution. Maybe two 12-core WUs from ATLAS and the rest for CMS and Theory. How much RAM do native Theory and CMS use?
Joined: 15 Jun 08 · Posts: 2607 · Credit: 262,565,847 · RAC: 138,862
Native Theory: usually 600-800 MB per task. BUT! Occasionally there will be special tasks (madgraph) that allocate >6.5 GB, plus a 2nd core.

CMS: if not tweaked, each task will set up a 2 GB VM, plus some MB for vboxwrapper.
Joined: 17 Feb 17 · Posts: 42 · Credit: 2,589,736 · RAC: 0
> CMS: if not tweaked, each task will set up a 2 GB VM.

CMS - not tweaked? How can one tweak them, and what will result? Thanks.
Joined: 2 May 07 · Posts: 2260 · Credit: 175,581,097 · RAC: 11,545
6 ATLAS-native tasks with 12 CPUs each and 100 GB RAM on a CentOS 8 Xeon with 72 CPUs (6x12). https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10587392 The number one in our computer hitlist. Btw, 6 of the best 20 computers are.... from Toby Broom!
Joined: 15 Jun 08 · Posts: 2607 · Credit: 262,565,847 · RAC: 138,862
> CMS - not tweaked? How can one tweak them and what will result?

You can tweak some parameters using an app_config.xml. See this page for details: https://boinc.berkeley.edu/wiki/Client_configuration Your own app_config.xml must strictly follow the template shown there. Default values can be found in client_state.xml.

Mostly used for LHC@home tweaking:
<max_concurrent>n</max_concurrent>
<project_max_concurrent>N</project_max_concurrent>
<avg_ncpus>x</avg_ncpus>  (1)

The VM's RAM size can be tweaked using:
<cmdline>--memory_size_mb 2048</cmdline>  (2)

(1) The manual explains: "...(possibly fractional)...", but this makes no sense here, since avg_ncpus also tells vboxwrapper how many cores it should configure for the VM. The latter only accepts integer values, hence use "x" or "x.0".

(2) 2048 is the default for CMS and doesn't need to be specified here. Setting it higher would be a waste of RAM, since a VM never returns allocated RAM to the OS. Setting it a bit lower will slow down the VM. Setting it much lower will cause the scientific app not to run, since it checks whether enough RAM is available.
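Putting those tags together, a minimal app_config.xml might look like the sketch below. The app names and plan_class values here are assumptions (copy the exact strings from your own client_state.xml), and the concurrency numbers are only illustrative:

```xml
<app_config>
  <!-- Hypothetical example: cap ATLAS at 2 concurrent tasks, 12 cores each -->
  <app>
    <name>ATLAS</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>native_mt</plan_class>   <!-- assumed; check client_state.xml -->
    <avg_ncpus>12</avg_ncpus>            <!-- integer only, see note (1) above -->
  </app_version>

  <!-- Hypothetical example: up to 10 single-core CMS tasks at the default VM size -->
  <app>
    <name>CMS</name>
    <max_concurrent>10</max_concurrent>
  </app>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64</plan_class>      <!-- assumed; check client_state.xml -->
    <avg_ncpus>1</avg_ncpus>
    <cmdline>--memory_size_mb 2048</cmdline>  <!-- 2048 is the CMS default -->
  </app_version>

  <!-- Overall cap across all LHC@home apps on this host -->
  <project_max_concurrent>40</project_max_concurrent>
</app_config>
```

The file goes into the project directory (e.g. projects/lhcathome.cern.ch_lhcathome/) and is picked up after "Options → Read config files" or a client restart.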
Joined: 27 Sep 08 · Posts: 859 · Credit: 703,792,323 · RAC: 161,849
> The VM's RAM size can be tweaked using <cmdline>--memory_size_mb 2048</cmdline>

Take care with this one: BOINC doesn't know how much RAM is actually used, so you can end up with too much RAM usage and your computer locking up.
Joined: 17 Feb 17 · Posts: 42 · Credit: 2,589,736 · RAC: 0
> You can tweak some parameters using an app_config.xml.

Thank you, I will give this a look. I think I will need to, as I may need more than 3 preference types.
Joined: 12 Jul 11 · Posts: 857 · Credit: 1,619,050 · RAC: 0
Hi, are you running SixTrack? Virtually no I/O. Thanks. Eric
Joined: 17 Feb 17 · Posts: 42 · Credit: 2,589,736 · RAC: 0
> Hi, are you running SixTrack? Virtually no I/O. Thanks. Eric

I do have it selected, but there don't seem to be any tasks available currently. Thanks.
©2025 CERN