1) Message boards : CMS Application : New Version 50.00 (Message 43328)
Posted 9 Sep 2020 by Pavel Hanak
Post:
Unfortunately, all Virtualbox apps have always been quite opaque when it comes to actual memory and bandwidth requirements. Boinc Manager is unable to display them and you can't google this information (I tried before I asked here). The VM console doesn't display what the workunit actually does and the Virtualbox Manager has no graphs or statistics, either. So it's very easy to become "over-enthusiastic" that way, because the average user is left to guesswork with Windows task manager or similar tools. :-/
2) Message boards : CMS Application : New Version 50.00 (Message 43313)
Posted 6 Sep 2020 by Pavel Hanak
Post:
A typical CMS subtask writes a result file of roughly 110 MB within 1-3 h (average 2 h).
This file requires about 3 min to be uploaded on your line (5 Mbit/s).
On average, 15 concurrently running CMS tasks require 38% of your total upload capacity.


Thanks for the info. It's not so rosy in practice, though: it seems the crunching stalls until the result file is completely uploaded. And when 15+ WUs upload at the same time, each upload fluctuates around 0.3 Mbit/s (or 30 kB/s, welcome to the dial-up era), so the upload takes over an hour. In the meantime, other WUs complete another result and try to upload it. The end result is that only 3 or 4 of the 15 WUs actually crunch; the rest are waiting for upload. Or at least it gets stuck in this vicious cycle if some other program uses up the upload bandwidth for 20 minutes or so. I need to limit the number of CMS tasks via app_config.xml. Are you sure the CMS result data is only 55 MB/hour on average?
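As a rough sanity check of the figures quoted above, here is a minimal Python sketch. The function names are made up for illustration, and it assumes 1 MB = 8 Mbit with no protocol overhead and a perfectly even split of the line between uploads, which a real connection will not deliver.

```python
def upload_seconds(file_mb: float, link_mbit_s: float) -> float:
    """Seconds to upload one result file, ignoring protocol overhead."""
    return file_mb * 8 / link_mbit_s


def upload_share(tasks: int, file_mb: float, period_h: float, link_mbit_s: float) -> float:
    """Average fraction of the uplink needed by `tasks` concurrent tasks,
    each producing one file of `file_mb` MB every `period_h` hours."""
    avg_mbit_s = tasks * file_mb * 8 / (period_h * 3600)
    return avg_mbit_s / link_mbit_s


# One 110 MB file on an otherwise idle 5 Mbit/s uplink, in minutes.
print(upload_seconds(110, 5) / 60)
# Average share of the uplink for 15 tasks, one file every 2 h each.
print(upload_share(15, 110, 2, 5))
# Minutes per file if 15 uploads run at once and split the line evenly.
print(upload_seconds(110, 5 / 15) / 60)
```

The first two numbers come out at roughly 3 minutes and a bit under 40%, in line with the quoted estimate. The third shows why simultaneous uploads hurt: each file then needs most of an hour even in the ideal case.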
3) Message boards : CMS Application : New Version 50.00 (Message 43311)
Posted 6 Sep 2020 by Pavel Hanak
Post:
I've crunched a few dozen CMS version 50.00 WUs now and I've noticed they consume a lot of network bandwidth. In fact, when 15+ of them run at the same time, they completely use up the 5 Mbit/s upload limit of my home connection (download is fine). What is the total download and upload size each WU generates during the entire run?
4) Message boards : LHCb Application : app_config.xml for multicore LHCb (Message 37015)
Posted 13 Oct 2018 by Pavel Hanak
Post:
Hi all, I couldn't find any readily-available app_config example for the new multicore LHCb app, so I'm posting my LHCb section which I put together from information in various threads:

<app>
    <name>LHCb</name>
    <max_concurrent>1</max_concurrent>
</app>
<app_version>
    <app_name>LHCb</app_name>
    <plan_class>vbox64_mt_mcore_lhcb</plan_class>
    <avg_ncpus>4.000000</avg_ncpus>
    <!-- 2048 MB for the first thread plus 1300 MB for each additional core. -->
    <cmdline>--memory_size_mb 5948</cmdline>
</app_version>
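The memory rule in the comment can be turned into a small helper for other core counts. This is only a sketch of the rule of thumb stated in the comment, not an official figure; the function name and parameter names are hypothetical.

```python
def lhcb_memory_mb(ncpus: int, first_mb: int = 2048, extra_mb: int = 1300) -> int:
    """Memory for --memory_size_mb: first_mb for the first thread,
    plus extra_mb for each additional core (rule of thumb from the comment)."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return first_mb + extra_mb * (ncpus - 1)


# Print the value to put into <cmdline> for a few core counts.
for n in (1, 2, 4):
    print(n, lhcb_memory_mb(n))
```

For 4 cores this gives the 5948 MB used in the cmdline line above.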
5) Message boards : Theory Application : Version 263.80 doesn't respect local CPU number limit? (Message 37002)
Posted 10 Oct 2018 by Pavel Hanak
Post:
Nice math exercise, but in reality, BM reports "max CPUs used:30" when I set them to 94%. Don't ask me why.

The 30% CPU over-utilization seems about right: when I lowered the CPU limit to 85%, the real utilization hovered around 96-97% when two Theory WUs were running. However, I'm confused by your last line. It doesn't seem that Windows Task Manager was reporting the utilization wrong; the computer suddenly got very sluggish when Theory was running. The over-utilization seems to be real, not just a measurement error.

In any case, here is my app_config.xml which I got in another forum thread (and modified). For some reason, there were both cores and threads specified in it:

<app>
    <name>Theory</name>
    <max_concurrent>6</max_concurrent>
</app>
<app_version>
    <app_name>Theory</app_name>
    <avg_ncpus>2.0</avg_ncpus>
    <plan_class>vbox64_mt_mcore</plan_class>
    <cmdline>--nthreads 4 --memory_size_mb 1050</cmdline>
</app_version>
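One possible reading of the over-utilization is the mismatch between avg_ncpus (2) and --nthreads (4) in this config. The sketch below is only a guess under stated assumptions: that avg_ncpus is what BOINC budgets per task while --nthreads is what each VM can really use, that the client rounds the CPU budget down and fills it completely, and that every VM keeps all its threads busy. The function name is made up.

```python
def busy_threads(total_threads, cpu_limit_pct, theory_tasks, avg_ncpus, nthreads):
    """Estimate busy hardware threads when VMs start more threads than BOINC budgets.

    Assumptions (not confirmed by the project): the client rounds the CPU
    budget down, fills the whole budget, and every VM keeps all of its
    `nthreads` virtual CPUs busy.
    """
    budget = total_threads * cpu_limit_pct // 100    # CPUs BOINC will schedule
    budgeted_for_theory = theory_tasks * avg_ncpus   # what BOINC thinks Theory uses
    actually_used = theory_tasks * nthreads          # what the VMs may really use
    busy = budget - budgeted_for_theory + actually_used
    return min(busy, total_threads)


busy = busy_threads(total_threads=32, cpu_limit_pct=85,
                    theory_tasks=2, avg_ncpus=2, nthreads=4)
print(busy, f"{busy / 32:.1%}")  # 31 96.9%
```

Under those assumptions, two Theory WUs at an 85% limit on a 32-thread machine would keep 31 threads busy, which is at least consistent with the 96-97% observed.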
6) Message boards : Theory Application : Version 263.80 doesn't respect local CPU number limit? (Message 36989)
Posted 9 Oct 2018 by Pavel Hanak
Post:
Hi all, I just want to check if anyone else has this issue. One of my machines has a 16c/32t AMD TR1950X, but in the Boinc Manager local preferences I limited the number of CPUs to 94%, so that I always have at least 2 threads free (for better responsiveness). All other projects and apps respect it, but not the new 263.80 Theory app. Whenever Theory runs, the CPU always jumps to 100%, even though I limited it in app_config.xml to 2c/4t. Then I tried editing the app_config.xml to 1c/1t, but it didn't help, at least not for the WUs in progress. I will see if it helps for new WUs, though that may take a while, because Theory doesn't have much work at the moment. This particular machine runs W10 x64 Pro and Boinc 7.12.1.
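For reference, the percentage limit translates into a CPU count roughly as sketched below. This assumes the client rounds the result down and never goes below one CPU; the function name is hypothetical.

```python
import math


def usable_cpus(hardware_threads: int, limit_pct: float) -> int:
    """CPUs BOINC may use, assuming the percentage is rounded down (minimum 1)."""
    return max(1, math.floor(hardware_threads * limit_pct / 100))


print(usable_cpus(32, 94))  # 30, leaving 2 of the 32 threads free
```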
7) Message boards : Theory Application : Theory not utilizing all cores (Message 36791)
Posted 20 Sep 2018 by Pavel Hanak
Post:
Awesome, thanks! The memory limit is fixed to 750 MB for Theory? There is no need to adjust it with the number of CPUs/threads, like in ATLAS?
8) Message boards : Theory Application : Theory not utilizing all cores (Message 36786)
Posted 20 Sep 2018 by Pavel Hanak
Post:
Hi all, one of my machines has a 16c/32t AMD TR1950X and I've also noticed that the new 263.70 multicore app severely under-utilizes the CPU. What's even worse, all the tasks I've watched failed about halfway through the computation. So naturally, I want to try limiting the number of cores as suggested here, but limiting them globally via the web interface seems rather cumbersome to me. In fact, I fine-tune them according to the core and memory capacity of each of my machines. Isn't there a way to limit them for Theory via the app_config.xml file, like it can be done for ATLAS?

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=35921

Anyway, at the moment I've "fixed" it by manually aborting all 263.70 tasks, in hopes that the announced 263.80 version will work better.
9) Message boards : Number crunching : Checklist Version 3 for Atlas@Home (and other VM-based Projects) on your PC (Message 32527)
Posted 26 Sep 2017 by Pavel Hanak
Post:
So, the "CPU time at last checkpoint" counter doesn't work in 7.8.x? Does it happen for all tasks or only the ones running on VirtualBox?

In any case, it's bad news for me, because I use that counter to check for hung/broken tasks. Oh well.
10) Message boards : Number crunching : Checklist Version 3 for Atlas@Home (and other VM-based Projects) on your PC (Message 32475)
Posted 19 Sep 2017 by Pavel Hanak
Post:
Does anyone have experience with newer BOINC Manager/Client 7.8.x for Atlas and other LHC tasks? Are there some advantages or disadvantages over 7.6.x?
11) Message boards : LHCb Application : Upload Size (Message 30624)
Posted 4 Jun 2017 by Pavel Hanak
Post:
Yeah, this is what irks me about LHCb, because I have slow upload speed (50 KB/s) on one of my machines. In essence, LHCb wastes CPU time this way. In my case, it would have been much more efficient if LHCb downloaded data for the next cycle and continued computation while the upload is taking place.

I hope the scientists will add this feature later...
12) Message boards : LHCb Application : LHCb app transfers data while WUs are running? (Message 29639)
Posted 25 Mar 2017 by Pavel Hanak
Post:
I see. What happens when several LHCb apps try to download/upload data, but it takes very long due to ISP speed limits? I have one machine on ADSL with a lousy 50 KB/s upload, but that machine routinely runs 8 LHCb apps at once (it has 24 GB RAM). Sometimes they upload for 30 minutes or more. I suspect the slow transfer speeds may be responsible for some WUs failing. Are there any timeouts in the app which may cause such failures?

BTW, I haven't seen any multicore LHCb WUs on my machines yet, I didn't even know they exist. But my machines have already completed a few multicore ATLAS ones (I used to crunch ATLAS even before the merger with LHC).
13) Message boards : LHCb Application : LHCb app transfers data while WUs are running? (Message 29636)
Posted 25 Mar 2017 by Pavel Hanak
Post:
Hi all, I've started to crunch WUs for the LHCb app recently and I've noticed that some of them generate a lot of Internet traffic while they're running. I mostly notice it when they're uploading something, because I have rather slow upload on one of my machines. It seems that address 128.142.142.167 is used most often, but at least a dozen other addresses appear in my logs, too. Some WUs transfer dozens of megabytes like that, others need much less. Is that normal? If so, I'm actually curious what exactly the WUs are downloading or uploading. Is it some auxiliary data needed for the simulations?


