Message boards :
ATLAS application :
Runtime Calculation is far too high
Message board moderation
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
WU runtime calculation inside the BOINC client is based on 2 main parameters:
- <rsc_fpops_est>
- <flops>

Both parameters are received from the server as part of a sched_reply_*.xml. <rsc_fpops_est> is provided as part of the project description. <flops> is an averaged value stored in the server DB as part of the <app_version> description per host, and is only slowly adapted after the client has returned a couple of WUs.

The most recent ATLAS WUs seem to have a <rsc_fpops_est> value that is far too high. As a result, my client estimates a runtime of more than 20 h per WU, whereas the real runtime was about 6 h (1 core).

Consequences:
- On the client, the WU cache is marked as "full" although it isn't. This disturbs the project balance in a mixed-project environment.
- On the server, it influences the app scheduling as well as the credit calculation, according to the BOINC documentation.

Request: the project team should invest more effort in estimating correct values for <rsc_fpops_est>, especially as there are - at the moment - more than 10 different types of ATLAS WUs in progress.
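The division described above is simple enough to sketch directly. A minimal illustration of how the client derives its initial estimate from the two server-supplied values (the numbers below are illustrative, not from any particular host):

```python
# Sketch: the BOINC client's initial runtime estimate is simply
# <rsc_fpops_est> divided by <flops>. Illustrative values only.
rsc_fpops_est = 43_200_000_000_000   # total FLOPs the server estimates for the WU
flops = 2_000_000_000.0              # per-host speed estimate from <app_version>

estimated_runtime_s = rsc_fpops_est / flops
print(f"Initial estimate: {estimated_runtime_s / 3600:.1f} h")  # 6.0 h
```

An inflated <rsc_fpops_est> therefore inflates the estimate linearly, which is why the client's cache fills up early.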
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
The runtime estimate is automatically determined by upstream systems using the history of completed tasks. Since some tasks take a very long time, the estimate has increased to its current high value. While we try to tune the algorithm on the ATLAS side, I've put a cap of 4 hours on the runtime estimate - BOINC weights this by the CPU speed, so it may be more or less than 4 hours.
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
I have a task with "Estimated computation size" at 43,200 GFLOPs which took "4 hours 39 min 29 sec" of CPU time to complete. But BOINC Manager estimates that it will take 17:15 minutes to run as a 4-core WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=132039446

I have the impression that the "Estimated computation size" is too low. If I understand correctly, this estimation is linked to the taskID, which in this case is 10995522. The tasks I had some 2 days ago were taskID=10995530, which also took around "4 hours 30 min" of CPU time to complete, but BOINC Manager was more accurate in estimating the time to complete (around 2 hours, if I remember correctly). But I don't know what the "Estimated computation size" was for those taskID=10995530 tasks. Is there a way to check it from the Stderr output?

We are the product of random evolution.
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
You may check your client_state.xml file (carefully! Do not write to it while BOINC is running!). Locate the <app_version> sections and divide <rsc_fpops_est> by <flops>. The result is the (initial) remaining runtime of the WU as it is displayed by your client.
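For anyone who wants to automate that lookup, here is a minimal sketch under a simplified, assumed tag layout (one <flops> per <app_version>, one <rsc_fpops_est> per <workunit>, matched by app_name for illustration - this is not the exact client logic). Work on a copy of the file; never touch the live client_state.xml while BOINC is running.

```python
# Sketch (assumed, simplified layout): compute the initial runtime estimate
# for each workunit from a COPY of client_state.xml.
import xml.etree.ElementTree as ET

def initial_estimates(root):
    """Return {workunit name: initial runtime estimate in seconds}."""
    # <flops> per <app_version>, keyed by app_name (illustrative matching)
    flops = {av.findtext("app_name"): float(av.findtext("flops"))
             for av in root.iter("app_version")}
    return {wu.findtext("name"):
            float(wu.findtext("rsc_fpops_est")) / flops[wu.findtext("app_name")]
            for wu in root.iter("workunit")
            if wu.findtext("app_name") in flops}

# Usage (on a copy of the file):
# root = ET.parse("client_state_copy.xml").getroot()
# for name, secs in initial_estimates(root).items():
#     print(f"{name}: {secs / 3600:.1f} h")
```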
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
"Divide <rsc_fpops_est> by <flops>."

Thanks. I think I found where the issue is coming from. The <flops>a very large number</flops> tag for ATLAS in LHC@Home shows a ridiculously high value of nearly 40 GFLOPS, whereas all the other applications have 10 GFLOPS or less. I am not sure where this figure came from, but it is slowly decreasing, so I guess it will adjust itself, as David said.

We are the product of random evolution.
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
I'm rather sure there is some kind of "resonance vibration", as both parameters change with every calculated WU and, last but not least, also influence the credit calculation.

BOINC documentation: Job runtime estimation
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
I now have estimated times of 12 mins on tasks which take 1 hour, so it's still not very accurate. We submit the tasks with an estimated FLOPs of estimated time * 3*10^9, assuming the average CPU can do 3 GFLOPS. As mentioned above, I capped the time at 4 hours, so I've no idea how this gets reduced to 12 mins. I guess it could be divided by the number of cores, but that still leaves a big difference. But at least the credit per task went up :)
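The submission-side arithmetic described above can be sketched as follows (the constant and function name are illustrative, not the actual server code). Note that the 4-hour cap reproduces exactly the 43,200 GFLOPs "Estimated computation size" reported earlier in this thread:

```python
# Sketch of the submission-side estimate: rsc_fpops_est = expected runtime
# times an assumed 3 GFLOPS average CPU. Names are illustrative.
ASSUMED_CPU_FLOPS = 3e9  # "the average CPU can do 3 GFLOPS"

def rsc_fpops_est(expected_runtime_s):
    return expected_runtime_s * ASSUMED_CPU_FLOPS

# With the 4-hour cap mentioned above:
print(rsc_fpops_est(4 * 3600))  # 4.32e13 FLOPs = 43,200 GFLOPs
```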
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
I patched one WU locally to see what happens:

<rsc_fpops_est>50000000000000</rsc_fpops_est> # 50,000 GFLOPs workunit effort; rough rule-of-thumb value based on my previous runtimes
<flops>5310000000.000000</flops> # 5.3 GFLOPS CPU performance; from CPU benchmark (rounded)
<avg_ncpus>1.000000</avg_ncpus> # BOINC sees 1 CPU
<cmdline>--nthreads 2</cmdline> # VM runs on 2 CPUs

Calculated runtime: 9,416 seconds (2 h 37 min)
Real runtime: 9,379.54 seconds (2 h 36 min)
Remark: at the end the WU still had 1.5 h left.

Real values from the most recent request/reply:

<rsc_fpops_est>43200000000000.000000</rsc_fpops_est> # 43,200 GFLOPs workunit effort; yesterday's WUs had more than 1,800,000 GFLOPs
<flops>26160814030.021271</flops> # 26.2 GFLOPS CPU performance; estimated by the project server -> 4.9 CPUs (???)

Credits:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=132051498 9,379.54 seconds, 346.34 credits
https://lhcathome.cern.ch/lhcathome/result.php?resultid=132229331 9,031.70 seconds, 1,432.91 credits (very nice!!!)

Without exact knowledge of the server code and the role of the upstream systems it is not possible to make any suggestion. At least one thing is clear: there are close dependencies between estimated runtime, calculated CPU performance and credits.
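The two estimates quoted above can be reproduced directly from the posted numbers (estimate = rsc_fpops_est / flops), which also shows why the server values lead to far-too-short client estimates:

```python
# Reproducing the runtime estimates from the values quoted above.
patched = 50_000_000_000_000 / 5_310_000_000          # ~9,416 s (2 h 37 min), the locally patched WU
server = 43_200_000_000_000 / 26_160_814_030.021271   # ~1,651 s (~28 min), the server-supplied values
ratio = 26_160_814_030.021271 / 5_310_000_000         # ~4.9, the "(???)" CPU factor above
```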
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
"BOINC Manager estimates that it will take 17:15 minutes to run as 4-core WU"
"I now have estimated times of 12 mins on tasks which take 1 hour, so it's still not very accurate."
"I'm rather sure there is some kind of 'resonance vibration'"

I think the problem (inaccurate estimation of the time to complete a task) occurs when using a different number of cores in the LHC@Home preferences and in the app_config file, or in the transition period after changing either of those two parameters. It seems that if both settings differ, the estimation remains inaccurate.

We are the product of random evolution.
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
"BOINC Manager estimates that it will take 17:15 minutes to run as 4-core WU"

I set all core variables back to 1 core to minimise the influence. If you are right, it will be problematic to run subprojects with different settings, e.g. ATLAS as 2-core and CMS as 1-core. What remains is the flops averaging that is done on the server and the runtime differences of individual WUs.
Send message Joined: 31 May 17 Posts: 2 Credit: 3,027,429 RAC: 0 |
Hi! I have maybe a dumb question :)

Is the "Estimated computation size" which appears in the properties of a particular task in the BOINC client the total number of floating-point operations needed to complete it? If yes, then the unit used there (GFLOPs) is misleading. GFLOPS is usually used to describe the performance of a particular machine (giga floating-point operations per second), not the computation size of a task. Am I right?

Best regards, Mroowa
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
"Is the 'Estimated computation size' which appears in the properties of a particular task in the BOINC client the total number of floating-point operations needed to complete it?"

Not exactly. For ATLAS it is a fixed average value for every task type.

"If yes, then the unit used there (GFLOPs) is misleading. GFLOPS is usually used to describe the performance of a particular machine, not the computation size of a task. Am I right?"

The definition used by BOINC can be found here: http://boinc.berkeley.edu/trac/wiki/CreditNew#Anewsystemforruntimeestimationandcredit

See: FLOPs vs. FLOPS :-)
Send message Joined: 24 Oct 04 Posts: 1127 Credit: 49,750,905 RAC: 9,376 |
|
Send message Joined: 31 May 17 Posts: 2 Credit: 3,027,429 RAC: 0 |
Ah, so it is a matter of letter case :) Well, quite confusing :P Anyway, thank you for your answers!

Best regards, Mroowa
©2024 CERN