Message boards :
ATLAS application :
Runtime Calculation is far too high
Message board moderation
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
WU runtime calculation inside the BOINC client is based on 2 main parameters:
- <rsc_fpops_est>
- <flops>

Both parameters are received from the server as part of a sched_reply_*.xml. <rsc_fpops_est> is provided as part of the project description. <flops> is an averaged value stored in the server DB as part of the <app_version> description per host, and is only slowly adapted after the client has returned a couple of WUs.

The most recent ATLAS WUs seem to have a <rsc_fpops_est> value that is far too high. As a result, my client estimates a runtime of more than 20 h per WU, whereas the real runtime was about 6 h (1 core).

Consequences:
- On the client, the WU cache is marked as "full" although it isn't. This disturbs the project balance in a mixed-project environment.
- On the server, it influences the app scheduling as well as the credit calculation, according to the BOINC documentation.

Request: the project team should invest more effort in estimating correct values for <rsc_fpops_est>, especially as there are - at the moment - more than 10 different types of ATLAS WUs in progress.
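The division described above is simple enough to sketch directly. A minimal illustration of how the client derives its initial estimate from the two server-supplied values (the numbers below are illustrative, not from any particular host):

```python
# Sketch: the BOINC client's initial runtime estimate is simply
# <rsc_fpops_est> divided by <flops>. Illustrative values only.
rsc_fpops_est = 43_200_000_000_000   # total FLOPs the server estimates for the WU
flops = 2_000_000_000.0              # per-host speed estimate from <app_version>

estimated_runtime_s = rsc_fpops_est / flops
print(f"Initial estimate: {estimated_runtime_s / 3600:.1f} h")  # 6.0 h
```

An inflated <rsc_fpops_est> therefore inflates the estimate linearly, which is why the client's cache fills up early.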
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
The runtime estimate is automatically determined by upstream systems using the history of completed tasks. Since some tasks take a very long time, the estimate has increased to its current high value. While we try to tune the algorithm on the ATLAS side, I've put a cap of 4 hours on the runtime estimate - BOINC weights this by the CPU speed, so it may be more or less than 4 hours.
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
I have a task with "Estimated computation size" at 43,200 GFLOPs which took "4 hours 39 min 29 sec" of CPU time to complete. But BOINC Manager estimates that it will take 17:15 minutes to run as a 4-core WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=132039446

I have the impression that the "Estimated computation size" is too low. If I understand correctly, this estimation is linked to the taskID, which in this case is 10995522. The tasks I had some 2 days ago were taskID=10995530, which also took around "4 hours 30 min" of CPU time to complete, but BOINC Manager was more accurate in estimating the time to complete (around 2 hours, if I remember correctly). But I don't know what the "Estimated computation size" was for those taskID=10995530 tasks. Is there a way to check it from the Stderr output?

We are the product of random evolution.
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
You may check your client_state.xml file (carefully! Do not write to it while BOINC is running!). Locate the <app_version> sections and divide <rsc_fpops_est> by <flops>. The result is the (initial) remaining runtime of the WU as it is displayed by your client.
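For anyone who wants to automate that lookup, here is a minimal sketch under a simplified, assumed tag layout (one <flops> per <app_version>, one <rsc_fpops_est> per <workunit>, matched by app_name for illustration - this is not the exact client logic). Work on a copy of the file; never touch the live client_state.xml while BOINC is running.

```python
# Sketch (assumed, simplified layout): compute the initial runtime estimate
# for each workunit from a COPY of client_state.xml.
import xml.etree.ElementTree as ET

def initial_estimates(root):
    """Return {workunit name: initial runtime estimate in seconds}."""
    # <flops> per <app_version>, keyed by app_name (illustrative matching)
    flops = {av.findtext("app_name"): float(av.findtext("flops"))
             for av in root.iter("app_version")}
    return {wu.findtext("name"):
            float(wu.findtext("rsc_fpops_est")) / flops[wu.findtext("app_name")]
            for wu in root.iter("workunit")
            if wu.findtext("app_name") in flops}

# Usage (on a copy of the file):
# root = ET.parse("client_state_copy.xml").getroot()
# for name, secs in initial_estimates(root).items():
#     print(f"{name}: {secs / 3600:.1f} h")
```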
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
"Divide <rsc_fpops_est> by <flops>."

Thanks. I think I found where the issue is coming from. The <flops>a very large number</flops> tag for ATLAS in LHC@Home shows a ridiculously high value of nearly 40 GFLOPS, whereas all the other applications have 10 GFLOPS or less. I am not sure where this figure came from, but it is slowly decreasing, so I guess it will adjust itself, as David said.

We are the product of random evolution.
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
I'm rather sure there is some kind of "resonance vibration", as both parameters change with every calculated WU and, last but not least, also influence the credit calculation.

BOINC documentation: Job runtime estimation
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
I now have estimated times of 12 mins on tasks which take 1 hour, so it's still not very accurate. We submit the tasks with an estimated FLOPs of estimated time * 3*10^9, assuming the average CPU can do 3 GFLOPS. As mentioned above, I capped the time at 4 hours, so I've no idea how this gets reduced to 12 mins. I guess it could be divided by the number of cores, but that still leaves a big difference. But at least the credit per task went up :)
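The submission-side arithmetic described above can be sketched as follows (the constant and function name are illustrative, not the actual server code). Note that the 4-hour cap reproduces exactly the 43,200 GFLOPs "Estimated computation size" reported earlier in this thread:

```python
# Sketch of the submission-side estimate: rsc_fpops_est = expected runtime
# times an assumed 3 GFLOPS average CPU. Names are illustrative.
ASSUMED_CPU_FLOPS = 3e9  # "the average CPU can do 3 GFLOPS"

def rsc_fpops_est(expected_runtime_s):
    return expected_runtime_s * ASSUMED_CPU_FLOPS

# With the 4-hour cap mentioned above:
print(rsc_fpops_est(4 * 3600))  # 4.32e13 FLOPs = 43,200 GFLOPs
```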
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
I patched one WU locally to see what happens:

<rsc_fpops_est>50000000000000</rsc_fpops_est> # 50,000 GFLOPs workunit effort; rough rule-of-thumb value based on my previous runtimes
<flops>5310000000.000000</flops> # 5.3 GFLOPS CPU performance; from CPU benchmark (rounded)
<avg_ncpus>1.000000</avg_ncpus> # BOINC sees 1 CPU
<cmdline>--nthreads 2</cmdline> # VM runs on 2 CPUs

Calculated runtime: 9,416 seconds (2 h 37 min)
Real runtime: 9,379.54 seconds (2 h 36 min)
Remark: at the end the WU still had 1.5 h left.

Real values from the most recent request/reply:

<rsc_fpops_est>43200000000000.000000</rsc_fpops_est> # 43,200 GFLOPs workunit effort; yesterday's WUs had more than 1,800,000 GFLOPs
<flops>26160814030.021271</flops> # 26.2 GFLOPS CPU performance; estimated by the project server -> 4.9 CPUs (???)

Credits:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=132051498 9,379.54 seconds, 346.34 credits
https://lhcathome.cern.ch/lhcathome/result.php?resultid=132229331 9,031.70 seconds, 1,432.91 credits (very nice!!!)

Without exact knowledge of the server code and the role of the upstream systems it is not possible to make any suggestion. At least one thing is clear: there are close dependencies between estimated runtime, calculated CPU performance and credits.
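The two estimates quoted above can be reproduced directly from the posted numbers (estimate = rsc_fpops_est / flops), which also shows why the server values lead to far-too-short client estimates:

```python
# Reproducing the runtime estimates from the values quoted above.
patched = 50_000_000_000_000 / 5_310_000_000          # ~9,416 s (2 h 37 min), the locally patched WU
server = 43_200_000_000_000 / 26_160_814_030.021271   # ~1,651 s (~28 min), the server-supplied values
ratio = 26_160_814_030.021271 / 5_310_000_000         # ~4.9, the "(???)" CPU factor above
```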
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
"BOINC Manager estimates that it will take 17:15 minutes to run as 4-core WU"
"I now have estimated times of 12 mins on tasks which take 1 hour, so it's still not very accurate."
"I'm rather sure there is some kind of 'resonance vibration'"

I think the problem (inaccurate estimation of the time to complete a task) occurs when using a different number of cores in the LHC@Home preferences and in the app_config file, or in the transition period after changing either of those two parameters. It seems that if both settings differ, the estimation remains inaccurate.

We are the product of random evolution.
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
"BOINC Manager estimates that it will take 17:15 minutes to run as 4-core WU"

I set all core variables back to 1 core to minimise the influence. If you are right, it will be problematic to run subprojects with different settings, e.g. ATLAS as 2-core and CMS as 1-core. What remains is the flops averaging that is done on the server and the runtime differences of individual WUs.
Send message Joined: 31 May 17 Posts: 2 Credit: 3,027,429 RAC: 0 |
Hi! I have maybe a dumb question :)

Is the "Estimated computation size" which appears in the properties of a particular task in the BOINC client the total number of floating-point operations needed to complete it? If yes, then the unit used there (GFLOPs) is misleading. GFLOPS is usually used to describe the performance of a particular machine (giga floating-point operations per second), not the computation size of a task. Am I right?

Best regards, Mroowa
Send message Joined: 15 Jun 08 Posts: 2413 Credit: 226,446,710 RAC: 132,279 |
"Is the 'Estimated computation size' which appears in the properties of a particular task in the BOINC client the total number of floating-point operations needed to complete it?"

Not exactly. For ATLAS it is a fixed average value for every task type.

"If yes, then the unit used there (GFLOPs) is misleading. GFLOPS is usually used to describe the performance of a particular machine, not the computation size of a task. Am I right?"

The definition used by BOINC can be found here: http://boinc.berkeley.edu/trac/wiki/CreditNew#Anewsystemforruntimeestimationandcredit

See: FLOPs vs. FLOPS :-)
Send message Joined: 24 Oct 04 Posts: 1127 Credit: 49,750,905 RAC: 9,376 |
|
Send message Joined: 31 May 17 Posts: 2 Credit: 3,027,429 RAC: 0 |
Ah, so it is a matter of letter case :) Well, quite confusing :P Anyway, thank you for your answers!

Best regards, Mroowa
©2024 CERN