Message boards : ATLAS application : Runtime Calculation is far too high

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,007,210
RAC: 136,092
Message 29781 - Posted: 3 Apr 2017, 8:05:41 UTC

WU runtime calculation inside the BOINC client is based on 2 main parameters:
- <rsc_fpops_est>
- <flops>

Both parameters are received from the server as part of a sched_reply_*.xml.

<rsc_fpops_est> is provided as part of the project description.

<flops> is an averaged value stored per host in the server DB as part of the <app_version> description; it is only slowly adjusted after the client has returned a couple of WUs.

The most recent ATLAS WUs seem to have a <rsc_fpops_est> value that is far too high.
As a result my client estimates a runtime of more than 20h per WU whereas the real runtime was about 6h (1 core).
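
To make the arithmetic concrete, a quick Python sketch with made-up numbers (assuming a host speed of roughly 3 GFLOPS per core, not my actual sched_reply values):

# runtime estimate as the client computes it: <rsc_fpops_est> / <flops>
flops = 3.0e9                      # assumed host speed: ~3 GFLOPS per core
print(6.5e13 / flops / 3600.0)     # ~6 h  - roughly what these WUs actually need
print(2.2e14 / flops / 3600.0)     # ~20 h - what an inflated <rsc_fpops_est> produces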


Consequences

On the client the WU cache is marked as "full" although it isn't.
This disturbs the project balance in a mixed project environment.

On the server it influences app scheduling as well as the credit calculation, according to the BOINC documentation.


Request

The project team should invest more effort in estimating correct values for <rsc_fpops_est>, especially as there are currently more than 10 different types of ATLAS WUs in progress.
ID: 29781
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29788 - Posted: 3 Apr 2017, 12:39:10 UTC - in response to Message 29781.  

The runtime estimate is automatically determined by upstream systems using the history of completed tasks. Since some tasks take a very long time, the estimate has increased to its current high value. While we try to tune the algorithm on the ATLAS side, I've put a cap of 4 hours on the runtime estimate - BOINC weights this by the CPU speed, so it may be more or less than 4 hours.
ID: 29788
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29795 - Posted: 3 Apr 2017, 17:53:14 UTC

I have a task with "Estimated computation size" at 43,200 GFLOPs which took "4 hours 39 min 29 sec" of CPU time to complete, but BOINC Manager estimates that it will take 17:15 minutes to run as a 4-core WU:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=132039446
I have the impression that the "Estimated computation size" is too low.

If I understand correctly, this estimation is linked to the taskID, which in this case is 10995522.

The tasks I had about 2 days ago were taskID=10995530; they also took around "4 hours 30 min" of CPU time to complete, but BOINC Manager was more accurate in estimating the time to complete (around 2 hours, if I remember correctly).
But I don't know what the "Estimated computation size" was for those taskID=10995530 tasks. Is there a way to check from the Stderr output?
We are the product of random evolution.
ID: 29795
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,007,210
RAC: 136,092
Message 29797 - Posted: 3 Apr 2017, 18:26:22 UTC - in response to Message 29795.  

You may check your client_state.xml file (carefully! Do not write to it while BOINC is running!).
Locate the following sections:

<app_version>
<app_name>ATLAS</app_name>
<version_num>101</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>1.000000</avg_ncpus>
<max_ncpus>1.000000</max_ncpus>
<flops>a very large number</flops>
<!-- a couple of other tags --> ...
</app_version>


<workunit>
<name>Name of the workunit you are looking for</name>
<app_name>ATLAS</app_name>
<version_num>101</version_num>
<rsc_fpops_est>another very large number</rsc_fpops_est>
<!-- a couple of other tags --> ...
</workunit>


Divide <rsc_fpops_est> by <flops>.
The result is the (initial) remaining runtime of the WU as it is displayed by your client.
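
If you do not want to read the file by hand, here is a small read-only Python sketch (the tag layout is assumed to match the snippets above; adjust the path to your BOINC data directory):

import xml.etree.ElementTree as ET

# Work on a copy of client_state.xml, never on the live file while BOINC is running.
root = ET.parse("client_state.xml").getroot()

flops = None
for av in root.iter("app_version"):
    if av.findtext("app_name") == "ATLAS":
        flops = float(av.findtext("flops"))          # speed the server assumes for this host

for wu in root.iter("workunit"):
    if flops and wu.findtext("app_name") == "ATLAS":
        est = float(wu.findtext("rsc_fpops_est"))    # estimated size of the workunit
        print(wu.findtext("name"), est / flops / 3600.0, "h")   # initial runtime estimate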
ID: 29797
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29798 - Posted: 3 Apr 2017, 18:57:52 UTC

Divide <rsc_fpops_est> by <flops>.

Thanks. I think I found where the issue is coming from. The <flops> tag for ATLAS in LHC@Home shows a ridiculously high value of nearly 40 GFLOPS, whereas all the other applications have 10 GFLOPS or less.
I am not sure where this figure came from, but it is slowly decreasing, so I guess it will adjust itself as David said.
We are the product of random evolution.
ID: 29798
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,007,210
RAC: 136,092
Message 29800 - Posted: 3 Apr 2017, 19:34:59 UTC - in response to Message 29798.  

I'm rather sure there is some kind of "resonance vibration", as both parameters change with every completed WU and, not least, also influence the credit calculation.

BOINC documentation:
Job runtime estimation
... The new system has a large overlap with the new credit system ...
ID: 29800
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29812 - Posted: 4 Apr 2017, 10:36:59 UTC - in response to Message 29800.  

I now have estimated times of 12 mins on tasks which take 1 hour, so it's still not very accurate.

We submit the tasks with an estimated flops of estimated time * 3*10^9, assuming the average CPU can do 3 GFLOPS. As mentioned above, I capped the time at 4 hours, so I've no idea how this gets reduced to 12 mins. I guess it could be divided by the number of cores, but that still leaves a big difference.
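
In pseudo-Python the arithmetic looks roughly like this (a simplified sketch of what is described above, not the actual submission code; the names are made up):

REF_SPEED = 3.0e9                  # assumed "average CPU": 3 GFLOPS
CAP = 4 * 3600                     # cap on the runtime estimate, in seconds

def rsc_fpops_est(estimated_runtime_s):
    return min(estimated_runtime_s, CAP) * REF_SPEED

print(rsc_fpops_est(4 * 3600))     # 4.32e13 FLOPs = 43,200 GFLOPs, as seen in the current tasks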

But at least the credit per task went up :)
ID: 29812
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,007,210
RAC: 136,092
Message 29815 - Posted: 4 Apr 2017, 11:51:51 UTC

I patched one WU locally to see what happens.
<rsc_fpops_est>50000000000000</rsc_fpops_est> # 50,000 GFLOPs workunit effort; rough guess based on my previous runtimes
<flops>5310000000.000000</flops> # 5.3 GFLOPS CPU performance; from CPU benchmark (rounded)
<avg_ncpus>1.000000</avg_ncpus> # BOINC sees 1 CPU
<cmdline>--nthreads 2</cmdline> # VM runs on 2 CPUs

Calculated runtime: 9,416 seconds (2 h 37 min)
Real runtime: 9,379.54 seconds (2 h 36 min)

Remark:
At the end the WU still had 1.5 h of estimated runtime left.


Real values from the most recent request/reply:
<rsc_fpops_est>43200000000000.000000</rsc_fpops_est> # 43,200 GFLOPs workunit effort; yesterday's WUs had more than 1,800,000 GFLOPs
<flops>26160814030.021271</flops> # 26.2 GFLOPS CPU performance; estimated by the project server -> 4.9 CPUs (???)
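
For comparison, the same division applied to both sets of numbers (plain Python, just the arithmetic):

print(50000e9 / 5.31e9)              # ~9,416 s - calculated runtime of the patched WU
print(43200e9 / 26160814030.021271)  # ~1,651 s (~28 min) - what the server-side values imply
print(26160814030.021271 / 5.31e9)   # ~4.9 - server <flops> divided by my benchmark result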



Credits

https://lhcathome.cern.ch/lhcathome/result.php?resultid=132051498
9,379.54 seconds, 346.34 credits

https://lhcathome.cern.ch/lhcathome/result.php?resultid=132229331
9,031.70 seconds, 1,432.91 credits (very nice!!!)


Without exact knowledge of the server code and of the role of the upstream systems it is not possible to make a concrete suggestion.
At least one thing is clear: there are close dependencies between the estimated runtime, the calculated CPU performance and the credits.
ID: 29815
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29834 - Posted: 5 Apr 2017, 18:52:54 UTC

BOINC Manager estimates that it will take 17:15 minutes to run as a 4-core WU

I now have estimated times of 12 mins on tasks which take 1 hour, so it's still not very accurate.

I'm rather sure there is some kind of "resonance vibration"

I think the problem (an inaccurate estimate of the time to complete a task) occurs when the number of cores set in the LHC@Home preferences differs from the one in the app_config file, or during the transition period after changing either of those two parameters.
It seems that as long as the two settings differ, the estimate remains inaccurate.
We are the product of random evolution.
ID: 29834
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,007,210
RAC: 136,092
Message 29835 - Posted: 5 Apr 2017, 19:23:08 UTC - in response to Message 29834.  

I think the problem (an inaccurate estimate of the time to complete a task) occurs when the number of cores set in the LHC@Home preferences differs from the one in the app_config file, or during the transition period after changing either of those two parameters.
It seems that as long as the two settings differ, the estimate remains inaccurate.

I set all core variables back to 1 core to minimise the influence.

If you are right, it will be problematic to run subprojects with different settings, e.g. ATLAS as 2-core and CMS as 1-core.
What remains is the flops averaging that is done on the server and the runtime differences between individual WUs.
ID: 29835
Mroowa

Joined: 31 May 17
Posts: 2
Credit: 3,027,429
RAC: 0
Message 34376 - Posted: 13 Feb 2018, 19:58:51 UTC
Last modified: 13 Feb 2018, 20:10:46 UTC

Hi!

Maybe a dumb question :) Is the "Estimated computation size" that appears in the properties of a particular task in the BOINC client the total number of floating-point operations needed to complete it?
If yes, then the unit used there (GFLOPs) is misleading. GFLOPs is usually used to describe the performance of a particular machine (giga floating-point operations per second), not the computation size of a task. Am I right?

Best regards,
Mroowa
ID: 34376
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,007,210
RAC: 136,092
Message 34377 - Posted: 13 Feb 2018, 21:40:55 UTC - in response to Message 34376.  

Is the "Estimated computation size" that appears in the properties of a particular task in the BOINC client the total number of floating-point operations needed to complete it?

Not exactly.
For ATLAS it is a fixed average value for each task type.


If yes, then the unit used there (GFLOPs) is misleading. GFLOPs is usually used to describe the performance of a particular machine (giga floating-point operations per second), not the computation size of a task. Am I right?

The definition that is used by BOINC can be found here:
http://boinc.berkeley.edu/trac/wiki/CreditNew#Anewsystemforruntimeestimationandcredit
See: FLOPs vs. FLOPS
:-)
ID: 34377
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,363
RAC: 3,820
Message 34379 - Posted: 14 Feb 2018, 2:39:28 UTC - in response to Message 34377.  


The definition that is used by BOINC can be found here:
http://boinc.berkeley.edu/trac/wiki/CreditNew#Anewsystemforruntimeestimationandcredit
See: FLOPs vs. FLOPS
:-)

ID: 34379
Mroowa

Joined: 31 May 17
Posts: 2
Credit: 3,027,429
RAC: 0
Message 34380 - Posted: 14 Feb 2018, 11:50:13 UTC - in response to Message 34377.  
Last modified: 14 Feb 2018, 11:52:00 UTC

Ah, so it is a matter of letter case :) Well, quite confusing :P
Anyway, thank you for your answers!

Best regards,
Mroowa
ID: 34380
