Message boards : ATLAS application : Atlas task slowing right down near the end but still using all cores - continue?

Previous · 1 · 2 · 3 · Next

AuthorMessage
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46393 - Posted: 1 Mar 2022, 17:08:18 UTC - in response to Message 46390.  

As CP mentioned each ATLAS task processes 200 events from a pool.
It has struck me before, that changing the task's pool size to 180 or 240 events would give better divisibility.
There is no point in this; the event times are random. You have no idea what will be happening at the end. The workers don't all process events at the same rate, so it doesn't matter whether the pool size is divisible. Imagine you're a foreman with several workers and 200 jobs that need doing: some take 5 minutes, some half an hour, at random. Who cares if 200 is divisible by the number of workers? That would only be important if each job took precisely the same amount of time, and they don't. At the end of a 6-core Atlas, you'll have 1 core idle for an unknown amount of time, then 2 idle, then 3, then 4, then 5. If you made two 3-core Atlases instead, each one would have 1 core idle for an unknown amount of time, then 2. So it's pretty much the same.
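The foreman argument above can be sketched as a toy simulation. The uniform 5–30 minute job times are an assumption taken from the analogy, not real ATLAS timings, and the greedy "next free worker grabs the next event" model is a simplification:

```python
import random

def tail_idle_time(workers, events, draw=lambda: random.uniform(5, 30)):
    """Greedy shared-pool model: each free worker grabs the next event.
    Returns the total worker-idle time accumulated at the end of the task."""
    busy_until = [0.0] * workers          # per-worker busy-until clocks
    for _ in range(events):
        i = min(range(workers), key=lambda w: busy_until[w])
        busy_until[i] += draw()
    finish = max(busy_until)
    # idle time = how long each worker waits after the pool runs dry
    return sum(finish - t for t in busy_until)

random.seed(1)
one_six_core = tail_idle_time(6, 200)
two_three_core = tail_idle_time(3, 200) + tail_idle_time(3, 200)
print(f"one 6-core task, tail idle:   {one_six_core:.1f} min")
print(f"two 3-core tasks, tail idle:  {two_three_core:.1f} min")
```

Both configurations accumulate some tail idle time; a single worker, by contrast, never idles.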
ID: 46393
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2099
Credit: 161,785,714
RAC: 131,605
Message 46394 - Posted: 1 Mar 2022, 17:10:00 UTC

I revised my earlier comments and think it can better be explained by looking at the modulo (%) results:

                events % threads
threads      180    200    240
      1        0      0      0
      2        0      0      0
      3        0      2      0
      4        0      0      0
      5        0      0      0
      6        0      2      0
      7        5      4      2
      8        4      0      0
     12        0      8      0

The values show how many events are left in the pool (long term average!) when the last full series is finished.

Nonetheless, on a 4-core CPU a 3-core setup can still be more efficient if the long-term average time to process a single event is short enough.
This needs to be tested on each computer individually.
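The modulo values in the table above can be reproduced with a couple of lines of Python:

```python
pools = (180, 200, 240)

# remainder = events left in the pool once the last full round of
# `t` parallel events has been handed out (long-term average view)
for t in (1, 2, 3, 4, 5, 6, 7, 8, 12):
    print(f"{t:>2}", *(f"{p % t:>5}" for p in pools))
```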
ID: 46394
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2099
Credit: 161,785,714
RAC: 131,605
Message 46396 - Posted: 1 Mar 2022, 17:13:03 UTC - in response to Message 46393.  

... they are random sizes.

You are looking at just 1 task, but you would have to look at the long term averages.
Really huge numbers!
ID: 46396
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46397 - Posted: 1 Mar 2022, 17:13:51 UTC - in response to Message 46394.  

You're bound to come close to the end with one worker having just taken the last event from the pool while the other workers are part way through at different stages. You will always get idle workers at the end; there's nothing you can do about it.
ID: 46397
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46398 - Posted: 1 Mar 2022, 17:15:30 UTC - in response to Message 46396.  
Last modified: 1 Mar 2022, 17:16:19 UTC

... they are random sizes.

You are looking at just 1 task, but you would have to look at the long term averages.
Really huge numbers!
Ok, if you've looked at huge amounts of stats, I guess anything could happen. I'm surprised there's a difference though, considering the wide variance in event times. Was the wide variance I saw unusual? Are events usually pretty much the same length?

Also, do you have a figure for how much time is wasted? Since it's 200 in the pool, the wasted cores at the end are probably a fraction of a percent of inefficiency.
ID: 46398
Henry Nebrensky

Joined: 13 Jul 05
Posts: 162
Credit: 14,768,010
RAC: 6
Message 46399 - Posted: 2 Mar 2022, 1:32:47 UTC - in response to Message 46398.  

... they are random sizes.
You are looking at just 1 task, but you would have to look at the long term averages.
Really huge numbers!
... I'm surprised there's a difference though, considering the wide variance in event times.
Even within a task, as the number of threads is reduced, each thread must run more events, and it's more likely - but not guaranteed - that they will average out across the threads.
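That averaging effect can be checked with a toy calculation. The uniform 5–30 minute event times are an assumption for illustration, not real ATLAS data:

```python
import random
import statistics

def rel_spread(events_per_thread, threads=8, trials=200):
    """Relative spread (max minus min, over the mean) of per-thread
    total runtime, assuming each event takes a uniform 5-30 minutes."""
    spreads = []
    for _ in range(trials):
        totals = [sum(random.uniform(5, 30) for _ in range(events_per_thread))
                  for _ in range(threads)]
        spreads.append((max(totals) - min(totals)) / statistics.mean(totals))
    return statistics.mean(spreads)

random.seed(42)
for n in (5, 25, 100):
    print(n, round(rel_spread(n), 3))
# the spread shrinks roughly as 1/sqrt(n): more events per thread,
# better averaging across the threads
```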

Was the wide variance I saw unusual?
I've no idea.

Also, do you have a figure for how much time is wasted? Since it's 200 in the pool, the wasted cores at the end are probably a fraction of a percent of inefficiency.
IIRC, back when I was running 8-core native Atlas I would generally see the active threads reduce over usually 1-2 minutes, 5 if slow, for tasks of about 4 hrs total wall-clock. (There might be numbers in some ancient post here, but the laptop's tired tonight :( )
ID: 46399
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46400 - Posted: 2 Mar 2022, 5:56:17 UTC - in response to Message 46399.  

... I'm surprised there's a difference though, considering the wide variance in event times.
Even within a task, as the number of threads is reduced, each thread must run more events, and it's more likely - but not guaranteed - that they will average out across the threads.
Even with the full 8 cores, that's 25 events per thread, which is more than enough to average things out.

Also, do you have a figure for how much time is wasted? Since it's 200 in the pool, the wasted cores at the end are probably a fraction of a percent of inefficiency.
IIRC, back when I was running 8-core native Atlas I would generally see the active threads reduce over usually 1-2 minutes, 5 if slow, for tasks of about 4 hrs total wall-clock. (There might be numbers in some ancient post here, but the laptop's tired tonight :( )
A few minutes in 4 hours is nothing.
ID: 46400
Henry Nebrensky

Joined: 13 Jul 05
Posts: 162
Credit: 14,768,010
RAC: 6
Message 46402 - Posted: 2 Mar 2022, 21:23:53 UTC - in response to Message 46400.  

Even within a task, as the number of threads is reduced then each thread must run more events, and it's more likely - but not guaranteed - that they will average out across the threads.
Even with the full 8 cores, that's 25 events per thread, which is more than enough to average things out.
Doesn't that depend on the variance, which I've never studied? In any case, the aim is that the averaging overcomes the variance, which is why divisibility is important for avoiding a small number of events left over.

Also, do you have a figure for how much time is wasted? Since it's 200 in the pool, the wasted cores at the end are probably a fraction of a percent of inefficiency.
IIRC, back when I was running 8-core native Atlas I would generally see the active threads reduce over usually 1-2 minutes, 5 if slow, for tasks of about 4 hrs total wall-clock. (There might be numbers in some ancient post here, but the laptop's tired tonight :( )
A few minutes in 4 hours is nothing.
Thank you - I did put some effort into setting those machines up...
ID: 46402
Henry Nebrensky

Joined: 13 Jul 05
Posts: 162
Credit: 14,768,010
RAC: 6
Message 46403 - Posted: 2 Mar 2022, 21:28:20 UTC - in response to Message 46391.  

I would vote for 240.
Actually, I'd vote for 360 - them Babylonians knew what they were doing - but if people are already struggling with compute times then it would be better to stick to low hanging fruit. :(
ID: 46403
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46411 - Posted: 3 Mar 2022, 14:52:57 UTC - in response to Message 46402.  

Doesn't that depend on the variance, which I've never studied?
On the one I looked at there was a factor of 10 in the times for each event.

In any case, the aim is that the averaging overcomes the variance, which is why divisibility is important for avoiding a small number of events left over.
Surely average overcoming variance would mean it's just as likely to end up with an odd number?

Thank you - I did put some effort into setting those machines up...
I can't tell if that's sarcastic.
ID: 46411
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46412 - Posted: 3 Mar 2022, 14:54:17 UTC - in response to Message 46403.  

I would vote for 240.
Actually, I'd vote for 360 - them Babylonians knew what they were doing - but if people are already struggling with compute times then it would be better to stick to low hanging fruit. :(
Are they the ones responsible for clocks? It might divide better, so a quarter of an hour is a whole number of minutes, but decimal is so much easier for humans to calculate in their heads, which is why we've pretty much stopped using inches, furlongs, etc.
ID: 46412
maeax

Joined: 2 May 07
Posts: 1663
Credit: 94,407,441
RAC: 316,432
Message 46413 - Posted: 3 Mar 2022, 15:09:57 UTC - in response to Message 46411.  

This is what David wrote in his wishes for this year:
2021 has been another strange and challenging year, but thanks to you all the ATLAS experiment has been able to continue to produce more groundbreaking physics results. This year you simulated a total of 3 billion events! At 200 events per WU that's 15 million WU crunched. To put this into perspective, the total events simulated by all our worldwide computing resources was around 24 billion, so the contribution through LHC@Home is a really significant part of this.
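A quick sanity check of the numbers in the quote, nothing more:

```python
total_events = 3_000_000_000        # "3 billion events"
events_per_wu = 200                 # events per ATLAS work unit
worldwide = 24_000_000_000          # "around 24 billion" across all resources

print(total_events // events_per_wu)   # work units crunched: 15,000,000
print(total_events / worldwide)        # LHC@Home share: 0.125
```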
ID: 46413
Henry Nebrensky

Joined: 13 Jul 05
Posts: 162
Credit: 14,768,010
RAC: 6
Message 46414 - Posted: 3 Mar 2022, 15:30:48 UTC - in response to Message 46412.  

I would vote for 240.
Actually, I'd vote for 360 - them Babylonians knew what they were doing ...
Are they the ones responsible for clocks?
... and angles, (which is where clock faces came from?).
But, actually computezrmle was right: 240 also gets you a division by 16, for future expansion.
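240 does indeed divide evenly by 16 where 200 leaves events over (16 threads here is the hypothetical "future expansion" case, not a current configuration):

```python
# leftover events for 200- vs 240-event pools at higher thread counts
for threads in (8, 12, 16):
    print(threads, 200 % threads, 240 % threads)
```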
ID: 46414
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46415 - Posted: 3 Mar 2022, 15:36:30 UTC - in response to Message 46413.  

This is what David wrote in his wishes for this year:
2021 has been another strange and challenging year, but thanks to you all the ATLAS experiment has been able to continue to produce more groundbreaking physics results. This year you simulated a total of 3 billion events! At 200 events per WU that's 15 million WU crunched. To put this into perspective, the total events simulated by all our worldwide computing resources was around 24 billion, so the contribution through LHC@Home is a really significant part of this.
Where are the other 21 billion being done?
ID: 46415
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46416 - Posted: 3 Mar 2022, 15:36:56 UTC - in response to Message 46414.  
Last modified: 3 Mar 2022, 15:38:36 UTC

I would vote for 240.
Actually, I'd vote for 360 - them Babylonians knew what they were doing ...
Are they the ones responsible for clocks?
... and angles, (which is where clock faces came from?).
But, actually computezrmle was right: 240 also gets you a division by 16, for future expansion.
I'm yet to be convinced it actually matters. 16 cores doing 240 events would likely still end up with half the cores waiting as some events were longer. Only single core ATLAS tasks are efficient, but the amount of RAM used and the amount of disk activity to set them up negate that.
ID: 46416
Harri Liljeroos
Joined: 28 Sep 04
Posts: 594
Credit: 35,773,385
RAC: 18,824
Message 46417 - Posted: 3 Mar 2022, 18:59:04 UTC - in response to Message 46415.  
Last modified: 3 Mar 2022, 18:59:16 UTC

This is what David wrote in his wishes for this year:
2021 has been another strange and challenging year, but thanks to you all the ATLAS experiment has been able to continue to produce more groundbreaking physics results. This year you simulated a total of 3 billion events! At 200 events per WU that's 15 million WU crunched. To put this into perspective, the total events simulated by all our worldwide computing resources was around 24 billion, so the contribution through LHC@Home is a really significant part of this.
Where are the other 21 billion being done?

See here: https://lhcathome.cern.ch/lhcathome/atlas_job.php (the lower graph, for the past month).
ID: 46417
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46419 - Posted: 3 Mar 2022, 21:06:52 UTC - in response to Message 46417.  

Where are the other 21 billion being done?
See here: https://lhcathome.cern.ch/lhcathome/atlas_job.php (the lower graph, for the past month).
What is Vega?
ID: 46419
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2099
Credit: 161,785,714
RAC: 131,605
Message 46420 - Posted: 3 Mar 2022, 21:26:52 UTC - in response to Message 46419.  

ID: 46420
Peter Hucker

Joined: 12 Aug 06
Posts: 294
Credit: 2,100,100
RAC: 2,891
Message 46421 - Posted: 3 Mar 2022, 21:35:44 UTC - in response to Message 46420.  

What is Vega?
https://indico.cern.ch/event/876794/contributions/4567029/attachments/2327238/3964735/Vega%20GDB.pdf
I'd hate to see their electricity bill. Please tell me Atos isn't the same one that made disabled people in the UK commit suicide.
ID: 46421
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2099
Credit: 161,785,714
RAC: 131,605
Message 46422 - Posted: 3 Mar 2022, 21:44:23 UTC - in response to Message 46421.  

I'd hate to see their electricity bill.

They hired some cyclists.
As a side effect SLO won the Tour de France twice in 2020/2021.
ID: 46422


©2022 CERN