Message boards : ATLAS application : Request for new Default RAM Setting

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1520
Credit: 85,588,574
RAC: 73,866
Message 36038 - Posted: 25 Jul 2018, 5:08:12 UTC

Since last year ATLAS WUs contain 200 events instead of 50.
Most of them now need an initial download (EVNT file) of 350 MB instead of 120 MB.

What did not change was the formula to calculate the default RAM setting.
This causes lots of failed WUs (at the scientific level) every day.

I suggest raising the minimum RAM requirement to at least 4800 MB and changing the RAM formula accordingly.
ID: 36038
Erich56

Joined: 18 Dec 15
Posts: 1306
Credit: 23,560,555
RAC: 8,363
Message 36039 - Posted: 25 Jul 2018, 5:37:00 UTC - in response to Message 36038.  

computezrmle wrote:
I suggest raising the minimum RAM requirement to at least 4800 MB and changing the RAM formula accordingly.

Until this is done (which may take a while, with people on vacation right now), I would strongly suggest that everybody increase the RAM value via app_config.xml.
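
For anyone who has not written one before, here is a minimal sketch of a Python script that generates such an app_config.xml. It is only an illustration, not an official recipe: the app name ATLAS, the plan class vbox64_mt_mcore_atlas and the --memory_size_mb wrapper option are assumptions that should be checked against the entries in your own client_state.xml, and the 2 cores / 4800 MB values are simply the figures suggested in this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(ncpus: int, memory_mb: int) -> str:
    """Return app_config.xml text that requests ncpus cores and memory_mb of RAM."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    # Assumed names: verify both against client_state.xml on your own host.
    ET.SubElement(version, "app_name").text = "ATLAS"
    ET.SubElement(version, "plan_class").text = "vbox64_mt_mcore_atlas"
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumed wrapper option that sets the VM memory size in MB.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config(ncpus=2, memory_mb=4800))
```

The printed text would go into a file named app_config.xml in the LHC@home project folder of the BOINC data directory. The client normally has to re-read its config files (or be restarted) before the new value takes effect.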
ID: 36039
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36041 - Posted: 25 Jul 2018, 13:18:36 UTC - in response to Message 36038.  
Last modified: 25 Jul 2018, 13:20:09 UTC

computezrmle wrote:
Since last year ATLAS WUs contain 200 events instead of 50.
Most of them now need an initial download (EVNT file) of 350 MB instead of 120 MB.

What did not change was the formula to calculate the default RAM setting.
This causes lots of failed WUs (at the scientific level) every day.

I suggest raising the minimum RAM requirement to at least 4800 MB and changing the RAM formula accordingly.


WTF? Is that really what's going on here? They've set up tasks to fail and then, to make matters even worse, they tell everybody with failing tasks that their results verify and leave them with the impression that everything is OK?

WOW.... just... WOW!!!

No wonder nobody trusts scientists anymore. No wonder populism is on the rise. I'm beginning to wonder if we haven't been misled about the Higgs boson too! Maybe this needs to be spread around. Imagine where Fox News and Trump would go with this if they found out... sloppy CERN, failed scientists misleading BOINC volunteers to promote their own failed agenda to keep money rolling in, the Higgs is fake news, product of failed leftist, globalist, elitist, European Union style thinking, entitled CERN scientists and globalists ripping off citizen scientists to sustain the illusion that the LHC is accomplishing something.
ID: 36041
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1520
Credit: 85,588,574
RAC: 73,866
Message 36042 - Posted: 25 Jul 2018, 14:01:56 UTC - in response to Message 36041.  

bronco wrote:
...

Easy!
Keep cool and lean back.

Then read the whole story of why the project grants credit for some types of failed WUs.
This may take a while as the arguments are spread among different threads.
No links here to give you the time to cool down.

Then try to understand that most of the WUs run without problems and the recent situation affects most likely only a "small" number of jobs with special parameter sets.


Nonetheless some measures should be implemented to avoid similar situations in the future.

Happy crunching
ID: 36042
Jim1348

Joined: 15 Nov 14
Posts: 461
Credit: 12,873,158
RAC: 19,539
Message 36043 - Posted: 25 Jul 2018, 14:36:42 UTC - in response to Message 36041.  

bronco wrote:
Maybe this needs to be spread around. Imagine where Fox News and Trump would go with this if they found out... sloppy CERN, failed scientists misleading BOINC volunteers to promote their own failed agenda to keep money rolling in, the Higgs is fake news, product of failed leftist, globalist, elitist, European Union style thinking, entitled CERN scientists and globalists ripping off citizen scientists to sustain the illusion that the LHC is accomplishing something.

Anyone who knows anything about bureaucracy will recognize the symptoms. It takes a while, and a lot of push, to get anything done. (Donald and Fox News make up their own reality anyway; they don't need CERN).
ID: 36043
djoser

Joined: 30 Aug 14
Posts: 118
Credit: 9,148,119
RAC: 9,079
Message 36044 - Posted: 25 Jul 2018, 14:40:07 UTC - in response to Message 36042.  

Hi!

Short question: Can I presume that if a HITS file is generated and successfully copied to the CERN server, I have generated a scientifically useful result?

Thanks and regards,
djoser.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 36044
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1520
Credit: 85,588,574
RAC: 73,866
Message 36045 - Posted: 25 Jul 2018, 15:02:09 UTC - in response to Message 36044.  

djoser wrote:
Hi!

Short question: Can I presume that if a HITS file is generated and successfully copied to the CERN server, I have generated a scientifically useful result?

Thanks and regards,
djoser.

Yes and maybe no
:-)

Yes:
From the perspective of this IT project.

maybe no:
A very difficult question.
Even HITS files that may contain "nonsense" may answer important scientific questions.
ID: 36045
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 598
Credit: 377,989,166
RAC: 34,232
Message 36046 - Posted: 25 Jul 2018, 15:30:47 UTC

Submitters of work can configure the amount of memory used at the per-"job" level, so it's possible that the larger units use more resources and the lighter ones use less.

All data is useful in science.
ID: 36046
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36048 - Posted: 25 Jul 2018, 15:55:48 UTC - in response to Message 36042.  

computezrmle wrote:
Then read the whole story of why the project grants credit for some types of failed WUs.

Give them credits but don't tell them their result verifies when it does not. Why? Because you can advise volunteers all you want about creating an app_config.xml, but when they see their tasks are verifying they're not going to bother with an app_config.xml; they're simply going to say to themselves "my tasks verify, that computezrmle guy doesn't know what he's talking about". There might be a few who look into the problem, but when they read the terse, minimal, available docs regarding implementing an app_config.xml (where to put it, what to put into it) they're going to get discouraged very quickly and pass it off with a "meh, that's LHC's problem not mine". Or maybe they'll just detach from the project.

And that's if they even bother to come here and read the messages. If their tasks are verifying then they likely won't even be coming here to read.

computezrmle wrote:
This may take a while as the arguments are spread among different threads.
No links here to give you the time to cool down.

Already read them. I don't buy any argument that justifies misleading volunteers.

computezrmle wrote:
Then try to understand that most of the WUs run without problems and the recent situation affects most likely only a "small" number of jobs with special parameter sets.

I think Sean Hannity would call that politically correct talk and accuse you of virtue signaling and spreading fake news. He will say the reason you put small in quotes is because you know it's a big number not a small number. Is anybody (other than me) even trying to get a handle on how many are failing?
ID: 36048
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36049 - Posted: 25 Jul 2018, 16:08:03 UTC - in response to Message 36043.  

Jim1348 wrote:
Anyone who knows anything about bureaucracy will recognize the symptoms. It takes a while, and a lot of push, to get anything done.

Too many HEP scientists running the BOINC server. Maybe they could try some rocket scientists for a change and see if they can't do a better job.
ID: 36049
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36050 - Posted: 25 Jul 2018, 16:35:14 UTC - in response to Message 36046.  

Toby Broom wrote:
All data is useful in science.

Very true but some data is more useful than other data. Yes, the HITS file might contain nonsense but something can be learned from that. However if no HITS file then no data, not even nonsense data. The only thing to be learned from a no HITter is that the host doesn't have enough RAM or else needs an app_config.xml or the RAM allocation formula needs to be modified as per the OP. How useful is that knowledge? How many times do they have to learn that before they fix the problem and get more hosts returning the real objective... a HITS file.
ID: 36050
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,566,469
RAC: 1
Message 36057 - Posted: 26 Jul 2018, 8:57:37 UTC - in response to Message 36041.  

Look at how many slots are running ATLAS: https://lhcathome.cern.ch/lhcathome/atlas_job.php

Can you tell me why project scientists might have bigger issues than a few misconfigured volunteer systems? Making this forum a toxic place does not help. I've tried. It's better to move on to another project. It's a numbers game, after all.
ID: 36057
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36072 - Posted: 26 Jul 2018, 18:29:30 UTC - in response to Message 36057.  

AuxRx wrote:
Look at how many slots are running ATLAS: https://lhcathome.cern.ch/lhcathome/atlas_job.php

Can you tell me why project scientists might have bigger issues than a few misconfigured volunteer systems?

I'm not sure what a slot is defined to be for the purposes of those graphs but I assume slot means what it means in BOINC speak... a task. Nodes is probably not the correct term but for want of a better term I'll call them nodes. The legend for the lower graph on that page is 9 rows X 5 columns = 45 "nodes" each of which might have any number of CPUs behind it. BOINC is just 1 of 45. Additionally, BOINC is 1 of 4 assigned a color that is so dark it is indistinguishable (for me) from black but if one looks at how many slots the black nodes are doing it is a relatively small percentage of the whole and BOINC is just a portion of that percentage.

One good question deserves another... Can you tell me why, since the date the events in a task quadrupled, not a single admin has been able to find the 2 minutes it would take to open the config file in a text editor and tweak the RAM calculation up a little? Why is there enough time to quadruple the events but no time to adjust the RAM?

AuxRx wrote:
It's a numbers game, after all.

It's also a political game as in "Oh look how broke we are, we've had to beg for resources from the community. BTW, look how community minded we are letting all these noobs connect their dirty unwashed rigs to help out". Never mind the fact that once they have volunteers hooked they ignore their dependency on a proper RAM allocation, which is a dependency for 99% of volunteers, a dependency fulfilled by 99% of other projects. And then telling them their useless no HITters are useful with a "validated" status and some credits just to keep them hooked, well, that's bloody deplorable.

There are numerous projects with less funding than LHC/CERN who are far more desperate than CERN for computing power. Keeping failing hosts hooked on this project with BS when other projects are starving is totally irresponsible. Sorry if you or anybody else finds that dose of reality "toxic". No, wait, I'm not sorry. Grow a pair and deal with it.
ID: 36072
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,390
RAC: 0
Message 36080 - Posted: 27 Jul 2018, 7:55:25 UTC - in response to Message 36072.  
Last modified: 27 Jul 2018, 7:57:08 UTC

bronco wrote:
...The legend for the lower graph on that page is 9 rows X 5 columns = 45 "nodes" each of which might have any number of CPUs behind it. BOINC is just 1 of 45. ... it is a relatively small percentage of the whole and BOINC is just a portion of that percentage.

The computational needs of CERN/LHC are gigantic, i.e. every single core helps, even if its share is very small. Since CERN did not have enough funds for an enormous computing center of its own, it came up with the idea of the Worldwide LHC Computing Grid (WLCG). It is the most sophisticated data-taking & analysis system ever built for science, providing near real-time access to LHC data, and it is also the largest computing grid on earth, with over 800k cores and 170 computing centres. So yes, BOINC's share is rather small.

Taken from the homepage:
"Data pours out of the LHC detectors at a blistering rate. Even after filtering out 99% of it, in 2018 we're expecting to gather around 50 petabytes of data. That's 50 million gigabytes, the equivalent to nearly 15 million high-definition (HD) movies. The scale and complexity of data from the LHC is unprecedented. This data needs to be stored, easily retrieved and analysed by physicists all over the world. This requires massive storage facilities, global networking, immense computing power, and, of course, funding. CERN does not have the computing or financial resources to crunch all of the data on site, so in 2002 it turned to grid computing to share the burden with computer centres around the world. The result, the Worldwide LHC Computing Grid (WLCG), is a distributed computing infrastructure arranged in tiers – giving a community of over 10,000 physicists near real-time access to LHC data. The WLCG builds on the ideas of grid technology initially proposed in 1999 by Ian Foster and Carl Kesselman. CERN currently provides around 20% of the global computing resources."
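
A quick sanity check of the figures in that quote, assuming decimal units (1 PB = 1,000,000 GB). The implied size per HD movie is derived here for illustration and is not stated on the homepage.

```python
PETABYTE_IN_GB = 1_000_000  # decimal units: 1 PB = 10**6 GB

data_pb = 50          # expected 2018 data volume from the quote
movies = 15_000_000   # "nearly 15 million" HD movies from the quote

data_gb = data_pb * PETABYTE_IN_GB
gb_per_movie = data_gb / movies

print(f"{data_pb} PB = {data_gb:,} GB")
print(f"Implied size per HD movie: {gb_per_movie:.1f} GB")
```

So the "50 million gigabytes" figure matches 50 PB, and the comparison works out to a little over 3 GB per movie.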

Considering all these aspects (and plenty more that are not mentioned), you can see that the entire setup is extremely complex. Since this BOINC project is part of the WLCG, you can imagine that setting it up correctly is a challenging task (btw: as far as I know, the tasks running on this BOINC project are simulation tasks, not data analysis tasks).

So comparing this BOINC project with other projects (in terms of "99% of them manage to do this and that", etc.) may not be a good idea and also won't help fix the problems.

bronco wrote:
... open the config file in a text editor and tweak the RAM calculation up a little?...

I don't know how long it takes to adjust the RAM setting, but yes, they should solve the RAM problems.
ID: 36080
maeax

Joined: 2 May 07
Posts: 1005
Credit: 34,852,322
RAC: 11,603
Message 36081 - Posted: 27 Jul 2018, 8:51:08 UTC

Gyllic,
thank you for the info!
http://wlcg.web.cern.ch/
ID: 36081
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 323
Credit: 10,967,805
RAC: 360
Message 36159 - Posted: 1 Aug 2018, 10:33:40 UTC - in response to Message 36081.  

Thank you all for the interesting discussion :) and also to computezrmle for bringing my attention to it.

Regarding the memory limits, most of the failing tasks fail very quickly, after 5 or 10 mins. This makes me think that it's not related to the size of the input file or number of events but more due to different (more complex) physics processes being simulated these days.

The fact that the failing jobs don't waste too much CPU means that we don't really see them (but I know that's not much help to those of you experiencing the failures).

I have just changed the formula to 3000 MB + 900 MB * ncores, i.e. you now need 400 MB more per task, independent of the number of cores.
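
As a rough illustration of the arithmetic, the sketch below evaluates the formula for a few core counts. It assumes the per-core term is 900 MB, which is the reading consistent with the 4800 MB default for a 2-core setup and the 5300 MB figure for 3-core setups mentioned elsewhere in this thread. The previous base of 2600 MB is inferred from the stated 400 MB increase and is not confirmed by the project. The function name default_ram_mb is made up for this example.

```python
def default_ram_mb(ncores: int, base_mb: int = 3000, per_core_mb: int = 900) -> int:
    """Default VM memory in MB for a task using ncores cores."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return base_mb + per_core_mb * ncores


if __name__ == "__main__":
    for n in range(1, 5):
        new = default_ram_mb(n)
        # Previous base inferred from the 400 MB increase (assumption).
        old = default_ram_mb(n, base_mb=2600)
        print(f"{n} core(s): {old} MB -> {new} MB")
```

Under these assumptions a 2-core task goes from 4400 MB to 4800 MB, which lines up with the minimum requested in the opening post.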

It also seems we were giving credit a bit too liberally to hosts with these failures so I have made the validation more strict so that credit is not given for tasks which fail quickly. However if you run for a long time and then the task crashes without producing a HITS file you will still get credit.
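
The stricter validation described in the previous paragraph could be pictured roughly like this. It is purely a hypothetical sketch, not the project's validator: the 10-minute cutoff is an assumed value (the post only says "fail quickly"), and TaskResult and grants_credit are invented names.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    runtime_minutes: float
    has_hits_file: bool


# Hypothetical cutoff; the real threshold is not stated in this thread.
QUICK_FAILURE_MINUTES = 10.0


def grants_credit(result: TaskResult, cutoff: float = QUICK_FAILURE_MINUTES) -> bool:
    """Credit if a HITS file exists, or if the task ran longer than cutoff before failing."""
    if result.has_hits_file:
        return True
    return result.runtime_minutes > cutoff


if __name__ == "__main__":
    print(grants_credit(TaskResult(runtime_minutes=5, has_hits_file=False)))
    print(grants_credit(TaskResult(runtime_minutes=240, has_hits_file=False)))
```

In this reading, a task that dies after 5 minutes without a HITS file gets nothing, while one that crashes after several hours is still credited.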
ID: 36159
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1520
Credit: 85,588,574
RAC: 73,866
Message 36163 - Posted: 1 Aug 2018, 11:41:41 UTC - in response to Message 36159.  

David Cameron wrote:
Regarding the memory limits, most of the failing tasks fail very quickly, after 5 or 10 mins. This makes me think that it's not related to the size of the input file or number of events but more due to different (more complex) physics processes being simulated these days.

Hi David,

Sorry, I have to contradict this part of your statement.
Just today I'm running a test with ATLAS vbox and I forgot to adjust the RAM setting for the 2-core setup.
Thus the 1st WU failed exactly at the moment when the large input file was about to be expanded.
The 2nd WU with a RAM setting slightly above 4800MB is running perfectly.
ID: 36163
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 323
Credit: 10,967,805
RAC: 360
Message 36166 - Posted: 1 Aug 2018, 12:37:33 UTC - in response to Message 36163.  

David Cameron wrote:
Regarding the memory limits, most of the failing tasks fail very quickly, after 5 or 10 mins. This makes me think that it's not related to the size of the input file or number of events but more due to different (more complex) physics processes being simulated these days.

Hi David,

Sorry, I have to contradict this part of your statement.
Just today I'm running a test with ATLAS vbox and I forgot to adjust the RAM setting for the 2-core setup.
Thus the 1st WU failed exactly at the moment when the large input file was about to be expanded.
The 2nd WU with a RAM setting slightly above 4800MB is running perfectly.


Does it work with exactly 4800MB (now the default setting)?

You may be right that the filesize has an effect - since each task reads only a small fraction of the input file I didn't think that the entire file would be read into memory but maybe it does happen like that.
ID: 36166
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1520
Credit: 85,588,574
RAC: 73,866
Message 36168 - Posted: 1 Aug 2018, 13:14:26 UTC - in response to Message 36166.  

David Cameron wrote:
Does it work with exactly 4800MB (now the default setting)?

It seems that it highly depends on the input file size your backend systems deliver for the different job series.

During the last year there were job series with roughly 1 MB input file size as well as 100-130 MB input file size.
The recent input file size ranges between 300 MB and close to 400 MB (compressed!).

While the volunteers' hosts running a 3-core setup (or more) seem to have no problem with those input files (RAM >= 5300 MB), 1-core and 2-core setups did their job only with the smaller input files.

The 4800 MB was just a guess roughly between a 2-core and a 3-core setup to make ATLAS available on hosts with less RAM.
I did not test if 4700 MB or 4650 MB would also work.
ID: 36168
Yeti
Volunteer moderator

Joined: 2 Sep 04
Posts: 406
Credit: 96,567,558
RAC: 13
Message 36170 - Posted: 1 Aug 2018, 13:59:38 UTC - in response to Message 36159.  

David Cameron wrote:
I have just changed the formula to 3000 MB + 900 MB * ncores, i.e. you now need 400 MB more per task, independent of the number of cores.

Perhaps you could make a post in the News section so that a lot more crunchers will notice it.


Supporting BOINC, a great concept !
ID: 36170


©2020 CERN