Message boards : ATLAS application : New app version 1.01

Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29251 - Posted: 14 Mar 2017, 14:16:37 UTC

A 3-core ATLAS did well this morning, but thereafter I wanted to run a single core and that one died early.

This is an interesting observation. I always run ATLAS on 1 core. I will try on 2 to see if it makes a difference.
So, it may help, if you suspend / de-select other VM-Subprojects for a while and test, how this works.

And I will try that suggestion as well.
We are the product of random evolution.
ID: 29251 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29252 - Posted: 14 Mar 2017, 14:54:54 UTC - in response to Message 29247.  

I think there may still be some other issue. All ATLAS WUs still fail on my machine, and the following 2 WUs also failed on another machine:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60483240
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521
Maybe this is the root cause of a problem that led to the servers being overloaded over the weekend.

The second WU (bold) was finished by one of my machines and has been validated.


Supporting BOINC, a great concept !
ID: 29252 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 374
Credit: 13,562,413
RAC: 7,602
Message 29254 - Posted: 14 Mar 2017, 15:14:13 UTC - in response to Message 29252.  
Last modified: 14 Mar 2017, 15:15:52 UTC

Yeti, you are using an app_config.xml to set the cores and memory for ATLAS, right? I wonder if the memory size is too small for one and two cores so that's why they don't succeed. I'm running 4-core tasks on my machine and have 100% success since yesterday afternoon. I'll try running one and two cores to see if I see any problems.

PS I also run only ATLAS.
ID: 29254 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29256 - Posted: 14 Mar 2017, 15:21:53 UTC - in response to Message 29254.  

Yeti, you are using an app_config.xml to set the cores and memory for ATLAS, right? I wonder if the memory size is too small for one and two cores so that's why they don't succeed. I'm running 4-core tasks on my machine and have 100% success since yesterday afternoon. I'll try running one and two cores to see if I see any problems.

Yes, I'm using app_config.xml to set up cores and memory.

As my machines have plenty of RAM, I'm giving the ATLAS VM a generous RAM allocation:

3-Core WU: 5000 MB
4-Core WU: 7500 MB
5-Core WU: 7500 MB
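
For readers who have not used app_config.xml before, the sketch below generates one from per-core-count RAM figures like those listed above. It is an illustration, not Yeti's actual file: the app name `ATLAS`, the plan class `vbox64_mt_mcore_atlas` and the use of `--nthreads` and `--memory_size_mb` in `cmdline` are assumptions that should be checked against your own client_state.xml and the vboxwrapper documentation.

```python
import xml.etree.ElementTree as ET

# RAM per core count in MB, taken from the figures in this post.
RAM_MB_BY_CORES = {3: 5000, 4: 7500, 5: 7500}


def build_app_config(ncores, app_name="ATLAS",
                     plan_class="vbox64_mt_mcore_atlas"):
    """Return app_config.xml text for one multi-core setting.

    app_name and plan_class are assumed values, not confirmed by the thread.
    """
    memory_mb = RAM_MB_BY_CORES[ncores]
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncores)
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {ncores} --memory_size_mb {memory_mb}"
    )
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config(3))
```

The output would be saved as app_config.xml in the project's directory under the BOINC data folder, after which the client has to re-read its config files.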

---------------------------------------

I'm wondering whether it could have something to do with the processor. You remember that I have one machine that could only crunch single-core WUs.

Maybe I should test this one again here at Atlas@LHC.


Supporting BOINC, a great concept !
ID: 29256 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29257 - Posted: 14 Mar 2017, 15:35:33 UTC
Last modified: 14 Mar 2017, 15:36:06 UTC

I wonder if the memory size is too small for one and two cores so that's why they don't succeed.

I am currently running two 2-core ATLAS tasks, one with the default RAM assigned by the server (3400 MB) and one manually forced through app_config to 5000 MB.
We are the product of random evolution.
ID: 29257 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29258 - Posted: 14 Mar 2017, 15:46:47 UTC
Last modified: 14 Mar 2017, 15:51:08 UTC

The task with the default memory size of 3400 MB failed with the error "FATAL makePool failed", which may be related to a lack of memory:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126128746
The other task has passed the first 20 minutes, which is a good sign :).
We are the product of random evolution.
ID: 29258 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 374
Credit: 13,562,413
RAC: 7,602
Message 29260 - Posted: 14 Mar 2017, 16:02:18 UTC - in response to Message 29258.  

After switching to 1 core I got 100% failures :(

I was able to log into the VM and catch the log messages; it is indeed a problem of running out of memory.

I have increased the memory formula to 1.6 + 1 * ncores.
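
The new formula can be written out as a small helper. This sketch assumes the formula is in GB (later posts in this thread quote 2.6 GB for one core and 3600 MB for two) and works in whole MB to avoid floating-point rounding. The function name and parameter names are made up for illustration.

```python
def atlas_memory_mb(ncores, base_mb=1600, per_core_mb=1000):
    """Memory for an ATLAS VM under the formula 1.6 GB + 1 GB * ncores."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return base_mb + per_core_mb * ncores


for n in range(1, 5):
    print(f"{n}-core: {atlas_memory_mb(n)} MB")
```

Calling it with `base_mb=1400` reproduces the 2.4 GB, 3.4 GB and 4.4 GB figures that other posters report for the earlier setting.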
ID: 29260 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29262 - Posted: 14 Mar 2017, 17:14:13 UTC - in response to Message 29260.  

After switching to 1 core I got 100% failures :(

....

I have increased the memory formula to 1.6 + 1 * ncores.

All my single-core tasks failed too: 1.4 GB + 1 GB = 2.4 GB.
Now running 2 dual-core tasks with 1.4 GB + 2 * 1 GB = 3.4 GB, which should fail too, as HerveUAE wrote ... wait ... Indeed they failed!

My successful task this morning was a 3-core with 4.4GB.

I'll now use David's new formula for dual-core tasks.
If they seem to succeed, I'll try the single-core one again with 2.6 GB.
ID: 29262 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29263 - Posted: 14 Mar 2017, 17:30:25 UTC - in response to Message 29262.  

I'll now use David's new formula for dual-core tasks.

3600 MB does not seem to be enough for a dual-core task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126128929
ID: 29263 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29266 - Posted: 14 Mar 2017, 18:32:16 UTC

And 3800 MB does not seem to be enough for a dual-core task either: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126129022
ID: 29266 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29267 - Posted: 14 Mar 2017, 18:53:49 UTC - in response to Message 29266.  

ID: 29267 · Report as offensive     Reply Quote
peterfilla

Joined: 2 Jan 11
Posts: 23
Credit: 5,986,899
RAC: 0
Message 29268 - Posted: 14 Mar 2017, 19:09:22 UTC

Only 60 seconds for the checkpoint interval? I think that should be longer, perhaps ...
ID: 29268 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29269 - Posted: 14 Mar 2017, 19:15:31 UTC

If I remember correctly, the previous formula was 1.4 GB + NumberOfCores * 0.8 GB, which apparently worked for 3-core tasks (3800 MB).
Why would 4000 MB now be insufficient for 2-core tasks? Aren't we running the same 1.01 version on the same data set?
We are the product of random evolution.
ID: 29269 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29270 - Posted: 14 Mar 2017, 19:20:55 UTC

My successful task this morning was a 3-core with 4.4GB.

Maybe 4.4 GB is the right value, flat for any number of cores.
We are the product of random evolution.
ID: 29270 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29271 - Posted: 14 Mar 2017, 19:25:12 UTC

Perhaps try my figures first and then work your way down:

3-Core WU: 5000 MB
4-Core WU: 7500 MB
5-Core WU: 7500 MB

These are well proven.


Supporting BOINC, a great concept !
ID: 29271 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29273 - Posted: 14 Mar 2017, 20:20:19 UTC

I finally have a dual-core task running with 4300 MB of RAM.
I showed earlier that a 3-core VM ran with 4400 MB; maybe it would also run with 4300 MB?
ID: 29273 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29274 - Posted: 14 Mar 2017, 21:14:53 UTC - in response to Message 29273.  

I finally have a dual-core task running with 4300 MB of RAM.
I showed earlier that a 3-core VM ran with 4400 MB; maybe it would also run with 4300 MB?

Both 2-core and 3-core VMs will run, but the dual-core is only using 1 core and the 3-core is only using 2 cores, both with 4300 MB RAM,
whereas the 3-core with 4400 MB RAM was using 3 cores.
ID: 29274 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 374
Credit: 13,562,413
RAC: 7,602
Message 29275 - Posted: 14 Mar 2017, 21:26:02 UTC

My single core task with 2.6GB finished ok:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126128857

I am not sure the makePool error is caused by memory. All the failures I was getting were with "EVNTtoHITS got a SIGKILL signal (exit code 137)", where the kernel was killing the ATLAS process for using too much memory, e.g.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126128854
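
As background on the exit code: by shell convention, a process terminated by a signal is reported with exit code 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which is what the Linux out-of-memory killer sends. A minimal sketch of that decoding, assuming a POSIX system (the helper name is illustrative and not part of any ATLAS or BOINC tool):

```python
import signal


def describe_exit_code(code):
    """Translate a shell-style exit code into a readable description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # on Linux: killed by SIGKILL
```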

In general I would recommend running more cores (4 to 8) as this is more efficient. But, as others have discovered, credit is based mainly on running time so using fewer cores guarantees more credit. So it's up to you whether you care about efficiency or credit :)
ID: 29275 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29280 - Posted: 15 Mar 2017, 1:52:46 UTC

Both 2-core and 3-core VMs will run, but the dual-core is only using 1 core and the 3-core is only using 2 cores, both with 4300 MB RAM,

I have a couple of 2-core WUs that completed OK with 4400 MB. The tasks were using 2 cores during their "full speed" phase of the execution:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126131160
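
To keep track of the results reported so far in this thread, they can be collected in one place. The sketch below only restates outcomes posted above (tasks still in progress are left out); these are single reports from different machines, so the computed values are a rough guide and not a verified minimum. The outcome labels and the function name are made up for illustration.

```python
# (cores, RAM in MB, outcome) as reported in the posts above.
REPORTS = [
    (1, 2400, "failed"),
    (1, 2600, "ok"),
    (2, 3400, "failed"),
    (2, 3600, "failed"),
    (2, 3800, "failed"),
    (2, 4300, "ran on fewer cores"),
    (2, 4400, "ok"),
    (3, 4300, "ran on fewer cores"),
    (3, 4400, "ok"),
]


def smallest_working_ram(reports):
    """Map each core count to the lowest RAM value reported as fully OK."""
    best = {}
    for cores, ram_mb, outcome in reports:
        if outcome == "ok":
            best[cores] = min(ram_mb, best.get(cores, ram_mb))
    return best


print(smallest_working_ram(REPORTS))  # {1: 2600, 2: 4400, 3: 4400}
```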

In general I would recommend running more cores (4 to 8) as this is more efficient.

My own observation is that 1-core tasks are more efficient than tasks with 2 or more cores, assuming you have sufficient RAM on your machine. The reason is simple: ATLAS tasks take very long to start up and reach their "full speed", typically a minimum of 20 minutes on my machines, and up to 30 minutes or more. During that period, none of the cores allocated to the task are working. The more cores you use, the more CPU time goes unused proportionally, and your overall productivity, measured in the number of tasks completed per day, decreases.
But maybe this is because of the bandwidth I have from home, or Internet latency between my UAE ISP and the ATLAS servers.
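
The start-up argument can be made concrete with a little arithmetic. The sketch below assumes an idealized task whose compute phase scales perfectly across cores and whose start-up phase leaves every allocated core idle. The 20-minute start-up comes from this post, while the 240 core-minutes of work is a made-up figure for illustration only.

```python
def cpu_efficiency(ncores, startup_min, work_core_min):
    """Fraction of allocated core-time spent on useful work.

    Wall time is startup_min + work_core_min / ncores, and all ncores
    are reserved for that whole time.
    """
    wall_min = startup_min + work_core_min / ncores
    return work_core_min / (ncores * wall_min)


for n in (1, 2, 4, 8):
    eff = cpu_efficiency(n, startup_min=20, work_core_min=240)
    print(f"{n} core(s): {eff:.0%} of allocated core-time is useful")
```

Under these assumptions the useful fraction falls as cores are added. This toy model only captures the start-up overhead; it says nothing about other factors that may favor multi-core tasks.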

credit is based mainly on running time so using fewer cores guarantees more credit.

This is indeed what I observed over a short period of time, but not after a week or two. The credit allocation slowly adjusts itself to the number of cores you use. Well, at least this is what I observed at ATLAS@Home. Credit calculation at LHC@Home may be different.
We are the product of random evolution.
ID: 29280 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2029
Credit: 149,005,719
RAC: 121,157
Message 29287 - Posted: 15 Mar 2017, 9:24:13 UTC - in response to Message 29280.  

My standard suggestion in this case:
Consider using a proxy, e.g. Squid.

ATLAS WUs generate between 1000 and 2000 HTTP requests at startup to fill the local CVMFS or get data from the frontier caches.
A proxy with typically 128 MB RAM and 25 GB disk would serve more than 90% of those requests and 50% of the data volume.

As a result, the startup times drop to 5-12 minutes.
ID: 29287 · Report as offensive     Reply Quote