Message boards : ATLAS application : Some Validate errors
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 29852 - Posted: 6 Apr 2017, 21:45:51 UTC - in response to Message 29848.  
Last modified: 6 Apr 2017, 21:48:17 UTC

Hi Jim, when I look at your task list, I notice that, curiously, some of the tasks finished at the same time:

Philippe,

I think that just means that they were uploaded at the same time. They actually finished on my machine a few minutes apart, as you can see from the BoincTasks History log:

Ga8LDmCc3FqnSu7Ccp2YYBZmABFKDmABFKDm3INKDmYMSKDmcEz7Ao_0 04:06:47 (03:47:52) 4/6/2017 6:50:43 AM 4/6/2017 7:36:29 AM
JMhKDmE50FqnSu7Ccp2YYBZmABFKDmABFKDm3INKDmAHSKDmK2i9im_0 04:00:06 (03:41:22) 4/6/2017 6:31:38 AM 4/6/2017 7:36:29 AM

So I don't think there is a problem.
Regards
ID: 29852
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 29854 - Posted: 7 Apr 2017, 5:10:50 UTC

Since last night, none of the finished and uploaded tasks have been validated; on the task page, the "Points" column says "pending".

What's the problem?
ID: 29854
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 29861 - Posted: 7 Apr 2017, 10:27:06 UTC - in response to Message 29854.  
Last modified: 7 Apr 2017, 10:27:19 UTC

What's the problem?

The problem has obviously resolved itself; all the uploaded tasks have now been validated.
ID: 29861
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 29866 - Posted: 7 Apr 2017, 16:26:41 UTC - in response to Message 29861.  
Last modified: 7 Apr 2017, 16:28:11 UTC

I found my mistake:

the time in the task list refers to the moment of validation by the server.

The WU is downloaded from the server; the sending time is shown in the task list.
The WU starts at a later time in the BOINC client; the start time is given in the log.
The WU is crunched for the elapsed time.
The WU ends at the finish time given in the log.
The WU is uploaded to the server.
The result is stored on the server until the BOINC client next contacts the project (project update).
The WUs are then sorted, grouped by host, and validated at the time shown in the task list.

So it seems normal that several WUs can be validated at the same time, if the lapse between two project updates from the BOINC client to the server is longer than the duration of a WU (on a small host) or covers several delayed WUs (on a big host running many simultaneous tasks).
(Sorry, but I never give up.)
I just wanted to understand whether several WUs being validated at the same time would influence the credit earned under the new credit system, whose rules are dynamic and depend on the size of the hosts and on the particular difficulty of some jobs.
ID: 29866
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 31801 - Posted: 4 Aug 2017, 5:50:22 UTC

Four PCs had problems starting this workunit:

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=73498862
ID: 31801
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 31805 - Posted: 4 Aug 2017, 8:45:47 UTC - in response to Message 31801.  

Four PCs had problems starting this workunit:

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=73498862

One of these machines is my top cruncher, so I'm sure it is not the machine.

It's either a problem with the WU, or with the server backend, or both.


Supporting BOINC, a great concept !
ID: 31805
mrchips

Joined: 16 May 14
Posts: 15
Credit: 7,343,729
RAC: 0
Message 31870 - Posted: 7 Aug 2017, 12:44:47 UTC

ALL my tasks keep failing with a validate error. WHY?
ID: 31870
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 31871 - Posted: 7 Aug 2017, 12:52:26 UTC - in response to Message 31870.  

ALL my tasks keep failing with a validate error. WHY?

Take a walk through my checklist and pay particular attention to point no. 2.


Supporting BOINC, a great concept !
ID: 31871
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 31872 - Posted: 7 Aug 2017, 13:38:31 UTC - in response to Message 31870.  

mrchips wrote:
ALL my tasks keep failing with a validate error. WHY?

The recent batch probably needs more than the automatically configured 4200 MB of RAM for a 2-core WU.
You may set 5000 MB via app_config.xml.

Besides that, it is of course good advice to check your system against Yeti's checklist.
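
For illustration, a minimal app_config.xml along those lines could look like the sketch below. The app name, the plan class and the --memory_size_mb vboxwrapper option reflect a typical ATLAS (vbox) setup; treat them as assumptions and copy the exact strings from your own client_state.xml before using it:

<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <!-- assumed plan class; take the exact name from client_state.xml -->
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <!-- run the VM on 2 cores and raise its RAM from the default to 5000 MB -->
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
</app_config>

Save the file in the LHC@home project directory and use Options / Read config files in the BOINC Manager so the new value is picked up.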
ID: 31872
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 31882 - Posted: 7 Aug 2017, 19:17:46 UTC - in response to Message 31872.  

The recent batch probably needs more than the automatically configured 4200 MB of RAM for a 2-core WU.
You may set 5000 MB via app_config.xml.

Hm, in view of the fact that currently even a 1-core ATLAS WU uses almost 5000 MB of RAM, I would guess that a 2-core WU needs even more than that.
I might try 2-core WUs tomorrow, so we'll see.
ID: 31882
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31885 - Posted: 7 Aug 2017, 21:14:10 UTC - in response to Message 31882.  
Last modified: 7 Aug 2017, 21:16:54 UTC

The recent batch probably needs more than the automatically configured 4200 MB of RAM for a 2-core WU.
You may set 5000 MB via app_config.xml.

Hm, in view of the fact that currently even a 1-core ATLAS WU uses almost 5000 MB of RAM, I would guess that a 2-core WU needs even more than that.
I might try 2-core WUs tomorrow, so we'll see.

I set the RAM to 5000 MB via app_config.xml and also switched to two CPU cores per task, and have had no problems since then (all errors before that).
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10477864&offset=0&show_names=0&state=0&appid=14

In BoincTasks, the real memory usage shows up as 4200 MB for the two-core task.
ID: 31885
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 31888 - Posted: 8 Aug 2017, 6:22:05 UTC

As I understand it, there is a RAM usage peak during the initialisation phase of the ATLAS VM.
Some of the startup scripts fail if there is not enough RAM, and a watchdog then shuts down the VM.

During the calculation phase the RAM requirement seems to be much lower.

VMs that are configured to use more CPU cores get enough RAM to also cover the initialisation phase, but
1-core (and sometimes 2-core) VMs need special RAM management, probably during batch generation.

But that's only a guess derived from the error logs.

It would be nice if somebody from the project team could post an explanation, e.g. whether there is a checklist for batch generation or some quality checks.
ID: 31888
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31890 - Posted: 8 Aug 2017, 6:41:01 UTC - in response to Message 31888.  
Last modified: 8 Aug 2017, 6:51:36 UTC

VMs that are configured to use more CPU cores get enough RAM to also cover the initialisation phase, but
1-core (and sometimes 2-core) VMs need special RAM management, probably during batch generation.

I am trying the single-CPU-core tasks again (still with 5000 MB of memory). The reason is that the two-core tasks do not honor the core reservation that I have set up for my GPU in an app_config.xml for GPU Grid. That is, ATLAS will run four tasks at once while the GPU also uses a core, so that nine cores are being allocated. That results in the GPU running slowly due to CPU starvation, which is not a desirable condition.

If I encounter memory problems with this configuration, I can limit the number of ATLAS tasks via an app_config.xml, but that may lead to scheduling problems, as you probably know, and I would prefer to avoid it. But unless several ATLAS tasks are in the start-up phase at the same time, it should not be a problem with 32 GB of memory.
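
For reference, capping the number of concurrently running ATLAS tasks would be done with a max_concurrent entry in the same app_config.xml; a minimal sketch, assuming the app name is ATLAS as reported by the client:

<app_config>
  <app>
    <name>ATLAS</name>
    <!-- never run more than two ATLAS tasks at the same time -->
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>

Note that max_concurrent only limits how many tasks run at once; the client may still fetch more work than that, which is the kind of scheduling side effect mentioned above.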
ID: 31890
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 31893 - Posted: 8 Aug 2017, 16:14:15 UTC - in response to Message 31888.  
Last modified: 8 Aug 2017, 16:14:49 UTC

computezrmle wrote:
As I understand it, there is a RAM usage peak during the initialisation phase of the ATLAS VM.
Some of the startup scripts fail if there is not enough RAM, and a watchdog then shuts down the VM.

During the calculation phase the RAM requirement seems to be much lower.

Hm, honestly, I have never observed this, although I have been running 1-core, 2-core and 3-core ATLAS tasks for about a year and a half now.

What I see is a rather quick increase in RAM usage right from the beginning; then the increase flattens out and stays at a pretty even level until the task is finished.
ID: 31893
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 31894 - Posted: 8 Aug 2017, 16:34:11 UTC - in response to Message 31890.  

Jim1348 wrote:
I am trying the single-CPU-core tasks again (still with 5000 MB of memory). The reason is that the two-core tasks do not honor the core reservation that I have set up for my GPU in an app_config.xml for GPU Grid. That is, ATLAS will run four tasks at once while the GPU also uses a core, so that nine cores are being allocated. That results in the GPU running slowly due to CPU starvation, which is not a desirable condition.

Hm, this sounds interesting.
I am also running LHC tasks and GPUGRID tasks concurrently on 2 of my PCs, but I have never had any problems with CPU core reservation/allocation.

Only for LHC do I use various app_config.xml files (depending on which configuration I'd like to run); for GPUGRID I have never had an app_config.xml.

For example, my main PC:
12-core CPU (= 6 cores + 6 HT), 32 GB RAM.
2 GPUs, with 1 GPUGRID task running on each (completely automatic, no need for an app_config.xml).
Currently, 4 2-core ATLAS tasks are also running (app_config.xml for higher RAM).

This uses a total of 10 of my 12 CPU cores; total CPU usage is shown as ~86%. Total RAM usage (once the ATLAS tasks are running beyond the startup phase): ~22.5 GB.

What surprises me regarding RAM usage is the following:
When I ran 1-core ATLAS tasks (until yesterday), I had to increase the RAM available per task to 5000 MB (via app_config.xml), otherwise the tasks failed after 10-14 minutes. From what I could measure, each task indeed used close to 5000 MB.
Now the 2-core ATLAS tasks use no more than slightly over 5000 MB each, whereas I would have expected (based on how much the 1-core tasks needed) a value between 7000 and 7500 MB. This is really strange.
ID: 31894
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 31895 - Posted: 8 Aug 2017, 16:51:26 UTC

The initialisation and end phases of ATLAS tasks are single core, while the calculation phase is multicore. If the highest RAM requirements are from the initialisation phase, then it is logical to have similar RAM requirements for 1-core and 2-core tasks.
We are the product of random evolution.
ID: 31895
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 31896 - Posted: 8 Aug 2017, 17:15:20 UTC

Erich56 wrote:
What I see is a rather quick increase in RAM usage right from the beginning; then the increase flattens out and stays at a pretty even level until the task is finished.

There are different perspectives:

(1) from outside the VM, if you keep an eye on the host's monitoring apps
(2) from inside the VM, if you look at the top console (which we still don't have in ATLAS)

IIRC, (1) shows the maximum amount of RAM the VM has allocated since startup, as VirtualBox never gives it back to the OS.
I have no evidence for my previous guess; it's all speculation. But it's obvious that giving the VM (much) more RAM avoids errors with the recent batch.


HerveUAE wrote:
The initialisation and end phases of ATLAS tasks are single core, while the calculation phase is multicore. If the highest RAM requirements are from the initialisation phase, then it is logical to have similar RAM requirements for 1-core and 2-core tasks.

Good point.
ID: 31896
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31897 - Posted: 8 Aug 2017, 17:25:29 UTC - in response to Message 31894.  
Last modified: 8 Aug 2017, 17:38:28 UTC

I am also running LHC tasks and GPUGRID tasks concurrently on 2 of my PCs, but I have never had any problems with CPU core reservation/allocation.

Only for LHC do I use various app_config.xml files (depending on which configuration I'd like to run); for GPUGRID I have never had an app_config.xml.

For example, my main PC:
12-core CPU (= 6 cores + 6 HT), 32 GB RAM.
2 GPUs, with 1 GPUGRID task running on each (completely automatic, no need for an app_config.xml).
Currently, 4 2-core ATLAS tasks are also running (app_config.xml for higher RAM).

This uses a total of 10 of my 12 CPU cores; total CPU usage is shown as ~86%. Total RAM usage (once the ATLAS tasks are running beyond the startup phase): ~22.5 GB.

In your case, it doesn't really matter whether you use an app_config.xml for GPU Grid, since you have four spare cores to feed the two cards.

However, if LHC (or any other project) were allowed to run on all the cores, it would use all 12. By default, GPU Grid does not reserve a whole core for itself, so it would have to run on a partial core, which slows it down. In that case, you would normally use an app_config.xml to reserve a whole core for each GPU for best performance.
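
A minimal sketch of such a reservation for GPU Grid follows; the app name below (acemdlong) is only an example of a GPUGRID application and may differ on your host, so take the real name from client_state.xml:

<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <!-- one task per GPU, and budget one full CPU core for each GPU task -->
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

With cpu_usage set to 1.0, the BOINC scheduler counts a whole core per GPU task, so CPU projects such as ATLAS are kept to the remaining cores.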
ID: 31897
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 33131 - Posted: 25 Nov 2017, 10:18:18 UTC

This workunit has not finished successfully on more than four PCs:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=80187546
ID: 33131
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 33133 - Posted: 25 Nov 2017, 13:36:19 UTC

ID: 33133