Message boards : ATLAS application : all ATLAS tasks fail after about 10 minutes

Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31698 - Posted: 29 Jul 2017, 15:13:04 UTC

would be nice if we had some detailed description somewhere as to what "error Code 65" is.

At any rate: something was changed back at the server at CERN two days ago, by which a small number of crunchers is affected.

Too bad - I have been crunching ATLAS for more than 1 year, now it's over with :-(
ID: 31698 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 31700 - Posted: 29 Jul 2017, 16:51:12 UTC - in response to Message 31697.  

Hi Jim,

Looking at the log of the tasks on your machine, I can see 2 traces of interest:
Setting Memory Size for VM. (3400MB)

You need more than 3400 MB to run ATLAS tasks. You are using the default RAM setting for 1-CPU tasks, which is not enough. You should write your own app_config.xml file to override the default setting and set it to, say, 5000 MB.

FATAL makePool failed

From my own experience, this error occurs when you do not have enough RAM allocated, confirming the above observation.
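
For readers who have not written an app_config.xml before, here is a minimal sketch in Python that generates one. Several details are assumptions that this thread does not state: the app name `ATLAS`, the plan class `vbox64_mt_mcore_atlas`, and passing the vboxwrapper option `--memory_size_mb` through `<cmdline>`. Check the app name and plan class against the app_version entries in your own client_state.xml before using the result.

```python
import xml.etree.ElementTree as ET


def build_app_config(memory_mb, ncpus=1, app_name="ATLAS",
                     plan_class="vbox64_mt_mcore_atlas"):
    """Return the text of an app_config.xml that requests memory_mb for the VM.

    app_name and plan_class are assumed defaults, not values confirmed
    in this thread; adjust them to match your client_state.xml.
    """
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumption: the VM memory size is passed to vboxwrapper on its command line.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 5000 MB is the value suggested in the post above.
    print(build_app_config(5000))
```

The generated text would be saved as app_config.xml in the LHC@home project directory under the BOINC data directory, after which the BOINC client has to re-read its config files (or be restarted) before the new setting applies to newly started tasks.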
We are the product of random evolution.
ID: 31700 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 31701 - Posted: 29 Jul 2017, 16:58:38 UTC - in response to Message 31698.  

would be nice if we had some detailed description somewhere as to what "error Code 65" is.

Hi Erich,

When looking at the stderr with error code 65, look at the line that starts with
WARNING Transform now exiting early with exit code 65

At the very end of this line (you need to scroll all the way to the right), there are some details that can be of help. I saw the same error as for Jim:
FATAL makePool failed

Which is the same root cause: not enough RAM allocated (the default 3400MB is not sufficient).
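
As a rough sketch, the check described above can be done without scrolling by searching a saved copy of the task's stderr. The two marker strings are taken from this thread; the function names and the sample text are hypothetical, and real stderr lines are much longer than the sample.

```python
# Phrases taken from the posts in this thread.
EXIT_MARKER = "WARNING Transform now exiting early with exit code"
MAKEPOOL_MARKER = "FATAL makePool failed"


def early_exit_lines(stderr_text):
    """Return every stderr line that reports an early transform exit."""
    return [line.strip() for line in stderr_text.splitlines()
            if EXIT_MARKER in line]


def has_makepool_failure(stderr_text):
    """True if an early-exit line also carries the makePool failure text."""
    return any(MAKEPOOL_MARKER in line
               for line in early_exit_lines(stderr_text))


if __name__ == "__main__":
    # Illustrative sample only, assembled from fragments quoted in the thread.
    sample = (
        "Setting Memory Size for VM. (3400MB)\n"
        "WARNING Transform now exiting early with exit code 65 "
        "(... FATAL makePool failed ...)\n"
    )
    for line in early_exit_lines(sample):
        # The useful detail sits at the end of the line.
        print(line[-80:])
    print(has_makepool_failure(sample))
```

Matching on a substring rather than on the start of the line is a deliberate choice here, since a saved stderr may carry a timestamp or other prefix in front of the warning.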

Hoping it helps.
We are the product of random evolution.
ID: 31701 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 31702 - Posted: 29 Jul 2017, 18:14:48 UTC - in response to Message 31700.  

Hi Jim,

Looking at the log of the tasks on your machine, I can see 2 traces of interest:
Setting Memory Size for VM. (3400MB)

You need more than 3400 MB to run ATLAS tasks. You are using the default RAM setting for 1-CPU tasks, which is not enough. You should write your own app_config.xml file to override the default setting and set it to, say, 5000 MB.

FATAL makePool failed

From my own experience, this error occurs when you do not have enough RAM allocated, confirming the above observation.


Here is a task that also has error 65 and FATAL makePool failed: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153342993

This was run with a memory setting of 4500 MB (single core). The task was validated OK. I have not increased the memory to 5000 MB.
ID: 31702 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31703 - Posted: 29 Jul 2017, 18:24:51 UTC

HerveUAE, thanks for the analysis/comparison of Jim's and my stderr, with the result that there is not enough RAM per task.

Seems very interesting and almost unbelievable, because I have been crunching these 1-core ATLAS tasks for several months, without any problems.

The fact that lack of memory is the reason for my problem would imply that as of 2 days ago, there was a change in the RAM requirement of the ATLAS tasks.
If this were the case, would not many more crunchers be affected? (I doubt that so many crunchers had implemented an app_config for more RAM to begin with, as this was not necessary. When I was crunching 2-core ATLAS tasks, I needed to increase the RAM per task via app_config, but never for 1-core tasks.)
ID: 31703 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31704 - Posted: 29 Jul 2017, 19:25:19 UTC

Okay, I wanted to try it with a higher memory setting (5000 MB) via app_config.xml.
However, whereas yesterday the download of an ATLAS task took about half an hour but finally succeeded, today I was NOT able to download a new task at all.
The notice in the BOINC Manager is as follows:

29/07/2017 21:11:25 | LHC@home | Started download of jf_ee87d047116fce70cf9b9e5221a84fc6
29/07/2017 21:11:48 | LHC@home | Temporarily failed download of jf_ee87d047116fce70cf9b9e5221a84fc6: connect() failed
29/07/2017 21:11:48 | LHC@home | Backing off 00:14:12 on download of jf_ee87d047116fce70cf9b9e5221a84fc6
29/07/2017 21:11:51 | | Project communication failed: attempting access to reference site
29/07/2017 21:11:53 | | Internet access OK - project servers may be temporarily down.

Even retrying several times did not help.
So, this is definite proof that there is a major connectivity problem. They must have made some change to their server two days ago.
Once again, just FYI: I did NOT make any changes, neither in hardware nor in software. And CMS tasks are running well as usual.
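
For anyone comparing several such event-log excerpts, the following is a minimal sketch that pulls the failed downloads and their back-off times out of the pasted text. It assumes only the `timestamp | project | message` layout and the two message wordings visible in the excerpt above; the function name is hypothetical, and other BOINC client versions may word these messages differently.

```python
import re

# Message wordings as they appear in the log excerpt in this post.
FAILED = re.compile(r"Temporarily failed download of (\S+): (.+)")
BACKOFF = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")


def summarize_downloads(log_text):
    """Map each file name to its failure reason and back-off in seconds."""
    summary = {}
    for line in log_text.splitlines():
        # Event-log lines look like "timestamp | project | message".
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue
        message = parts[2]
        failed = FAILED.search(message)
        if failed:
            summary.setdefault(failed.group(1), {})["reason"] = failed.group(2)
            continue
        backoff = BACKOFF.search(message)
        if backoff:
            hours, minutes, seconds, name = backoff.groups()
            total = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            summary.setdefault(name, {})["backoff_s"] = total
    return summary
```

Applied to the five lines above, this would report one file with the reason `connect() failed` and a back-off of 852 seconds, which points at the connection to the download server rather than at the task itself.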
ID: 31704 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 31705 - Posted: 29 Jul 2017, 19:40:46 UTC

Hi Jim, Erich,

I have tried several memory settings over time. My own experience is that the RAM requirement is not fixed and varies from one ATLAS task to another. The more RAM is allocated, the lower the probability that a task will fail. Even so, from time to time, one task out of many will fail. I personally think it does not depend on the number of allocated cores, but on the ATLAS algorithm itself. And it could very well be that a given set of tasks has a higher RAM requirement than other sets.

I personally have set the RAM to 7000 MB and very seldom have issues related to a lack of memory. My laptop has only 8 GB, so I could allocate only 5800 MB to ATLAS there. In recent days, I have not had any memory-related problems on that machine.

There were some extensive tests and discussions in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4146#29171 where 5000MB was suggested as a good minimum.

Try increasing progressively and see if it helps.
We are the product of random evolution.
ID: 31705 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31706 - Posted: 29 Jul 2017, 19:50:09 UTC - in response to Message 31705.  

Thanks, HerveUAE.
However, as I wrote above, I now am not able to try any RAM setting whatsoever, not being able to connect to the ATLAS download server.
ID: 31706 · Report as offensive     Reply Quote
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31708 - Posted: 29 Jul 2017, 21:02:58 UTC - in response to Message 31705.  
Last modified: 29 Jul 2017, 21:04:19 UTC

Hi Jim, Erich,

I have tried several memory settings over time. My own experience is that the RAM requirement is not fixed and varies from one ATLAS task to another. The more RAM is allocated, the lower the probability that a task will fail. Even so, from time to time, one task out of many will fail. I personally think it does not depend on the number of allocated cores, but on the ATLAS algorithm itself. And it could very well be that a given set of tasks has a higher RAM requirement than other sets.

I personally have set the RAM to 7000 MB and very seldom have issues related to a lack of memory. My laptop has only 8 GB, so I could allocate only 5800 MB to ATLAS there. In recent days, I have not had any memory-related problems on that machine.

There were some extensive tests and discussions in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4146#29171 where 5000MB was suggested as a good minimum.

Try increasing progressively and see if it helps.


HerveUAE,

Thanks. I remember that discussion of a few months ago, and in fact I did use an app_config.xml at that time to fix the problem.
I had sort of assumed that was no longer necessary, and had forgotten about it, but thanks for the reminder.
ID: 31708 · Report as offensive     Reply Quote
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 31709 - Posted: 29 Jul 2017, 21:18:32 UTC
Last modified: 29 Jul 2017, 21:21:42 UTC

It looks like the ATLAS servers have some issue; I am having problems downloading tasks.

For some of the tasks that are running, I am not able to open the VM console, and those tasks that I could open show no events done.

Could we get a status update on what the issue could be?

(This is only related to ATLAS, and not all tasks are affected.)
ID: 31709 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31741 - Posted: 31 Jul 2017, 13:31:25 UTC
Last modified: 31 Jul 2017, 13:34:03 UTC

I was able to download an ATLAS task. So at least, they got this problem solved.

However, the other problem still remains: the task failed after 16 minutes: error Code 65.
The same as I had some 2 days before the download problems started.

The stderr can be seen here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=153369828
ID: 31741 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 31742 - Posted: 31 Jul 2017, 14:41:59 UTC - in response to Message 31741.  

However, the other problem still remains: the task failed after 16 minutes: error Code 65.
The same as I had some 2 days before the download problems started.

The stderr can be seen here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=153369828

Erich, could you try a (much) newer VBox version: https://www.virtualbox.org/wiki/Downloads
ID: 31742 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31743 - Posted: 31 Jul 2017, 14:55:06 UTC - in response to Message 31701.  
Last modified: 31 Jul 2017, 15:35:01 UTC

I now followed the advice of HerveUAE:

Hi Erich,

When looking at the stderr with error code 65, look at the line that starts with
WARNING Transform now exiting early with exit code 65

At the very end of this line (you need to scroll all the way to the right), there are some details that can be of help. I saw the same error as for Jim:
FATAL makePool failed

Which is the same root cause: not enough RAM allocated (the default 3400MB is not sufficient).

Hoping it helps.


and indeed, after I increased the memory to 5000 MB via app_config.xml, all works fine now.
So, obviously, for the first time, the 1-core ATLAS tasks need more memory than allocated as per the BOINC standard. Why so?
ID: 31743 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 31744 - Posted: 31 Jul 2017, 17:58:02 UTC

I raised the memory allocation to 6000 MB and still got the Non-zero return code from EVNTtoHITS (65) (Error code 65) and FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider. The task, https://lhcathome.cern.ch/lhcathome/result.php?resultid=153371975 , was very short (16 minutes of CPU time) and produced no HITS file, but it was validated.

Anyway, one task (https://lhcathome.cern.ch/lhcathome/result.php?resultid=153354120) did finish OK with a HITS file before that short one.

So I am not satisfied with the memory explanation.
ID: 31744 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31745 - Posted: 31 Jul 2017, 19:22:00 UTC - in response to Message 31744.  

I raised the memory allocation to 6000 MB and still got the Non-zero return code from EVNTtoHITS (65) (Error code 65) and FATAL makePool failed

this is strange, indeed.

Here, the two ATLAS tasks which I downloaded and started about 5 1/2 hours ago are still running fine.
However, two things are different compared to before:
- after 5 1/2 hours, progress is shown as only 55%, so the tasks seem to run longer now (before, it was between 6 1/2 and 7 hours)
- the VM console opens, but after pressing Alt+F2, the usual information is NOT shown.
ID: 31745 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31771 - Posted: 1 Aug 2017, 13:18:02 UTC

from what I can see so far, the latest ATLAS tasks (since July 26) use up almost 1 GB more memory compared to the ones before.
ID: 31771 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,911,701
RAC: 138,045
Message 31772 - Posted: 1 Aug 2017, 14:41:33 UTC

At least the download issues regarding boincai04 seem to be solved.
Nonetheless the remaining errors mentioned here are still persistent.

I think it makes no sense to stay at ATLAS until the errors are sorted out.
I will change my setup to run the other subprojects for a while.
ID: 31772 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31773 - Posted: 1 Aug 2017, 15:43:20 UTC - in response to Message 31772.  
Last modified: 1 Aug 2017, 15:43:50 UTC

At least the download issues regarding boincai04 seem to be solved.
Nonetheless the remaining errors


From my recent experience, I can say that Harri Liljeroos was right with his posting:

Looking at the log of the tasks on your machine, I can see 2 traces of interest:
Setting Memory Size for VM. (3400MB)

You need more than 3400 MB to run ATLAS tasks. You are using the default RAM setting for 1-CPU tasks, which is not enough. You should write your own app_config.xml file to override the default setting and set it to, say, 5000 MB.

FATAL makePool failed

From my own experience, this error occurs when you do not have enough RAM allocated, confirming the above observation.


After I had increased the memory to 5000MB (via app_config.xml), all tasks that were downloaded succeeded.

However, as I wrote in another thread this afternoon, each 1-core task now takes at least 1GB more RAM than before.
As the default setting is 3400MB, any such task is bound to fail, unless the RAM is increased "manually".
But this is something the CERN people should have told us beforehand. No idea how many users are having a problem now, not knowing right away how to solve it (I am talking about the ones who do not read in the forum here).
ID: 31773 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 31775 - Posted: 1 Aug 2017, 18:06:27 UTC - in response to Message 31773.  

I don't take credit for that post, it was HerveUAE who posted that. So credit to whom it belongs ;). I just commented on it.
ID: 31775 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,349,726
RAC: 101,646
Message 31776 - Posted: 1 Aug 2017, 19:44:10 UTC - in response to Message 31775.  

I don't take credit for that post, it was HerveUAE who posted that...

sorry for the mix-up from my side :-(
ID: 31776 · Report as offensive     Reply Quote