all ATLAS tasks fail after about 10 minutes
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
would be nice if we had some detailed description somewhere as to what "error Code 65" is. At any rate: something was changed on the server at CERN two days ago, and a small number of crunchers are affected by it. Too bad - I have been crunching ATLAS for more than a year, and now it's over :-( |
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
Hi Jim, Looking at the log of the tasks on your machine, I can see 2 traces of interest: (1) "Setting Memory Size for VM. (3400MB)": you need more than 3400 MB to run ATLAS tasks. You are using the default RAM setting for 1-CPU tasks, which is not enough. You should write your own app_config.xml file to override the default setting and set it to, say, 5000 MB. (2) "FATAL makePool failed": in my own experience, this error occurs when not enough RAM is allocated, which confirms the observation above. We are the product of random evolution. |
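As a rough illustration of the app_config.xml suggested above, here is a minimal Python sketch that generates such a file. It rests on several assumptions that are not stated in this thread and should be checked against your own client_state.xml: the application name `ATLAS`, the plan class `vbox64_mt_mcore_atlas`, and the vboxwrapper option `--memory_size_mb` passed through `<cmdline>`. The helper name `render_app_config` is made up for this example.

```python
import xml.etree.ElementTree as ET


def render_app_config(memory_mb: int, ncpus: int = 1,
                      app_name: str = "ATLAS",
                      plan_class: str = "vbox64_mt_mcore_atlas") -> str:
    """Return the text of a BOINC app_config.xml that raises the VM memory.

    app_name and plan_class are ASSUMED defaults; compare them with the
    values listed in client_state.xml on your own host before using them.
    """
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # ASSUMPTION: vboxwrapper reads the VM memory size from this option.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 5000 MB is the value suggested in the post above.
    print(render_app_config(5000))
```

The printed text would be saved as app_config.xml in the LHC@home project folder of the BOINC data directory, and the client then needs to re-read its config files. Tasks that are already running most likely keep their old setting, so judge the effect on newly started tasks.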
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
Quote: "would be nice if we had some detailed description somewhere as to what 'error Code 65' is." Hi Erich, When looking at the stderr of a task with error code 65, find the line that starts with "WARNING Transform now exiting early with exit code 65". At the very end of this line (you need to scroll all the way to the right), there are some details that can help. I saw the same error as for Jim: "FATAL makePool failed". This has the same root cause: not enough RAM allocated (the default 3400 MB is not sufficient). Hope it helps. We are the product of random evolution. |
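The lookup described above (find the "exiting early" line and read what follows the exit code) can be sketched in Python as below. This is only an illustration: the function name `find_early_exit_details` is hypothetical, and the sample line is stitched together from fragments quoted in this thread, so the layout of a real stderr line may differ.

```python
import re
from typing import List, Tuple

# Matches the warning text quoted in the post above, wherever it sits in the
# line, so a leading timestamp or logger prefix does not matter.
EXIT_LINE = re.compile(
    r"WARNING\s+Transform now exiting early with exit code (\d+)"
)


def find_early_exit_details(stderr_text: str) -> List[Tuple[int, str]]:
    """Return (exit_code, trailing_detail) for every matching stderr line."""
    results = []
    for line in stderr_text.splitlines():
        match = EXIT_LINE.search(line)
        if match:
            code = int(match.group(1))
            # Whatever follows the exit code is the part you would otherwise
            # have to scroll right to read.
            detail = line[match.end():].strip()
            results.append((code, detail))
    return results


if __name__ == "__main__":
    # HYPOTHETICAL sample line, not copied from a real task log.
    sample = (
        "some earlier log line\n"
        "WARNING Transform now exiting early with exit code 65 "
        "(Non-zero return code from EVNTtoHITS (65); FATAL makePool failed)\n"
    )
    for code, detail in find_early_exit_details(sample):
        print(code, detail)
```

Pasting the stderr text of a failed task into `find_early_exit_details` saves the horizontal scrolling. Whether the detail really points to a memory shortage still has to be judged by reading it.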
Send message Joined: 28 Sep 04 Posts: 711 Credit: 47,551,875 RAC: 32,208 |
Hi Jim, Here is a task that also has error 65 and "FATAL makePool failed": https://lhcathome.cern.ch/lhcathome/result.php?resultid=153342993 This was run with a memory setting of 4500 MB (single core). The task was validated OK. I have not increased the memory to 5000 MB. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
HerveUAE, thanks for the analysis/comparison of Jim's and my stderr, with the result that there is not enough RAM per task. This seems very interesting and almost unbelievable, because I have been crunching these 1-core ATLAS tasks for several months without any problems. If lack of memory is the reason for my problem, that would imply that as of 2 days ago there was a change in the RAM requirement of the ATLAS tasks. If this were the case, would not many more crunchers be affected? (I doubt that so many crunchers had implemented an app_config for more RAM to begin with, as this was not necessary - when I was crunching 2-core ATLAS tasks, I needed to increase the RAM per task via app_config, but never for 1-core tasks.) |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
Okay, I wanted to try it with a higher memory setting (5000 MB) via app_config.xml. However, whereas yesterday the download of an ATLAS task took about half an hour but finally succeeded, today I was NOT able to download a new task. The notices in the BOINC Manager are as follows: 29/07/2017 21:11:25 | LHC@home | Started download of jf_ee87d047116fce70cf9b9e5221a84fc6 29/07/2017 21:11:48 | LHC@home | Temporarily failed download of jf_ee87d047116fce70cf9b9e5221a84fc6: connect() failed 29/07/2017 21:11:48 | LHC@home | Backing off 00:14:12 on download of jf_ee87d047116fce70cf9b9e5221a84fc6 29/07/2017 21:11:51 | | Project communication failed: attempting access to reference site 29/07/2017 21:11:53 | | Internet access OK - project servers may be temporarily down. Even retrying several times did not help. So, this is definite proof that there is a major connectivity problem. They must have made some change to their server two days ago. Once again, just FYI: I did NOT make any changes, neither in hardware nor in software. And CMS tasks are running well, as usual. |
Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
Hi Jim, Erich, I have tried several memory settings over time. My own experience is that the RAM requirement is not fixed and varies from one ATLAS task to another. The higher the allocated RAM, the lower the probability that the task will fail. However, from time to time, one task out of many will still fail. I personally think it does not depend on the number of allocated cores, but on the ATLAS algorithm itself. And it could very well be that a given set of tasks has a higher RAM requirement than other sets. I personally have set the RAM to 7000 MB and very seldom have issues related to a lack of memory. My laptop has only 8 GB, so I could allocate only 5800 MB to ATLAS. In recent days, I have not had any memory-related problems on that machine. There were some extensive tests and discussions in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4146#29171 where 5000 MB was suggested as a good minimum. Try increasing progressively and see if it helps. We are the product of random evolution. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
Thanks, HerveUAE. However, as I wrote above, I now am not able to try any RAM setting whatsoever, not being able to connect to the ATLAS download server. |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
Hi Jim, Erich, HerveUAE, Thanks. I remember that discussion of a few months ago, and in fact I did use an app_config.xml at that time to fix the problem. I had sort of assumed that was no longer necessary, and had forgotten about it, but thanks for the reminder. |
Send message Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
Looks like the ATLAS servers have some issue; I am having problems downloading tasks. For some of the tasks that are running, I am not able to open the VM console, and the tasks where I could open it show no events done. Could we get a status on what the issue could be? (This is only related to ATLAS, and not all tasks are affected.) |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
I was able to download an ATLAS task. So at least they got this problem solved. However, the other problem still remains: the task failed after 16 minutes with error Code 65. This is the same as what I had some 2 days before the download problems started. The stderr can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153369828 |
Send message Joined: 14 Jan 10 Posts: 1378 Credit: 9,162,540 RAC: 5,071 |
Quote: "However, the other problem still remains: the task failed after 16 minutes: error Code 65." Erich, could you try a (much) newer VBox version: https://www.virtualbox.org/wiki/Downloads |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
I have now followed the advice of HerveUAE and indeed, after I increased the memory to 5000 MB via app_config.xml, everything works fine now. So, apparently for the first time, the 1-core ATLAS tasks need more memory than is allocated by the BOINC default. Why so? |
Send message Joined: 28 Sep 04 Posts: 711 Credit: 47,551,875 RAC: 32,208 |
I raised the memory allocation to 6000 MB and still got "Non-zero return code from EVNTtoHITS (65)" (error code 65) and "FATAL makePool failed" for AthMpEvtLoopMgr.SharedEvtQueueProvider. The task is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153371975 It was very short (16 minutes CPU time) and produced no HITS file, but it was validated. Anyway, one task (https://lhcathome.cern.ch/lhcathome/result.php?resultid=153354120) did finish OK with a HITS file before that short one. So I am not satisfied with the memory explanation. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
Quote: "I raised the memory allocation to 6000 MB and still got the Non-zero return code from EVNTtoHITS (65) (Error code 65) and FATAL makePool failed" This is strange, indeed. Here, the two ATLAS tasks which I downloaded and started about 5 1/2 hours ago are still running fine. However, two things are different compared to before: (1) after 5 1/2 hours, progress is shown as only 55%, so the tasks seem to run longer now (before, it was between 6 1/2 and 7 hours); (2) the VM console opens, but after pressing Alt+F2, the usual information is NOT shown. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
from what I can see so far, the latest ATLAS tasks (since July 26) use almost 1 GB more memory than the ones before. |
Send message Joined: 15 Jun 08 Posts: 2509 Credit: 248,921,808 RAC: 129,940 |
At least the download issues regarding boincai04 seem to be solved. Nonetheless, the remaining errors mentioned here still persist. I think it makes no sense to stay with ATLAS until the errors are sorted out. I will change my setup to run the other subprojects for a while. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
Quote: "At least the download issues regarding boincai04 seem to be solved." From my recent experience, I can say that Harry Liljeroos was right with his posting ("Looking at the log of the tasks on your machine, I can see 2 traces of interest: ..."). After I had increased the memory to 5000 MB (via app_config.xml), all tasks that were downloaded succeeded. However, as I wrote in another thread this afternoon, each 1-core task now takes at least 1 GB more RAM than before. As the default setting is 3400 MB, any such task is bound to fail unless the RAM is increased "manually". But this is something the CERN people should have told us beforehand. I have no idea how many users are having a problem now and do not know right away how to solve it (I am talking about the ones who do not read the forum here). |
Send message Joined: 28 Sep 04 Posts: 711 Credit: 47,551,875 RAC: 32,208 |
I don't take credit for that post; it was HerveUAE who posted that. So credit to whom it belongs ;). I just commented on it. |
Send message Joined: 18 Dec 15 Posts: 1752 Credit: 115,677,783 RAC: 87,041 |
Quote: "I don't take credit for that post, it was HerveUAE who posted that..." Sorry for the mix-up on my side :-( |
©2024 CERN