Thread 'all ATLAS tasks fail after about 10 minutes'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31698 - Posted: 29 Jul 2017, 15:13:04 UTC
It would be nice if we had a detailed description somewhere of what "error Code 65" is. At any rate, something was changed on the server at CERN two days ago, and a small number of crunchers are affected by it. Too bad - I have been crunching ATLAS for more than a year, and now it's over :-(
ID: 31698

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 31700 - Posted: 29 Jul 2017, 16:51:12 UTC - in response to Message 31697.
Hi Jim,
Looking at the log of the tasks on your machine, I can see 2 traces of interest:
Setting Memory Size for VM. (3400MB)
You need more than 3400 MB to run ATLAS tasks. You are using the default RAM setting for 1-CPU tasks, which is not enough. You should write your own app_config.xml file to override the default setting and set it to, say, 5000MB.
FATAL makePool failed
From my own experience, this error occurs when you do not have enough RAM allocated, confirming the above observation.
We are the product of random evolution.
ID: 31700
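Editor's note: the app_config.xml override suggested above could look roughly like what the sketch below generates. This is only an illustration under stated assumptions: the thread does not show the actual file, so the app name "ATLAS", the plan class "vbox64_mt_mcore_atlas" and the helper name build_app_config are assumptions to check against your own client (the real app name and plan class are listed in client_state.xml). The --memory_size_mb option is passed to the VirtualBox wrapper through the cmdline element.

```python
import xml.etree.ElementTree as ET


def build_app_config(memory_mb, ncpus=1,
                     app_name="ATLAS",                    # assumed app name
                     plan_class="vbox64_mt_mcore_atlas"):  # assumed plan class
    """Return the text of an app_config.xml that raises the VM memory size."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Option handed to the VirtualBox wrapper: memory for the VM, in MB.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 5000 MB is the value suggested in the post above.
    print(build_app_config(5000))
```

The printed text would be saved as app_config.xml in the LHC@home project folder inside the BOINC data directory, and the client then told to re-read its config files (or restarted). The new memory size normally applies only to tasks started after that.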

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 31701 - Posted: 29 Jul 2017, 16:58:38 UTC - in response to Message 31698.
"would be nice if we had some detailed description somewhere as to what "error Code 65" is."
Hi Erich,
When looking at the stderr with error code 65, look at the line that starts with
WARNING Transform now exiting early with exit code 65
At the very end of this line (you need to scroll all the way to the right), there are some details that can be of help. I saw the same error as for Jim:
FATAL makePool failed
This has the same root cause: not enough RAM allocated (the default 3400MB is not sufficient).
Hoping it helps.
We are the product of random evolution.
ID: 31701
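Editor's note: instead of scrolling to the far right of that line by hand, the detail can be pulled out with a few lines of Python. This is a minimal sketch, assuming the stderr has been saved as plain text and that the line contains the phrase quoted above. The function name find_exit_details and the sample line are made up for illustration; real stderr lines may be laid out differently.

```python
import re

# Phrase quoted in the post above; the captured group is the exit code.
EXIT_LINE = re.compile(r"Transform now exiting early with exit code (\d+)")
# Anything from "FATAL" to the end of the line is treated as the detail.
FATAL = re.compile(r"FATAL\s+.*")


def find_exit_details(stderr_text):
    """Return a list of (exit_code, detail) for every early-exit line found."""
    results = []
    for line in stderr_text.splitlines():
        match = EXIT_LINE.search(line)
        if not match:
            continue
        tail = line[match.end():].strip()
        fatal = FATAL.search(tail)
        # Prefer the FATAL message; fall back to the rest of the line.
        detail = fatal.group(0).strip() if fatal else tail
        results.append((int(match.group(1)), detail))
    return results


if __name__ == "__main__":
    # Illustrative line only, pieced together from messages quoted in this thread.
    sample = ("WARNING Transform now exiting early with exit code 65 "
              "Non-zero return code from EVNTtoHITS (65); FATAL makePool failed")
    for code, detail in find_exit_details(sample):
        print(code, detail)
```

As later posts in this thread show, the same FATAL message also turned up on tasks given 4500 MB and 6000 MB, so the extracted detail is a hint rather than a diagnosis.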

Harri Liljeroos Joined: 28 Sep 04 Posts: 803 Credit: 65,577,448 RAC: 23,231	Message 31702 - Posted: 29 Jul 2017, 18:14:48 UTC - in response to Message 31700.
"Hi Jim, Looking at the log of the tasks on your machine, I can see 2 traces of interest: Setting Memory Size for VM. (3400MB) You need more than 3400 MB to run ATLAS tasks. You are using the default RAM setting for 1-CPU tasks, which is not enough. You should write your own app_config.xml file to override the default setting and set it to, say, 5000MB. FATAL makePool failed From my own experience, this error occurs when you do not have enough RAM allocated, confirming the above observation."
Here is a task that also has error 65 and FATAL makePool failed: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153342993
This was run with the memory setting at 4500 (single core). The task was validated OK. I have not increased the memory to 5000 MB.
ID: 31702

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31703 - Posted: 29 Jul 2017, 18:24:51 UTC
HerveUAE, thanks for the analysis/comparison of Jim's and my stderr, with the result that there is not enough RAM per task. This is very interesting and almost unbelievable, because I have been crunching these 1-core ATLAS tasks for several months without any problems. If lack of memory is the reason for my problem, that would imply that the RAM requirement of the ATLAS tasks changed 2 days ago. If this were the case, would not many more crunchers be affected? (I doubt that many crunchers had implemented an app_config for more RAM to begin with, as this was not necessary - when I was crunching 2-core ATLAS tasks, I needed to increase the RAM per task via app_config, but never for 1-core tasks.)
ID: 31703

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31704 - Posted: 29 Jul 2017, 19:25:19 UTC
Okay, I wanted to try it with a higher memory setting (5000 MB) via app_config.xml. However, whereas yesterday the download of an ATLAS task took about half an hour but finally succeeded, today I was NOT able to download a new task. The notices in the BOINC Manager are as follows:
29/07/2017 21:11:25 | LHC@home | Started download of jf_ee87d047116fce70cf9b9e5221a84fc6
29/07/2017 21:11:48 | LHC@home | Temporarily failed download of jf_ee87d047116fce70cf9b9e5221a84fc6: connect() failed
29/07/2017 21:11:48 | LHC@home | Backing off 00:14:12 on download of jf_ee87d047116fce70cf9b9e5221a84fc6
29/07/2017 21:11:51 | | Project communication failed: attempting access to reference site
29/07/2017 21:11:53 | | Internet access OK - project servers may be temporarily down.
Even retrying several times did not help. So, this is definite proof that there is a major connectivity problem. They must have made some change to their server two days ago.
Once again, just FYI: I did NOT make any changes, neither in hardware nor in software. And CMS tasks are running well as usual.
ID: 31704

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 31705 - Posted: 29 Jul 2017, 19:40:46 UTC
Hi Jim, Erich,
I have tried several memory settings over time. My own experience is that the RAM requirement is not fixed and varies from one ATLAS task to another. The higher the allocated RAM, the lower the probability that the task will fail. However, from time to time, one task out of many will fail. I personally think it does not depend on the number of allocated cores, but on the ATLAS algorithm itself. And it could very well be that a given set of tasks has a higher RAM requirement than other sets.
I personally have set the RAM to 7000MB and very seldom have issues related to a lack of memory. My laptop has only 8 GB, so I could allocate only 5800MB to ATLAS. In recent days, I have not had any memory-related problems on that machine.
There were some extensive tests and discussions in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4146#29171 where 5000MB was suggested as a good minimum. Try increasing progressively and see if it helps.
We are the product of random evolution.
ID: 31705

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31706 - Posted: 29 Jul 2017, 19:50:09 UTC - in response to Message 31705.
Thanks, HerveUAE. However, as I wrote above, I am now unable to try any RAM setting whatsoever, since I cannot connect to the ATLAS download server.
ID: 31706

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 31708 - Posted: 29 Jul 2017, 21:02:58 UTC - in response to Message 31705. Last modified: 29 Jul 2017, 21:04:19 UTC
"Hi Jim, Erich, I have tried several memory settings over time. My own experience is that the RAM requirement is not fixed and varies from one ATLAS task to another. The higher the allocated RAM, the lower the probability that the task will fail. However, from time to time, one task out of many will fail. I personally think it does not depend on the number of allocated cores, but on the ATLAS algorithm itself. And it could very well be that a given set of tasks has a higher RAM requirement than other sets. I personally have set the RAM to 7000MB and very seldom have issues related to a lack of memory. My laptop has only 8 GB, so I could allocate only 5800MB to ATLAS. In recent days, I have not had any memory-related problems on that machine. There were some extensive tests and discussions in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4146#29171 where 5000MB was suggested as a good minimum. Try increasing progressively and see if it helps."
HerveUAE,
Thanks. I remember that discussion from a few months ago, and in fact I did use an app_config.xml at that time to fix the problem. I had sort of assumed that it was no longer necessary and had forgotten about it, but thanks for the reminder.
ID: 31708

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 31709 - Posted: 29 Jul 2017, 21:18:32 UTC Last modified: 29 Jul 2017, 21:21:42 UTC
It looks like the ATLAS servers have some issue; I am having problems downloading tasks. For some of the tasks that are running, I'm not able to open the VM console, and the ones I could open show no events done.
Could we get a status update on what the issue could be? (This is only related to ATLAS, and not all tasks are affected.)
ID: 31709

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31741 - Posted: 31 Jul 2017, 13:31:25 UTC Last modified: 31 Jul 2017, 13:34:03 UTC
I was able to download an ATLAS task. So at least they got this problem solved.
However, the other problem still remains: the task failed after 16 minutes with error Code 65. This is the same as what I had some 2 days before the download problems started.
The stderr can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153369828
ID: 31741

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1552 Credit: 10,070,137 RAC: 576	Message 31742 - Posted: 31 Jul 2017, 14:41:59 UTC - in response to Message 31741.
"However, the other problem still remains: the task failed after 16 minutes with error Code 65. This is the same as what I had some 2 days before the download problems started. The stderr can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153369828"
Erich, could you try a (much) newer VBox version: https://www.virtualbox.org/wiki/Downloads
ID: 31742

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31743 - Posted: 31 Jul 2017, 14:55:06 UTC - in response to Message 31701. Last modified: 31 Jul 2017, 15:35:01 UTC
I have now followed the advice of HerveUAE:
"Hi Erich, When looking at the stderr with error code 65, look at the line that starts with WARNING Transform now exiting early with exit code 65 At the very end of this line (you need to scroll all the way to the right), there are some details that can be of help. I saw the same error as for Jim: FATAL makePool failed This has the same root cause: not enough RAM allocated (the default 3400MB is not sufficient). Hoping it helps."
And indeed, after I increased the memory to 5000MB via app_config.xml, all works fine now.
So, obviously, for the first time, the 1-core ATLAS tasks need more memory than is allocated by the BOINC default. Why so?
ID: 31743

Harri Liljeroos Joined: 28 Sep 04 Posts: 803 Credit: 65,577,448 RAC: 23,231	Message 31744 - Posted: 31 Jul 2017, 17:58:02 UTC
I raised the memory allocation to 6000 MB and still got "Non-zero return code from EVNTtoHITS (65)" (error code 65) and "FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider". The task is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153371975
It was very short (16 minutes of CPU time) and produced no HITS file, but it was validated. Before that short one, one task (https://lhcathome.cern.ch/lhcathome/result.php?resultid=153354120) did finish OK with a HITS file.
So I am not satisfied with the memory explanation.
ID: 31744

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31745 - Posted: 31 Jul 2017, 19:22:00 UTC - in response to Message 31744.
"I raised the memory allocation to 6000 MB and still got the Non-zero return code from EVNTtoHITS (65) (Error code 65) and FATAL makePool failed"
This is strange, indeed. Here, the two ATLAS tasks which I downloaded and started about 5 1/2 hours ago are still running fine. However, two things are different compared to before:
- after 5 1/2 hours, progress is shown as only 55%, so the tasks seem to run longer now (before, the total was between 6 1/2 and 7 hours)
- the VM console opens, but after pressing "Alt" "F2", the usual information is NOT shown.
ID: 31745

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31771 - Posted: 1 Aug 2017, 13:18:02 UTC
From what I can see so far, the latest ATLAS tasks (since July 26) use almost 1 GB more memory than the ones before.
ID: 31771

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2743 Credit: 302,456,610 RAC: 86,594	Message 31772 - Posted: 1 Aug 2017, 14:41:33 UTC
At least the download issues regarding boincai04 seem to be solved. Nonetheless, the remaining errors mentioned here still persist.
I think it makes no sense to stay with ATLAS until the errors are sorted out. I will change my setup to run the other subprojects for a while.
ID: 31772

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31773 - Posted: 1 Aug 2017, 15:43:20 UTC - in response to Message 31772. Last modified: 1 Aug 2017, 15:43:50 UTC
"At least the download issues regarding boincai04 seem to be solved. Nonetheless the remaining errors"
From my recent experience, I can say that Harri Liljeroos was right with his posting:
"Looking at the log of the tasks on your machine, I can see 2 traces of interest: Setting Memory Size for VM. (3400MB) You need more than 3400 MB to run ATLAS tasks. You are using the default RAM setting for 1-CPU tasks, which is not enough. You should write your own app_config.xml file to override the default setting and set it to, say, 5000MB. FATAL makePool failed From my own experience, this error occurs when you do not have enough RAM allocated, confirming the above observation."
After I had increased the memory to 5000MB (via app_config.xml), all tasks that were downloaded succeeded.
However, as I wrote in another thread this afternoon, each 1-core task now takes at least 1 GB more RAM than before. As the default setting is 3400MB, any such task is bound to fail unless the RAM is increased "manually".
But this is something the CERN people should have told us beforehand. I have no idea how many users are having a problem now without knowing right away how to solve it (I am talking about the ones who do not read the forum here).
ID: 31773

Harri Liljeroos Joined: 28 Sep 04 Posts: 803 Credit: 65,577,448 RAC: 23,231	Message 31775 - Posted: 1 Aug 2017, 18:06:27 UTC - in response to Message 31773.
I can't take credit for that post; it was HerveUAE who posted it. So credit where it is due ;). I just commented on it.
ID: 31775

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,721,371 RAC: 41,829	Message 31776 - Posted: 1 Aug 2017, 19:44:10 UTC - in response to Message 31775.
"I don't take credit for that post, it was HerveUAE who posted that..."
Sorry for the mix-up on my side :-(
ID: 31776