Message boards :
ATLAS application :
all ATLAS tasks fail after about 10 minutes
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
For the past few hours, I have noticed that all ATLAS tasks are failing after about 12-14 minutes. One example here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153325622 Any idea what the problem could be? |
Joined: 28 Sep 04 Posts: 728 Credit: 49,068,222 RAC: 27,327 |
Mine are running normally, no problems so far today. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
One example here: I am not an expert in understanding the stderr report, but obviously the problem began here:

2017-07-26 09:37:00 (4696): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_3409_1501054006/PandaJob/athena_stdout.txt -
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe.preExecute 2017-07-26 09:28:57,321 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe.preExecute 2017-07-26 09:28:57,323 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-07-26 09:28:57,323 INFO Valgrind not engaged
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe.preExecute 2017-07-26 09:28:57,323 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe.execute 2017-07-26 09:28:57,324 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe.execute 2017-07-26 09:34:27,512 INFO EVNTtoHITS executor returns 65
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe.validate 2017-07-26 09:34:28,522 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.trfExe.validate 2017-07-26 09:34:28,564 INFO Scanning logfile log.EVNTtoHITS for errors
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.transform.execute 2017-07-26 09:34:28,917 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-07-26 09:37:00 (4696): Guest Log: PyJobTransforms.transform.execute 2017-07-26 09:34:32,049 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
2017-07-26 09:37:00 (4696): Guest Log: - Walltime - |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
I now downloaded a new ATLAS task in order to test it. Same behaviour as the others from this morning: it failed and stopped after 13 minutes. See here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153329396 Is there something wrong with my system all of a sudden? I remember from previous such cases with tasks that stopped after this short period of time that there was some connectivity problem or similar. Can anyone tell from the stderr what the problem is? What catches my eye, for example, is the entry

2017-07-26 16:51:52 (2932): Guest Log: PyJobTransforms.trfExe.validate 2017-07-26 16:49:04,394 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)

What does error code 65 mean? |
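[For anyone who wants to pull the key facts out of a long stderr report like the ones quoted in this thread, a small script can filter for the "executor returns" code. This is only a convenience sketch by the editor, not a project tool; the sample line below is copied from the report quoted above.]

```python
import re

# One line copied from the stderr report quoted above, used as sample input.
stderr_text = """\
2017-07-26 16:51:52 (2932): Guest Log: PyJobTransforms.trfExe.execute 2017-07-26 16:49:04,394 INFO EVNTtoHITS executor returns 65
"""

def executor_return_code(text):
    """Return the first 'executor returns N' code found in the text, or None."""
    m = re.search(r"executor returns (\d+)", text)
    return int(m.group(1)) if m else None

print(executor_return_code(stderr_text))  # -> 65
```

[Paste the whole stderr page into `stderr_text` (or read it from a file) to get the code without scrolling through the log by hand.]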
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
I now started an ATLAS task on one of my other PCs, on which I have so far crunched CMS successfully. The first strange thing I noticed right away was that the download of the 117 MB file took some 28 minutes - while my Internet connection runs at a download speed of ~155 Mbit/s. And also here, as on the other system since yesterday, the task failed after some 13 minutes. And again, error code 65 is in play:

2017-07-27 14:39:36 (11964): Guest Log: PyJobTransforms.trfExe.execute 2017-07-27 14:37:24,559 INFO EVNTtoHITS executor returns 65
2017-07-27 14:39:36 (11964): Guest Log: PyJobTransforms.trfExe.validate 2017-07-27 14:37:25,573 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-07-27 14:39:36 (11964): Guest Log: PyJobTransforms.trfExe.validate 2017-07-27 14:37:25,612 INFO Scanning logfile log.EVNTtoHITS for errors
2017-07-27 14:39:36 (11964): Guest Log: PyJobTransforms.transform.execute 2017-07-27 14:37:25,960 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-07-27 14:39:36 (11964): Guest Log: PyJobTransforms.transform.execute 2017-07-27 14:37:29,177 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
2017-07-27 14:39:36 (11964): Guest Log: - Walltime -

The complete stderr log can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153336542 So the problem is not restricted to only one of my PCs, but obviously occurs on all of them. Particularly in view of the above-mentioned download time of 28 minutes for a 117 MB file, I am pretty sure that there is some server/connectivity problem. Was there a recent change of the ATLAS server address (i.e. 2 days ago)? Does anyone know the name or IP address of the ATLAS server, so that I could ping it as a test? BTW: I have now switched back to CMS tasks; they run without any problem. |
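[As a sanity check on the download speed mentioned above: 117 MB in roughly 28 minutes works out to well under 1 Mbit/s, nowhere near a 155 Mbit/s line. A quick back-of-the-envelope calculation, assuming decimal megabytes:]

```python
size_mbit = 117 * 8    # 117 MB expressed in megabits (decimal MB assumed)
elapsed_s = 28 * 60    # the ~28 minutes reported above, in seconds
line_mbit_s = 155      # nominal line speed from the post above

effective = size_mbit / elapsed_s     # actual throughput achieved
expected_s = size_mbit / line_mbit_s  # time the download "should" take

print(f"effective throughput: {effective:.2f} Mbit/s")    # ~0.56 Mbit/s
print(f"expected time at 155 Mbit/s: {expected_s:.0f} s") # ~6 s
```

[So the download ran at roughly 1/280 of the line's nominal speed, which does point at the server side or the path to it rather than the local link.]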
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I also see a few of them marked "Validate error" that run for under 20 minutes on my machine. They also have the "PyJobTransforms.trfExe.validate" error 65. But most of mine are working OK, and I don't see any problems downloading or uploading here. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10477864&offset=0&show_names=0&state=0&appid=14 Curiously, both of the ones thus far that have validated on other machines were on Windows 10 machines, and I thought it was a Linux problem. But I see that you are running Windows 10, so it seems to be on both. And it just started yesterday, so there is something happening. |
Joined: 2 Sep 04 Posts: 455 Credit: 201,266,309 RAC: 27,995 |
@Erich Because of your post I had a close look at my clients, but it seems all of mine are doing fine. One possibility is that ATLAS has accidentally used a backend server that is not on the list of your firewall exceptions, if you are using one. I have seen a small number of tasks that seem to have been aborted by the server, so maybe your post has already initiated some action at CERN. Supporting BOINC, a great concept ! |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
@Yeti, many thanks for your hints. I have no firewall other than the one that comes with Windows 10. And what seems strange to me: the endless download of an ATLAS task was successful after all; I would guess that if any kind of firewall were blocking a server that ATLAS might have changed to 2 days ago, then I wouldn't even be able to connect to it, right? From what I understand, an ATLAS task, while running, needs to be able to establish a connection to the server at certain intervals within a certain time (no idea: is it 60 seconds, or less, or more?); and if this connection cannot be established, the task breaks off. So this seems to be exactly what happens here - probably NOT because a connection cannot be established at all, but rather because it cannot be established quickly enough. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I am continuing to receive those validate errors with Code 65 more regularly now. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10477864&offset=0&show_names=0&state=0&appid=14 About the time they started two days ago, I had updated my Linux kernel from 4.8.0 to 4.10.0.27. That might have something to do with it. I will just remove ATLAS from my preferences for now. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
About the time they started two days ago, I had updated my Linux kernel from 4.8.0 to 4.10.0.27. That might have something to do with it. In my case, I did NOT make any changes, nothing whatsoever, on either PC. And yet ATLAS started failing 2 days ago. Since I guess that the two of us are not the only ones to experience these problems (some others may not have noticed them yet), I wonder how long it will take them at CERN to find out that something is wrong. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
Since I guess that the two of us are not the only ones to experience these problems (some others may not have noticed them yet), I wonder how long it will take them at CERN to find out that something is wrong. I hope sooner rather than later. But validator problems are often hard to solve. They are much better at it at CERN than some other projects. Note that you are comparing two numbers that are normally very closely matched. Any small differences can cause an error. And the differences may be due to hardware or software changes at either end. And they could be network problems too, though I have not noticed anything here. But any small glitch might do it. They probably won't (and shouldn't) get too worried about it until it affects a large number of people. Maybe that is happening now, or maybe not. (But since these fail after a short time period, it is hopefully something more obvious that they can catch sooner.) |
Joined: 28 Sep 04 Posts: 728 Credit: 49,068,222 RAC: 27,327 |
Here is one ATLAS task that I crunched that has the same errors as Erich's tasks: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153342993. The difference is that this one validated; CPU time was 23.5 minutes, ending with error code 65. And no HITS file either. This was on a Win 7 x64 laptop. |
Joined: 28 Sep 04 Posts: 728 Credit: 49,068,222 RAC: 27,327 |
Here's another Atlas task that finished on the same laptop as above: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153343096 This ran for 2.5 hours (CPU time) and validated OK with HITS file and everything, but it didn't show any job information with Alt+F2 on the console window. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
To me this all seems to be a connectivity problem between the host and the server at CERN. |
Joined: 28 Sep 04 Posts: 728 Credit: 49,068,222 RAC: 27,327 |
To me this all seems to be a connectivity problem between the host and the server at CERN. I totally agree. Too bad that the stderr does not tell us all the steps the task (or the events) have gone through, where they failed, and how to resolve the problem. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,529,587 RAC: 32,120 |
The strange thing, though, is that obviously only a small number of crunchers is affected (otherwise there would be a lot more such complaint postings here, and the CERN people would have noticed it already - which seemingly is not the case). The problem may occur only under very specific or rare circumstances or particular settings (maybe not even ones made by the cruncher, but rather by the ISP). That's why it would be so important to know the IP address of the server, in order to try out things like ping, traceroute, etc. |
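[Short of knowing the ATLAS backend hostnames asked about above, one can at least time a raw TCP connection to the BOINC server that appears in the result links in this thread. A minimal sketch using Python's standard socket module; `lhcathome.cern.ch` is taken from those URLs, and the actual ATLAS data servers may well be different hosts:]

```python
import socket
import time

def tcp_connect_time(host, port=443, timeout=5.0):
    """Time how long opening a plain TCP connection to host:port takes.

    Returns the elapsed seconds as a float, or None if the connection
    fails or times out. This measures only connection setup, not
    transfer speed, so it complements a ping/traceroute check.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

if __name__ == "__main__":
    # lhcathome.cern.ch is the BOINC server seen in the result URLs in
    # this thread; the ATLAS backend hostnames are not known here.
    t = tcp_connect_time("lhcathome.cern.ch")
    print("unreachable" if t is None else f"connected in {t:.3f} s")
```

[If this connects quickly while tasks still fail, the bottleneck is more likely on whichever separate backend host the tasks talk to, which supports the request above for its name or IP.]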
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
Since this problem does not seem to be solving itself, I thought I would try some other tests. I added a second machine to LHC, an i5-3550 also running Ubuntu 16.04 with Linux 4.10.0.27 and VirtualBox 5.1.26, as with the i7-4790 machine. But the i5-3550 PC has a wired network connection, whereas the i7-4790 is on WiFi. However, the result (after only one work unit) is the same. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10492036 I am not quite done yet and want to try out some other combinations of hardware and software, but this certainly points to a problem external to our machines. But the ATLAS work unit downloaded fine at 500 Kbps for two files simultaneously, with BOINC set for a 1000 Kbps total download speed. So it is nothing obvious with the network yet. |
Joined: 2 Sep 04 Posts: 455 Credit: 201,266,309 RAC: 27,995 |
However, the result (after only one work unit) is the same. Did you already take a walk through my checklist ? Supporting BOINC, a great concept ! |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
Did you already take a walk through my checklist ? Thanks, but that machine has been running LHC-dev just fine for several weeks. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=2137 Unfortunately, LHC-dev has not sent me any ATLAS recently though. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I was hoping that things would be different on my Win7 64-bit machine (i7-4771, VirtualBox 5.1.26, wired network connection), but no luck there either. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10459739 |
©2024 CERN