Thread '-152 (0xFFFFFF68) ERR_NETOPEN and 206 (0x000000CE) EXIT_INIT

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,774,960 RAC: 39,648	Message 33007 - Posted: 6 Nov 2017, 20:21:11 UTC
For about 10 days now, CMS and LHCb tasks have sometimes errored out after about 10 to 14 minutes, with the task log showing either -152 (0xFFFFFF68) ERR_NETOPEN or 206 (0x000000CE) EXIT_INIT_FAILURE. This happens on all 3 of the PCs on which I am crunching the VBox projects CMS and LHCb. It does NOT happen with ATLAS. The 3 PCs have different operating systems and different versions of VBox (the oldest is 5.1.6, the newest is 5.2.0). Until about 10 days ago, this problem did not occur. A few other people have also reported the same problems in the CMS and/or LHCb threads.
With -152 (0xFFFFFF68) ERR_NETOPEN, a typical note in the stderr is:
[ERROR] Could not connect to Condor server on port 9618
For an example see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163791182
With 206 (0x000000CE) EXIT_INIT_FAILURE, a typical note in the stderr is:
[ERROR] Condor exited after 966s without running a job
For an example see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163697914
And then, as I notice right now, there is a third type of error: 207 (0x000000CF) EXIT_NO_SUB_TASKS. Here a typical note in the stderr is:
**** condor_startd (condor_STARTD) pid 4116 EXITING WITH STATUS 0
[ERROR] No jobs were available to run.
For an example see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163788045
Maybe there is a connection problem with the Condor server once in a while, since the error log often says "[ERROR] Could not connect to Condor server on port 9618". Therefore, I have run "ping vccondor01.cern.ch" numerous times on all those PCs, and it always worked well; I got a connection each time. So the problem must be somewhere else. Does anyone have any idea what it could be?
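
A note on the notation used in this thread: the hexadecimal value shown beside each exit code is the same number written as an unsigned 32-bit value, which is why the negative code -152 appears as 0xFFFFFF68. A minimal Python sketch of that conversion follows. The table holds only the three codes quoted above, and the function names are made up for illustration; they are not taken from BOINC.

```python
# Exit codes and names exactly as quoted in this thread (not a complete list).
EXIT_CODES = {
    -152: "ERR_NETOPEN",
    206: "EXIT_INIT_FAILURE",
    207: "EXIT_NO_SUB_TASKS",
}


def to_hex32(code: int) -> str:
    """Write a signed exit code as an unsigned 32-bit hex string."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def from_hex32(text: str) -> int:
    """Turn a 32-bit hex string back into the signed exit code."""
    value = int(text, 16)
    # Values with the top bit set represent negative numbers.
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def describe(code: int) -> str:
    """Render a code the way the task pages show it."""
    return f"{code} ({to_hex32(code)}) {EXIT_CODES.get(code, 'unknown')}"


if __name__ == "__main__":
    for code in EXIT_CODES:
        print(describe(code))
```

Masking with 0xFFFFFFFF is what produces the two's-complement form, so the same helper works for positive and negative codes alike.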

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 939 Credit: 781,720,253 RAC: 76,849	Message 33008 - Posted: 6 Nov 2017, 22:00:54 UTC - in response to Message 33007. Last modified: 6 Nov 2017, 22:16:19 UTC I don't think ATLAS uses Condor for work submission? I've seen 206 sporadically for a long time. It is a CERN issue that occurs when there is no work on their side: BOINC will create a WU even if there is nothing for it to do. 207 looks the same, although I would say the task ran something and then couldn't get more work. I don't know how the Condor queues are filled or how to see their state, other than the page https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php. For me, I see about 20% failures on CMS and 18% on Theory; I don't know what the project, or we, would consider an acceptable error rate. Also, a 206/207 wastes about 10 minutes at a time, so it's not a huge waste of computing resources; it just makes it difficult to spot a real error if there is one.

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,774,960 RAC: 39,648	Message 33009 - Posted: 7 Nov 2017, 6:14:59 UTC Toby, many thanks for your comments and thoughts. I fully agree. The only thing that made me wonder is that these problems never occurred here before. These types of error reports were unknown to me until 10 or 14 days ago. So perhaps I have just been lucky so far. And yes, you're also right in saying that ATLAS does NOT use Condor (that's why such problems did not show up with ATLAS).

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 939 Credit: 781,720,253 RAC: 76,849	Message 33010 - Posted: 7 Nov 2017, 7:35:12 UTC I asked the CERN team last night; they are looking at it, but there is nothing to report so far. I noticed some stats from a while ago where I was seeing about the same level of 206 errors in January, so it has crept up a few % since then, although it could have dropped a lot in between without my noticing, as we had all the SixTrack issues.

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 33011 - Posted: 7 Nov 2017, 9:21:22 UTC All LHC tasks error on my 3 PCs (two Linux and one Windows 10), except Atlas and SixTrack. They do not fail, but the CPUs are doing nothing, as I can see from the Task Manager on Windows and the "top" command on Linux. Tullio

Harri Liljeroos Joined: 28 Sep 04 Posts: 803 Credit: 65,614,173 RAC: 22,619	Message 33012 - Posted: 7 Nov 2017, 10:27:40 UTC If you look at the LHCb tasks graph here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you'll see that every day there is a significant drop in the number of successful jobs and correspondingly a surge of failed jobs at the same time. I wonder if these are related to the Condor problems we are seeing.

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,774,960 RAC: 39,648	Message 33013 - Posted: 7 Nov 2017, 11:29:28 UTC - in response to Message 33012.
> If you look at the LHCb tasks graph here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you'll see that every day there is a significant drop in the number of successful jobs and correspondingly a surge of failed jobs at the same time. I wonder if these are related to the Condor problems we are seeing.
This really looks strange, indeed. What I wonder is whether this has not caught anyone's eye at CERN yet.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 939 Credit: 781,720,253 RAC: 76,849	Message 33014 - Posted: 7 Nov 2017, 16:49:42 UTC It has their attention :)

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,774,960 RAC: 39,648	Message 33075 - Posted: 18 Nov 2017, 15:50:09 UTC
The ERR_NETOPEN failures still occur, due to no connection to the Condor server:
2017-11-18 16:37:21 (2688): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] 1
2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [INFO] Shutting Down
Hopefully, the people at CERN will find out one day what the problem is.
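
The guest log above shows nc timing out on TCP port 9618, which is a different test from ping: ping sends ICMP echo requests and can succeed even when a particular TCP port is filtered or the service behind it is not answering. A rough way to repeat the TCP test from the host is sketched below in Python. The host name and port are the ones in the log; the 30-second timeout is only inferred from the log timestamps, and can_connect is a hypothetical helper name. A result obtained on the host does not necessarily match what the VM guest sees.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 30.0) -> Tuple[bool, str]:
    """Try to open a TCP connection and report whether it succeeded."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers timeouts, refused connections and name-resolution failures.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Host and port as they appear in the guest log quoted in this thread.
    ok, detail = can_connect("vccondor01.cern.ch", 9618)
    print("reachable" if ok else "NOT reachable", "-", detail)
```

Running this a few times during a burst of failures would show whether the timeouts are also visible from the host side.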

ritterm Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0	Message 33158 - Posted: 29 Nov 2017, 14:52:21 UTC - in response to Message 33075.
> still, the ERR_NETOPEN failures occur, due to no connection to Condor server:
> 2017-11-18 16:37:21 (2688): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
> 2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
> 2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] 1
> 2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect to Condor server on port 9618
> 2017-11-18 16:37:51 (2688): Guest Log: [INFO] Shutting Down
> hopefully, the people at CERN will find out one day what the problem is.
I just wanted to bump this up and report that I've been seeing bursts of these same errors recently. Maybe I should be posting this in the LHCb forum, because, for me, they've been occurring primarily, if not exclusively, on LHCb tasks.
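
For anyone wanting to count how often these bursts happen, the guest log lines quoted above have a regular shape: a timestamp, a number in parentheses, "Guest Log:", a level in brackets, then the message. The following is a minimal Python sketch that parses the five quoted lines and measures the time between the first line and the [ERROR] line. The regular expression is an assumption fitted to these five lines only, and the field name "ident" for the parenthesised number is made up here.

```python
import re
from datetime import datetime

# Pattern fitted to the five log lines quoted in this thread (an assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<ident>\d+)\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def parse_line(line: str):
    """Return a dict for a guest log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        "ident": int(match["ident"]),
        "level": match["level"],
        "msg": match["msg"],
    }


SAMPLE = """\
2017-11-18 16:37:21 (2688): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] 1
2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [INFO] Shutting Down
"""

records = [r for r in map(parse_line, SAMPLE.splitlines()) if r is not None]
errors = [r for r in records if r["level"] == "ERROR"]
# Seconds between the start of the connection test and the error line.
elapsed = (errors[0]["time"] - records[0]["time"]).total_seconds()

if __name__ == "__main__":
    print(f"{len(errors)} error line(s); test gave up after {elapsed:.0f} s")
    # prints: 1 error line(s); test gave up after 30 s
```

Fed with a whole stderr file instead of SAMPLE, the same loop would give a count of failed connection tests per task.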

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 33162 - Posted: 29 Nov 2017, 16:17:26 UTC I am letting an LHCb task and a Theory task run on my Windows 10 PC with its 22 GB RAM, although I can see from the Task Manager that they are using 0 CPU. Two-core Atlas tasks on the same PC run perfectly on its A10-6700 AMD CPU, and one-core Atlas tasks run on my 2 Linux boxes with their 8 GB RAM. What is the difference between Atlas tasks and all other LHC tasks (excluding SixTrack, which runs perfectly on all PCs)? Tullio

ritterm Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0	Message 33163 - Posted: 29 Nov 2017, 16:29:47 UTC - in response to Message 33158.
> I just wanted to bump this up and report that I've been seeing bursts of these same errors recently. Maybe I should be posting this is the LHCb forum, because, for me, they've been occurring primarily, if not exclusively, on LHCb tasks.
Perhaps I should have added that I'm seeing this LHCb behavior on two 16 GB RAM Linux hosts that are running two CMS, two Theory, and two LHCb tasks concurrently. No significant issues with CMS or Theory, as far as I can tell.

Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,774,960 RAC: 39,648	Message 33688 - Posted: 6 Jan 2018, 13:15:10 UTC As already posted in the LHCb section, this afternoon many of my LHCb tasks have errored out with either -152 (0xFFFFFF68) ERR_NETOPEN or 207 (0x000000CF) EXIT_NO_SUB_TASKS. Any idea what's going on at CERN?