-152 (0xFFFFFF68) ERR_NETOPEN and 206 (0x000000CE) EXIT_INIT_FAILURE
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1686 Credit: 100,339,091 RAC: 101,922 |
For about 10 days, CMS and LHCb tasks have sometimes errored out after about 10 to 14 minutes, with the task log showing either -152 (0xFFFFFF68) ERR_NETOPEN or 206 (0x000000CE) EXIT_INIT_FAILURE. This happens on all 3 of the PCs on which I crunch the VBox projects CMS and LHCb. It does NOT happen with ATLAS. The 3 PCs have different operating systems and different versions of VBox (the oldest is 5.1.6, the newest is 5.2.0). Until about 10 days ago, this problem did not occur. A few other people have reported the same problems in the CMS and/or LHCb threads. With -152 (0xFFFFFF68) ERR_NETOPEN, a typical note in the stderr is: [ERROR] Could not connect to Condor server on port 9618 For an example, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163791182 With 206 (0x000000CE) EXIT_INIT_FAILURE, a typical note in the stderr is: [ERROR] Condor exited after 966s without running a job For an example, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163697914 And, as I notice right now, there is a third type of error: 207 (0x000000CF) EXIT_NO_SUB_TASKS Here a typical note in the stderr is: **** condor_startd (condor_STARTD) pid 4116 EXITING WITH STATUS 0 [ERROR] No jobs were available to run. For an example, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163788045 Maybe there is a connection problem with the Condor server once in a while, since the error log often says "[ERROR] Could not connect to Condor server on port 9618". I therefore ran "ping vccondor01.cern.ch" numerous times on all those PCs, and it always worked; I got a connection each time. So the problem must be somewhere else. Does anyone have any idea what it could be? |
Send message Joined: 27 Sep 08 Posts: 798 Credit: 644,682,894 RAC: 235,435 |
I don't think ATLAS uses Condor for work submission. I've seen 206 sporadically for a long time. It is a CERN-side issue that occurs when there is no work on their side: BOINC will create WUs even if there is nothing for them to do. 207 looks the same, although I would say it ran something and then couldn't get more work. I don't know how the Condor queues are filled or how to see their state, other than via the page https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php For me, I see ca. 20% failures on CMS and 18% on Theory; I don't know what the project and we would consider an acceptable error rate. Also, each 206/207 wastes only about 10 min at a time, so it's not a huge waste of computer resources; it just makes it difficult to spot a real error if there is one. |
Send message Joined: 18 Dec 15 Posts: 1686 Credit: 100,339,091 RAC: 101,922 |
Toby, many thanks for your comments and thoughts. I fully agree. The only thing that made me wonder is that here, these problems never occurred before. These types of error reports were unknown to me until 10 or 14 days ago. So perhaps I was only lucky so far. And yes, you're also right when saying that ATLAS does NOT use Condor (that's why such problems did not show up with ATLAS). |
Send message Joined: 27 Sep 08 Posts: 798 Credit: 644,682,894 RAC: 235,435 |
I asked the CERN team last night; they are looking at it, but there is nothing to report so far. I found some stats from a while ago: I was seeing about the same level of 206 errors in January, so far it has crept up a few % since then, although it could have dropped a lot in between without my noticing, as we had all the SixTrack issues. |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
All LHC tasks error on my 3 PCs, two Linux and one Windows 10, save Atlas and SixTrack. They do not fail but the CPUs are doing nothing, as I see from the Task Manager on Windows and the "top" command on Linux. Tullio |
Send message Joined: 28 Sep 04 Posts: 674 Credit: 43,149,324 RAC: 16,013 |
If you look at the LHCb tasks graph here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you'll see that every day there is a significant drop in the number of successful jobs and correspondingly a surge of failed jobs at the same time. I wonder if these are related to the Condor problems we are seeing. |
Send message Joined: 18 Dec 15 Posts: 1686 Credit: 100,339,091 RAC: 101,922 |
"If you look at the LHCb tasks graph here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you'll see that every day there is a significant drop in the number of successful jobs and correspondingly a surge of failed jobs at the same time. I wonder if these are related to the Condor problems we are seeing." This really looks strange, indeed. What I am wondering is whether this has not yet caught anyone's eye at CERN. |
Send message Joined: 27 Sep 08 Posts: 798 Credit: 644,682,894 RAC: 235,435 |
It has their attention :) |
Send message Joined: 18 Dec 15 Posts: 1686 Credit: 100,339,091 RAC: 101,922 |
The ERR_NETOPEN failures still occur, because there is no connection to the Condor server: 2017-11-18 16:37:21 (2688): Guest Log: [DEBUG] Testing connection to Condor server on port 9618 2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress 2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] 1 2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect to Condor server on port 9618 2017-11-18 16:37:51 (2688): Guest Log: [INFO] Shutting Down Hopefully, the people at CERN will find out one day what the problem is. |
Send message Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0 |
"still, the ERR_NETOPEN failures occur, due to no connection to Condor server:" I just wanted to bump this up and report that I've been seeing bursts of these same errors recently. Maybe I should be posting this in the LHCb forum, because, for me, they've been occurring primarily, if not exclusively, on LHCb tasks. |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
I am letting an LHCb task and a Theory task run on my Windows 10 PC with its 22 GB RAM, although I can see from the Task Manager that they are using 0 CPU. Two-core Atlas tasks on the same PC run perfectly on its A10-6700 AMD CPU, and one-core Atlas tasks run on my 2 Linux boxes with their 8 GB RAM. What is the difference between Atlas tasks and all other LHC tasks (excluding SixTrack, which runs perfectly on all PCs)? Tullio |
Send message Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0 |
"I just wanted to bump this up and report that I've been seeing bursts of these same errors recently. Maybe I should be posting this in the LHCb forum, because, for me, they've been occurring primarily, if not exclusively, on LHCb tasks." Perhaps I should have added that I'm seeing this LHCb behavior on two 16GB RAM Linux hosts that are running two CMS, two Theory, and two LHCb tasks concurrently. No significant issues with CMS or Theory, as far as I can tell. |
Send message Joined: 18 Dec 15 Posts: 1686 Credit: 100,339,091 RAC: 101,922 |
As already posted in the LHCb section, this afternoon many of my LHCb tasks have errored out with either -152 (0xFFFFFF68) ERR_NETOPEN or 207 (0x000000CF) EXIT_NO_SUB_TASKS. Any idea what's going on at CERN? |
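
A side note on the ping test from the first post: ping uses ICMP, so a successful ping to vccondor01.cern.ch shows that the host answers, but not that TCP port 9618 accepts connections, which is what the `nc` test in the guest log checks. Below is a minimal Python sketch of a comparable TCP check. The function name `can_connect` is hypothetical, and the 30-second default timeout is an assumption read off the gap between the two log timestamps (16:37:21 and 16:37:51); this is not the script the VM actually runs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; the 30 s default is an assumption
    based on the timestamps in the guest log, not a documented value.
    """
    try:
        # create_connection resolves the host name and performs the TCP
        # handshake; the context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures are all OSError
        # subclasses, so they are reported uniformly as "not reachable".
        return False
```

Running `can_connect("vccondor01.cern.ch", 9618)` on an affected host around the time of a failure would show whether the port is reachable from outside the VM. A `False` result alongside a working ping would point at the port, or at something filtering it, rather than at general connectivity.
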
©2024 CERN