Message boards : Number crunching : -152 (0xFFFFFF68) ERR_NETOPEN and 206 (0x000000CE) EXIT_INIT_FAILURE

Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 33007 - Posted: 6 Nov 2017, 20:21:11 UTC

For about 10 days now, CMS and LHCb tasks have occasionally errored out after about 10-14 minutes, with the task log showing either
-152 (0xFFFFFF68) ERR_NETOPEN or
206 (0x000000CE) EXIT_INIT_FAILURE.

This happens on all 3 of my PCs with which I am crunching the VBox projects CMS and LHCb. It does NOT happen with ATLAS.
The 3 PCs have different operating systems and different versions of VBox (the oldest is 5.1.6, the newest is 5.2.0).

Until about 10 days ago, this problem did not occur. A few other people have also reported the same problems in the CMS and/or LHCb threads.

With -152 (0xFFFFFF68) ERR_NETOPEN, a typical note in the stderr is:
[ERROR] Could not connect to Condor server on port 9618
For an example, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163791182

With 206 (0x000000CE) EXIT_INIT_FAILURE, a typical note in the stderr is:
[ERROR] Condor exited after 966s without running a job
For an example, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163697914

And then, as I have just noticed, there is a third type of error:
207 (0x000000CF) EXIT_NO_SUB_TASKS
Here, a typical note in the stderr is:
**** condor_startd (condor_STARTD) pid 4116 EXITING WITH STATUS 0
[ERROR] No jobs were available to run.
For an example, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163788045
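
To make the three codes easier to compare, here is a minimal Python sketch that records only the codes and stderr notes quoted in this post. It also shows why -152 is displayed as 0xFFFFFF68: that is the 32-bit two's complement form of the negative number. The names `KNOWN`, `as_hex32` and `classify` are made up for this illustration and are not part of BOINC or the vboxwrapper.

```python
from typing import NamedTuple, Optional


class ExitCode(NamedTuple):
    code: int         # signed value as shown in the task log
    name: str         # symbolic name as shown in the task log
    stderr_hint: str  # fragment of the typical stderr note quoted above


# Only the three codes quoted in this post; this is not a complete list.
KNOWN = [
    ExitCode(-152, "ERR_NETOPEN", "Could not connect to Condor server"),
    ExitCode(206, "EXIT_INIT_FAILURE", "without running a job"),
    ExitCode(207, "EXIT_NO_SUB_TASKS", "No jobs were available to run"),
]


def as_hex32(code: int) -> str:
    """Format a signed exit code as 32-bit two's complement hex."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def classify(stderr_text: str) -> Optional[ExitCode]:
    """Return the first known code whose hint appears in the stderr text."""
    for entry in KNOWN:
        if entry.stderr_hint in stderr_text:
            return entry
    return None
```

For instance, `as_hex32(-152)` gives the same hex string that the task log prints next to ERR_NETOPEN, and `classify(...)` on a saved stderr text returns the matching entry or `None` if none of the three notes is present.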

Maybe there is an intermittent connection problem with the Condor server, since the error log often says "[ERROR] Could not connect to Condor server on port 9618".
I therefore ran "ping vccondor01.cern.ch" numerous times on all those PCs, and it worked every time.
So the problem must be somewhere else.
Does anyone have any idea what it could be?
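
One caveat on the ping test: ping sends ICMP echo requests, while the error message is about a connection on port 9618, so a successful ping does not by itself show that the port is reachable. A closer check is to try opening a TCP connection to that port. Below is a minimal Python sketch of such a check; the function name `can_connect` and the 30-second default timeout are assumptions chosen for illustration, and the test runs from the host, not from inside the VM.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False
```

Calling `can_connect("vccondor01.cern.ch", 9618)` a number of times over a day would show whether the port is only intermittently unreachable from a given PC.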
ID: 33007
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,679,969
RAC: 235,452
Message 33008 - Posted: 6 Nov 2017, 22:00:54 UTC - in response to Message 33007.  
Last modified: 6 Nov 2017, 22:16:19 UTC

I don't think ATLAS uses Condor for work submission?

I've seen 206 sporadically for a long time. This is a CERN issue when there is no work on their side: BOINC will create WUs even if there is nothing for them to do. 207 looks the same, although I would say it ran something and then couldn't get more work.

I don't know how the Condor queues are filled, or how to see their state other than on the page https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php

For me, I see about 20% failures on CMS and 18% on Theory. I don't know what error rate the project, and we, would consider acceptable.

Also, each 206/207 error wastes only about 10 minutes, so it's not a huge waste of computer resources; it just makes it difficult to spot a real error if there is one.
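
For anyone who wants to work out the same kind of percentages from their own task list, here is a minimal Python sketch. It assumes each finished task is given as an (application, exit code) pair and that exit code 0 means success; the function name `failure_rates` is made up for this illustration.

```python
from collections import Counter


def failure_rates(results):
    """Return {application: percent failed} for (application, exit_code) pairs.

    Assumption: an exit code of 0 counts as success, anything else as a failure.
    """
    total = Counter()
    failed = Counter()
    for app, code in results:
        total[app] += 1
        if code != 0:
            failed[app] += 1
    return {app: 100.0 * failed[app] / total[app] for app in total}
```

Filtering the input to codes 206 and 207 first would separate the "no work available" cases from other errors.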
ID: 33008
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 33009 - Posted: 7 Nov 2017, 6:14:59 UTC

Toby, many thanks for your comments and thoughts. I fully agree.

The only thing that made me wonder is that here, these problems never occurred before. These types of error reports were unknown to me until 10 or 14 days ago. So perhaps I was only lucky so far.

And yes, you're also right when saying that ATLAS does NOT use Condor (that's why such problems did not show up with ATLAS).
ID: 33009
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,679,969
RAC: 235,452
Message 33010 - Posted: 7 Nov 2017, 7:35:12 UTC

I asked the CERN team last night. They are looking at it, but there is nothing to report so far.

I found some stats from a while ago showing about the same level of 206 errors in January, so it has crept up a few % since then, although it could have dropped a lot in between without my noticing, with all the SixTrack issues we have.
ID: 33010
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 33011 - Posted: 7 Nov 2017, 9:21:22 UTC

All LHC tasks except ATLAS and SixTrack have problems on my 3 PCs (two Linux and one Windows 10). They do not fail outright, but the CPUs are doing nothing, as I can see from the Task Manager on Windows and the "top" command on Linux.
Tullio
ID: 33011
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,148,997
RAC: 15,990
Message 33012 - Posted: 7 Nov 2017, 10:27:40 UTC

If you look at the LHCb tasks graph here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you'll see that every day there is a significant drop in the number of successful jobs and correspondingly a surge of failed jobs at the same time. I wonder if these are related to the Condor problems we are seeing.
ID: 33012
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 33013 - Posted: 7 Nov 2017, 11:29:28 UTC - in response to Message 33012.  

If you look at the LHCb tasks graph here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you'll see that every day there is a significant drop in the number of successful jobs and correspondingly a surge of failed jobs at the same time. I wonder if these are related to the Condor problems we are seeing.

This really looks strange, indeed.
I wonder whether this has caught anyone's eye at CERN yet.
ID: 33013
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,679,969
RAC: 235,452
Message 33014 - Posted: 7 Nov 2017, 16:49:42 UTC

It has their attention :)
ID: 33014
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 33075 - Posted: 18 Nov 2017, 15:50:09 UTC

The ERR_NETOPEN failures still occur, due to no connection to the Condor server:

2017-11-18 16:37:21 (2688): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] 1
2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [INFO] Shutting Down

Hopefully, the people at CERN will find out one day what the problem is.
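
The timestamps in this excerpt show the connection test starting at 16:37:21 and the error appearing at 16:37:51, i.e. 30 seconds later. As a rough aid for checking other stderr logs in the same way, here is a minimal Python sketch that parses "Guest Log" lines of the form shown above and measures the time from the first parsed line to the first [ERROR] line. The regular expression only covers the line format seen in this excerpt, and the function names are made up for this illustration.

```python
import re
from datetime import datetime

# Matches lines like:
# 2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def parse_guest_log(text):
    """Yield (timestamp, level, message) for each matching Guest Log line."""
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            yield ts, m["level"], m["msg"]


def seconds_until_first_error(text):
    """Seconds from the first parsed line to the first [ERROR] line, or None."""
    entries = list(parse_guest_log(text))
    for ts, level, _msg in entries:
        if level == "ERROR":
            return (ts - entries[0][0]).total_seconds()
    return None


# The excerpt quoted above, used as sample input.
SAMPLE = """\
2017-11-18 16:37:21 (2688): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] 1
2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [INFO] Shutting Down
"""
```

Running `seconds_until_first_error(SAMPLE)` on the excerpt reproduces the 30-second gap between the start of the test and the error.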
ID: 33075
ritterm

Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 33158 - Posted: 29 Nov 2017, 14:52:21 UTC - in response to Message 33075.  

The ERR_NETOPEN failures still occur, due to no connection to the Condor server:

2017-11-18 16:37:21 (2688): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-18 16:37:51 (2688): Guest Log: [DEBUG] 1
2017-11-18 16:37:51 (2688): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-18 16:37:51 (2688): Guest Log: [INFO] Shutting Down

Hopefully, the people at CERN will find out one day what the problem is.

I just wanted to bump this up and report that I've been seeing bursts of these same errors recently. Maybe I should be posting this in the LHCb forum, because, for me, they've been occurring primarily, if not exclusively, on LHCb tasks.
ID: 33158
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 33162 - Posted: 29 Nov 2017, 16:17:26 UTC

I am letting an LHCb task and a Theory task run on my Windows 10 PC with its 22 GB RAM, although I can see from the Task Manager that they are using 0 CPU. Two-core ATLAS tasks on the same PC run perfectly on its A10-6700 AMD CPU, and one-core ATLAS tasks run on my 2 Linux boxes with their 8 GB RAM. What is the difference between ATLAS tasks and all other LHC tasks (excluding SixTrack, which runs perfectly on all PCs)?
Tullio
ID: 33162
ritterm

Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 33163 - Posted: 29 Nov 2017, 16:29:47 UTC - in response to Message 33158.  

I just wanted to bump this up and report that I've been seeing bursts of these same errors recently. Maybe I should be posting this in the LHCb forum, because, for me, they've been occurring primarily, if not exclusively, on LHCb tasks.

Perhaps I should have added that I'm seeing this LHCb behavior on two 16GB RAM Linux hosts that are running two CMS, two Theory, and two LHCb tasks concurrently. No significant issues with CMS or Theory, that I can tell.
ID: 33163
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 33688 - Posted: 6 Jan 2018, 13:15:10 UTC

As already posted in the LHCb section, this afternoon many of my LHCb tasks have errored out with either

-152 (0xFFFFFF68) ERR_NETOPEN
or
207 (0x000000CF) EXIT_NO_SUB_TASKS

Any idea what's going on at CERN?
ID: 33688



©2024 CERN