Message boards : LHCb Application : -152 (0xFFFFFF68) ERR_NETOPEN

Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,684,204
RAC: 121,595
Message 33678 - Posted: 6 Jan 2018, 6:44:34 UTC

lately, some of my LHCb tasks failed after about 7 minutes with

-152 (0xFFFFFF68) ERR_NETOPEN

stderr says: [ERROR] Could not connect to Condor server on port 9618

In the past, I experienced the same problem many times with CMS tasks, which apparently also connect to the Condor server, and this seems to fail from time to time.
Any idea why this happens? A flaw with the Condor server?
ID: 33678
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,684,204
RAC: 121,595
Message 33684 - Posted: 6 Jan 2018, 10:55:16 UTC - in response to Message 33678.  

a few minutes ago, another task failed with the same error code as quoted above.

Besides, two other tasks failed with error code: 207 (0x000000CF) EXIT_NO_SUB_TASKS
in STDERR it says: [ERROR] No jobs were available to run

so, obviously, there are enough tasks for download, but no jobs to be crunched by these tasks :-(
An experience I have often had with CMS :-(
ID: 33684
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,684,204
RAC: 121,595
Message 33687 - Posted: 6 Jan 2018, 13:12:26 UTC

meanwhile, I got several more task failures with either

-152 (0xFFFFFF68) ERR_NETOPEN
or
207 (0x000000CF) EXIT_NO_SUB_TASKS
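
For reference, the two numbers on each error line appear to be the same exit code written twice: once as a signed decimal and once as 32-bit hexadecimal. A minimal sketch of that conversion, assuming the hex form is the 32-bit two's-complement representation (this matches both codes quoted above); the function name `exit_code_hex` is made up for illustration:

```python
def exit_code_hex(code: int) -> str:
    """Format a signed exit code as 8-digit, 32-bit two's-complement hex."""
    # Masking with 0xFFFFFFFF maps negative values into the unsigned 32-bit range.
    return f"0x{code & 0xFFFFFFFF:08X}"


# The two codes from this thread:
print(exit_code_hex(-152))  # ERR_NETOPEN
print(exit_code_hex(207))   # EXIT_NO_SUB_TASKS
```

So the negative sign on -152 is simply how the wrapped 32-bit value reads when interpreted as a signed integer.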

what's going on at CERN? Is the system on the verge of a breakdown?
ID: 33687
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,684,204
RAC: 121,595
Message 33691 - Posted: 6 Jan 2018, 18:57:27 UTC

the failures cited above are becoming more and more frequent.
No idea what problems persist at CERN.

I will cease crunching as I do not want to waste my processors and my electricity for nothing.
Really annoying :-(
ID: 33691
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 33693 - Posted: 6 Jan 2018, 21:56:59 UTC - in response to Message 33678.  
Last modified: 6 Jan 2018, 22:53:49 UTC

> lately, some of my LHCb tasks failed after about 7 minutes with
>
> -152 (0xFFFFFF68) ERR_NETOPEN
>
> stderr says: [ERROR] Could not connect to Condor server on port 9618
>
> In the past, I experienced the same problem many times with CMS tasks, which apparently also connect to the Condor server, and this seems to fail from time to time.
> Any idea why this happens? A flaw with the Condor server?

I don't see it here on LHCb.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10501634&offset=0&show_names=0&state=0&appid=12

And I looked at a few on CMS, and don't see it there either.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10501634&offset=0&show_names=0&state=0&appid=11

Maybe it is a network problem that somehow affects that port?
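
One rough way to test that from the affected machine is to attempt a plain TCP connection to port 9618 and see whether it succeeds within a timeout. The sketch below assumes you know the Condor server's hostname, which is not given in this thread; `condor.example.org` is a placeholder, and `can_connect` is a made-up helper name. A successful connect only shows that the port is reachable from the host. It does not prove the Condor service is healthy, and the VM's network path may differ from the host's.

```python
import socket


def can_connect(host: str, port: int = 9618, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


# Example call (placeholder hostname, replace with the real server name):
# print(can_connect("condor.example.org"))
```

Running it several times over a few hours might show whether the failures are intermittent, which would fit the pattern of only some tasks failing.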

EDIT: There was one today on LHCb.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=172154298
ID: 33693
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,684,204
RAC: 121,595
Message 33696 - Posted: 7 Jan 2018, 7:29:07 UTC - in response to Message 33693.  

> EDIT: There was one today on LHCb.
> https://lhcathome.cern.ch/lhcathome/result.php?resultid=172154298
2018-01-06 05:55:21 (4990): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2018-01-06 05:55:21 (4990): Guest Log: [INFO] Shutting Down.
2018-01-06 05:55:21 (4990): VM Completion File Detected.
2018-01-06 05:55:21 (4990): VM Completion Message: Could not connect to Condor server on port 9618
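
To find such entries across many task outputs, the `[ERROR]` guest-log lines can be pulled out with a small script. This is only a sketch: the regular expression is inferred from the four lines quoted above, not from any documented log format, and the names `LINE_RE` and `guest_errors` are made up for illustration. The number in parentheses is captured but its meaning is not assumed.

```python
import re

# Pattern inferred from the quoted lines: "<date> <time> (<number>): Guest Log: [LEVEL] text"
LINE_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<num>\d+)\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<text>.*)$"
)


def guest_errors(stderr_text: str) -> list[tuple[str, str]]:
    """Return (timestamp, message) pairs for every guest-log line at ERROR level."""
    errors = []
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            errors.append((match.group("time"), match.group("text")))
    return errors


SAMPLE = """\
2018-01-06 05:55:21 (4990): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2018-01-06 05:55:21 (4990): Guest Log: [INFO] Shutting Down.
2018-01-06 05:55:21 (4990): VM Completion File Detected.
"""

for when, message in guest_errors(SAMPLE):
    print(when, message)
```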


Jim, this is exactly the type of error I have gotten so many times, also with CMS tasks. There seems to have been some kind of persistent problem with the Condor server for a long time now :-(
ID: 33696
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 33697 - Posted: 7 Jan 2018, 7:58:22 UTC - in response to Message 33696.  

> Jim, this is exactly the type of error I have gotten so many times, also with CMS tasks. There seems to have been some kind of persistent problem with the Condor server for a long time now :-(

Maybe it is related to path delay if they use some sort of triggered port on their server (?). The timing must be critical for there to be a difference between us.
ID: 33697
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,684,204
RAC: 121,595
Message 33698 - Posted: 7 Jan 2018, 8:05:38 UTC - in response to Message 33697.  

> Maybe it is related to path delay if they use some sort of triggered port on their server (?).
hm, maybe so ...
I already brought this problem to the attention of Ivan (CMS guy); he said he'll try to have someone look into it some time.

It's frustrating when so many tasks fail due to some flaw with the Condor Server.
ID: 33698


