Message boards : CMS Application : -152 (0xFFFFFF68) ERR_NETOPEN

Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,364,378
RAC: 101,857
Message 44846 - Posted: 30 Apr 2021, 17:21:30 UTC

For the past few hours, tasks have been erroring out after several minutes with

-152 (0xFFFFFF68) ERR_NETOPEN

2021-04-30 19:06:28 (12532): Guest Log: [ERROR] Could not connect to Condor server on port 9618.

for complete information see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=315834453
ID: 44846
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,364,378
RAC: 101,857
Message 44905 - Posted: 7 May 2021, 18:14:44 UTC

within the past few hours, there were several cases with various errors:

1 (0x00000001) Unknown error code
2021-05-07 08:46:54 (16352): Guest Log: [ERROR] Condor ended after 21900 seconds.

-152 (0xFFFFFF68) ERR_NETOPEN
2021-05-07 20:02:03 (15476): VM Completion Message: Could not connect to Condor server on port 9618

207 (0x000000CF) EXIT_NO_SUB_TASKS
2021-05-07 09:32:09 (8380): VM Completion Message: No jobs were available to run.
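
The hexadecimal value shown with each of these codes is the same exit code viewed as an unsigned 32-bit integer, which is why a negative code such as -152 appears as 0xFFFFFF68. A minimal Python sketch of that conversion, using only the three codes listed above (the helper names are illustrative and not part of BOINC):

```python
def to_hex32(code: int) -> str:
    """Format a signed exit code as an unsigned 32-bit hex string."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def from_hex32(text: str) -> int:
    """Interpret a 32-bit hex string as a signed (two's complement) integer."""
    value = int(text, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


# The three exit codes reported in this post.
for code in (1, -152, 207):
    print(code, to_hex32(code))
```

Converting in the other direction, `from_hex32("0xFFFFFF68")` gives back -152.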

In fact, I have never had this mix of failures within such a short time.
What's happening back there?
ID: 44905
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,727,650
RAC: 234,163
Message 44906 - Posted: 7 May 2021, 18:49:02 UTC

I can imagine the backend crashed, with everyone getting 207 errors; I have had 350 since this morning.
ID: 44906
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,364,378
RAC: 101,857
Message 44907 - Posted: 7 May 2021, 19:31:15 UTC

Within the past hour, all newly started tasks errored out after about 8 minutes with:

-152 (0xFFFFFF68) ERR_NETOPEN
2021-05-07 20:26:56 (13276): VM Completion Message: Could not connect to Condor server on port 9618

obviously same problem as last weekend.
ID: 44907
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,364,378
RAC: 101,857
Message 44914 - Posted: 8 May 2021, 4:49:20 UTC - in response to Message 44907.  

obviously same problem as last weekend.
What I then saw this morning (and what also happened last weekend, on other users' machines too, from what I remember): several tasks suddenly stopped utilizing the CPU after a few hours, but continued running for the full time frame of 18 hours, as can be seen in this example:


https://lhcathome.cern.ch/lhcathome/result.php?resultid=316164194

total runtime: 18 hours 7 minutes
CPU time: 6 hours 6 minutes

Does anyone have an explanation for this strange behaviour?
ID: 44914
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,924,304
RAC: 137,677
Message 44915 - Posted: 8 May 2021, 7:14:46 UTC - in response to Message 44914.  

ERR_NETOPEN points to network timing problems.

An old VBox version (5.2.8) may be part of the problem:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10555784


total runtime: 18 hours 7 minutes
CPU time: 6 hours 6 minutes

Does anyone have an explanation for this strange behaviour?

Unlike ATLAS, CMS uploads intermediate results from within the VM (without using the BOINC client).
While uploads are in progress, CPU usage is very low.
The same happens while a VM does the setup for a fresh subtask.
If either your own LAN (Wi-Fi based?) or the upload connection to CERN is overloaded, the uploads/downloads take very long.
ID: 44915
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 44916 - Posted: 8 May 2021, 10:10:56 UTC - in response to Message 44907.  

Within the past hour, all newly started tasks errored out after about 8 minutes with:

-152 (0xFFFFFF68) ERR_NETOPEN
2021-05-07 20:26:56 (13276): VM Completion Message: Could not connect to Condor server on port 9618

I haven't seen the problem at all (on Ubuntu, if that matters).
The CPU is around 95% for CMS, so the CPU run time is normal, though the squid proxy helps a bit.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10687557&offset=0&show_names=0&state=4&appid=11

But I am about to shut that machine down for the summer, so I won't be getting much more data. Good luck.
ID: 44916
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,364,378
RAC: 101,857
Message 44917 - Posted: 8 May 2021, 12:16:22 UTC - in response to Message 44915.  

ERR_NETOPEN points out network timing problems.

An old VBox version (5.2.8) may be part of the problem:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10555784


total runtime: 18 hours 7 minutes
CPU time: 6 hours 6 minutes

Does anyone have an explanation for this strange behaviour?

Unlike ATLAS CMS uploads intermediate results from within the VM (without using the BOINC client).
While uploads are in progress CPU usage is very low.
Same happens while a VM does the setup for a fresh subtask.
If either your own LAN (wi-fi based?) or the upload connection to CERN is overloaded the uploads/downloads take very long.

the problem occurs with a very new VB version (6.1.18) as well - see here
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316167111
runtime 18 hours 7 minutes, CPU time 5 hours 5 minutes

3 PCs are connected via cable-LAN, the notebook via WLAN. However, the connections normally are very okay, and till last weekend this kind of problem did not happen at all.
Besides, it seems strange to me that there is a connection (to CERN) for a couple of hours after a task starts, and then the task runs without a connection for many hours, until the 18-hour time limit is reached (even with tasks which on a fast machine run for about 12 hours). Would one not assume that a connection problem of any kind would not last for many hours? Well, maybe it does?

I am aware that while uploading interim results, CPU usage is low; but that's a matter of not even a minute (at least with my bandwidth providing an upload speed of 30 Mbit/s).
ID: 44917
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,093,506
RAC: 103,308
Message 44918 - Posted: 8 May 2021, 13:08:13 UTC - in response to Message 44917.  

It's a Wi-Fi timeout after the first CMS job inside a CMS task.
Tullio has the same problem.
Why the Wi-Fi connection breaks after the first job, I have no idea.
ID: 44918
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 44919 - Posted: 8 May 2021, 13:16:46 UTC - in response to Message 44917.  
Last modified: 8 May 2021, 13:17:08 UTC

ERR_NETOPEN points out network timing problems.

On Windows, it could also be the anti-virus software. Even if you have the BOINC Data folder excluded, the "real time monitoring" often inspects the packets.
It sometimes doesn't like something and shuts it down, or at least delays it for inspection.
ID: 44919
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,924,304
RAC: 137,677
Message 44920 - Posted: 8 May 2021, 14:21:21 UTC - in response to Message 44917.  

The task behind the link does not show the NETOPEN error.

Each task (CMS, ATLAS, Theory) opens/closes thousands of connections to transfer data over the network.
As far as I understand the error message appears when a fresh connection can't be established, of course after a couple of automatic retries that also fail.
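
As a rough illustration of that behaviour (not the actual CMS or Condor client code), the sketch below tries to open a TCP connection a few times and gives up only after every attempt has failed. The function name, the number of attempts, and the delay are assumptions chosen for the example.

```python
import socket
import time


def connect_with_retries(host, port, attempts=3, delay=1.0, timeout=5.0):
    """Return a connected socket, or None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # Covers refused connections, timeouts, and DNS failures.
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    return None
```

A caller would treat a `None` result as the point at which an error like the one above gets reported.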

The reason can be located on the local network stack, a LAN network device, your internet router, or any other network device between the source and target system.

Since most of the NETOPEN errors are shown on just 1 of your computers, the likely cause is that computer or the way it's connected to your LAN.
ID: 44920
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,364,378
RAC: 101,857
Message 44921 - Posted: 9 May 2021, 5:18:25 UTC - in response to Message 44920.  

The task behind the link does not show the NETOPEN error.

Each task (CMS, ATLAS, Theory) opens/closes thousands of connections to transfer data over the network.
As far as I understand the error message appears when a fresh connection can't be established, of course after a couple of automatic retries that also fail.

The reason can be located on the local network stack, a LAN network device, your internet router, or any other network device between the source and target system.

Since most of the NETOPEN errors are shown on just 1 of your computers it's likely that computer or the way it's connected to your LAN.
Meanwhile, I am having the same problem on all machines running CMS - for example see this one:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10679599

the problem occurs with the following pattern:

- if the WU cannot connect to condor right at the beginning, it errors out after some 8 minutes.

- if the WU "survives" this initial phase, the connection gets lost some time later during processing of the WU; this can be after a short time, or even after several hours. In such a case, the WU is not terminated, but runs until the 18-hour limit is reached and then finishes, even earning credit points.
Any WU downloaded after that, though, does not get a connection to Condor to begin with and hence fails after some 8 minutes.

I recently ran ATLAS, no problem with that.
Also, no problem with Theory, regardless of how many WUs I run on all of my machines.
From what I can see (unless I am mistaken), neither ATLAS nor Theory uses Condor. So the problem seems to exist between here (local network stack, LAN device, internet router, ... ???) and Condor; only Condor. God knows why :-(
ID: 44921
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,364,378
RAC: 101,857
Message 44922 - Posted: 9 May 2021, 10:55:13 UTC - in response to Message 44921.  

- if the WU "survives" this initial phase, the connection gets lost some time later during processing of the WU, this can be after short time, or even after several hours. In such a case, the WU is not being terminated, but runs until the 18 hours' limit is reached and then finishes even with earning credit points.
just to illustrate what I am talking about, a selection from 3 different machines:

this is a task from a very fast machine which normally finishes a task in less than 12 hours:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316224656
total runtime 18 hours 6 minutes; CPU time 1 hour 15 minutes; 625.99 credits

this is a task from a machine which normally finishes a task within 15 hours:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316221573
total runtime 18 hours 8 minutes; CPU time 5 hours 8 minutes; 506.70 credits

this is a task from a rather slow machine which normally finishes a task in close to 18 hours:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=316221383
total runtime 18 hours 25 minutes; CPU time 9 hours 37 minutes; 468.11 credits
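
Expressed as a share of wall-clock time, the three tasks above kept the CPU busy for only about 7%, 28% and 52% of their runtime. A small Python sketch of that arithmetic, using only the figures quoted above (the helper names are illustrative):

```python
def minutes(hours: int, mins: int) -> int:
    """Convert an hours/minutes pair to total minutes."""
    return hours * 60 + mins


# result ID -> (total runtime, CPU time), both in minutes
tasks = {
    "316224656": (minutes(18, 6), minutes(1, 15)),
    "316221573": (minutes(18, 8), minutes(5, 8)),
    "316221383": (minutes(18, 25), minutes(9, 37)),
}


def cpu_share(runtime_min: int, cpu_min: int) -> float:
    """CPU time as a percentage of total runtime."""
    return 100.0 * cpu_min / runtime_min


for result_id, (runtime_min, cpu_min) in tasks.items():
    print(f"{result_id}: {cpu_share(runtime_min, cpu_min):.1f}% CPU")
```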

Just for testing purposes, I pinged the Condor server numerous times from all these machines; it always succeeded.
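
A successful ping only shows that the host answers ICMP echo requests; it does not show that a TCP connection to port 9618 can be opened. A minimal sketch of such a port check in Python follows. The host name in it is a placeholder, since the actual Condor server address is not given in this thread, and the function name is an assumption for the example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name did not resolve.
        return False


if __name__ == "__main__":
    # Placeholder host: replace with the Condor server name from the task log.
    CONDOR_HOST = "condor-server.invalid"
    print(tcp_port_open(CONDOR_HOST, 9618))
```

Running this at the time tasks fail, rather than afterwards, would say more about whether the port is reachable at that moment.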
ID: 44922

©2024 CERN