-152 (0xFFFFFF68) ERR_NETOPEN
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1623 Credit: 86,360,653 RAC: 141,042 |
For the past few hours, tasks have been erroring out after several minutes with: -152 (0xFFFFFF68) ERR_NETOPEN 2021-04-30 19:06:28 (12532): Guest Log: [ERROR] Could not connect to Condor server on port 9618. For complete information see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=315834453 |
Joined: 18 Dec 15 Posts: 1623 Credit: 86,360,653 RAC: 141,042 |
Within the past few hours, there were several cases with various errors: 1 (0x00000001) Unknown error code 2021-05-07 08:46:54 (16352): Guest Log: [ERROR] Condor ended after 21900 seconds. -152 (0xFFFFFF68) ERR_NETOPEN 2021-05-07 20:02:03 (15476): VM Completion Message: Could not connect to Condor server on port 9618 207 (0x000000CF) EXIT_NO_SUB_TASKS 2021-05-07 09:32:09 (8380): VM Completion Message: No jobs were available to run. In fact, I have never had this mix of failures within such a short time. What's happening back there? |
Joined: 27 Sep 08 Posts: 780 Credit: 621,002,756 RAC: 211,062 |
I can imagine the backend crashed, with everyone getting 207 errors; I have had 350 since this morning. |
Joined: 18 Dec 15 Posts: 1623 Credit: 86,360,653 RAC: 141,042 |
Within the past hour, all newly started tasks errored out after about 8 minutes with: -152 (0xFFFFFF68) ERR_NETOPEN 2021-05-07 20:26:56 (13276): VM Completion Message: Could not connect to Condor server on port 9618. Obviously the same problem as last weekend. |
Joined: 18 Dec 15 Posts: 1623 Credit: 86,360,653 RAC: 141,042 |
"Obviously the same problem as last weekend." What I then saw this morning (and what also happened last weekend, also on other users' machines from what I remember): there were several tasks where suddenly, after a few hours, the task was no longer using the CPU but continued running for the full time frame of 18 hours, as can be seen in this example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316164194 total runtime: 18 hours 7 minutes, CPU time: 6 hours 6 minutes. Does anyone have an explanation for this strange behaviour? |
Joined: 15 Jun 08 Posts: 2287 Credit: 208,525,595 RAC: 140,787 |
ERR_NETOPEN points to network timing problems. An old VBox version (5.2.8) may be part of the problem: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10555784 "total runtime: 18 hours 7 minutes" Unlike ATLAS, CMS uploads intermediate results from within the VM (without using the BOINC client). While uploads are in progress, CPU usage is very low. The same happens while a VM does the setup for a fresh subtask. If either your own LAN (Wi-Fi based?) or the upload connection to CERN is overloaded, the uploads/downloads take very long. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"Within the past hour, all newly started tasks errored out after about 8 minutes with:" I haven't seen the problem at all (on Ubuntu, if that matters). The CPU is around 95% for CMS, so the CPU run time is normal, though the squid proxy helps a bit. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10687557&offset=0&show_names=0&state=4&appid=11 But I am about to shut that machine down for the summer, so I won't be getting much more data. Good luck. |
Joined: 18 Dec 15 Posts: 1623 Credit: 86,360,653 RAC: 141,042 |
"ERR_NETOPEN points to network timing problems." The problem occurs with a very new VirtualBox version (6.1.18) as well, see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316167111 runtime 18 hours 7 minutes, CPU time 5 hours 5 minutes. Three PCs are connected via wired LAN, the notebook via WLAN. However, the connections are normally fine, and until last weekend this kind of problem did not happen at all. Besides, it seems strange to me that there is a connection (to CERN) for a couple of hours after a task starts, and then the task runs without a connection for many hours until the 18-hour time limit is reached (even with tasks which on a fast machine run for about 12 hours). Would one not assume that a connection problem of any kind would not last for many hours? Well, maybe it does. I am aware that CPU usage is low while interim results are uploading, but that takes less than a minute (at least with my upload bandwidth of 30 Mbit/s). |
Joined: 2 May 07 Posts: 1886 Credit: 146,177,884 RAC: 116,273 |
It's a Wi-Fi timeout after the first CMS job inside a CMS task. Tullio has the same problem. Why the Wi-Fi connection breaks after the first job, I have no idea. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"ERR_NETOPEN points to network timing problems." On Windows, it could also be the anti-virus software. Even if you have the BOINC Data folder excluded, the "real time monitoring" often inspects the packets. It sometimes doesn't like something and shuts it down, or at least delays it for inspection. |
Joined: 15 Jun 08 Posts: 2287 Credit: 208,525,595 RAC: 140,787 |
The task behind the link does not show the NETOPEN error. Each task (CMS, ATLAS, Theory) opens and closes thousands of connections to transfer data over the network. As far as I understand, the error message appears when a fresh connection can't be established, after a couple of automatic retries have also failed. The cause can lie in the local network stack, a LAN network device, your internet router, or any other network device between the source and target system. Since most of the NETOPEN errors appear on just one of your computers, the cause is likely that computer or the way it's connected to your LAN. |
Joined: 18 Dec 15 Posts: 1623 Credit: 86,360,653 RAC: 141,042 |
"The task behind the link does not show the NETOPEN error." Meanwhile, I am having the same problem on all machines running CMS, for example this one: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10679599 The problem occurs with the following pattern: - If the WU cannot connect to Condor right at the beginning, it errors out after some 8 minutes. - If the WU "survives" this initial phase, the connection gets lost some time later during processing of the WU; this can be after a short time or even after several hours. In such a case, the WU is not terminated but runs until the 18-hour limit is reached and then finishes, even earning credit. Any WU downloaded after that, though, does not get a connection to Condor to begin with and hence fails after some 8 minutes. I recently ran ATLAS with no problems. Theory has no problems either, regardless of how many WUs I run on all of my machines. From what I can see (unless I am mistaken), neither ATLAS nor Theory uses Condor. So the problem seems to exist between here (local network stack, LAN device, internet router, ... ???) and Condor; only Condor. God knows why :-( |
Joined: 18 Dec 15 Posts: 1623 Credit: 86,360,653 RAC: 141,042 |
"If the WU 'survives' this initial phase, the connection gets lost some time later during processing of the WU, this can be after short time, or even after several hours. In such a case, the WU is not being terminated, but runs until the 18 hours' limit is reached and then finishes even with earning credit points." Just to illustrate what I am talking about, here is a selection from 3 different machines. This is a task from a very fast machine which normally finishes a task in less than 12 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316224656 total runtime 18 hours 6 minutes; CPU time 1 hour 15 minutes; 625.99 credits. This is a task from a machine which normally finishes a task within 15 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316221573 total runtime 18 hours 8 minutes; CPU time 5 hours 8 minutes; 506.70 credits. This is a task from a rather slow machine which normally finishes a task in close to 18 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316221383 total runtime 18 hours 25 minutes; CPU time 9 hours 37 minutes; 468.11 credits. Just for testing purposes, I pinged Condor numerous times on all these machines; it always succeeded. |
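
A note on the ping test in the last post: ping only exercises ICMP, whereas the failing step reported in the logs is a connection to the Condor server on port 9618, so a successful ping does not show that the port is reachable. Below is a minimal sketch of a closer check, assuming Python 3 on the host and assuming the Condor connection is TCP-based. It tries to open a TCP connection and retries a few times, loosely mirroring the "couple of automatic retries" described earlier in the thread. The function name `can_connect`, its parameters, and its defaults are illustrative assumptions and not part of the CMS VM or BOINC code. The thread does not give the Condor server's hostname, so it has to be supplied by whoever runs the check.

```python
import socket
import time


def can_connect(host: str, port: int, attempts: int = 3,
                timeout: float = 5.0, pause: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within `attempts` tries.

    Hypothetical helper for illustration only; the retry count, timeout and
    pause are assumed values, not those used inside the CMS VM.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection completes a full TCP handshake,
            # which an ICMP ping does not test.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # OSError covers refused connections, timeouts and DNS failures.
            if attempt < attempts:
                time.sleep(pause)
    return False
```

Calling `can_connect(host, 9618)` with the actual Condor hostname returns `True` only if the port accepts a connection. Run from the host, this does not reproduce the VM's own network path exactly, so a `True` result narrows the problem down but does not rule out an issue inside the VM.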
©2023 CERN