Thread '-152 (0xFFFFFF68) ERR

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1976 Credit: 159,763,349 RAC: 51,850	Message 44846 - Posted: 30 Apr 2021, 17:21:30 UTC
For the past few hours, tasks have been erroring out after several minutes with:
-152 (0xFFFFFF68) ERR_NETOPEN
2021-04-30 19:06:28 (12532): Guest Log: [ERROR] Could not connect to Condor server on port 9618.
For complete information see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=315834453

Erich56 Joined: 18 Dec 15 Posts: 1976 Credit: 159,763,349 RAC: 51,850	Message 44905 - Posted: 7 May 2021, 18:14:44 UTC
Within the past few hours, there were several cases with various errors:
1 (0x00000001) Unknown error code
2021-05-07 08:46:54 (16352): Guest Log: [ERROR] Condor ended after 21900 seconds.
-152 (0xFFFFFF68) ERR_NETOPEN
2021-05-07 20:02:03 (15476): VM Completion Message: Could not connect to Condor server on port 9618
207 (0x000000CF) EXIT_NO_SUB_TASKS
2021-05-07 09:32:09 (8380): VM Completion Message: No jobs were available to run.
In fact, I have never had this mix of failures within such a short time. What's happening back there?
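The hexadecimal value printed next to each exit code appears to be the same number viewed as an unsigned 32-bit integer, which is why -152 shows up as 0xFFFFFF68. A minimal Python sketch of that conversion follows; the helper name `to_hex32` is made up for illustration and is not part of BOINC.

```python
def to_hex32(code: int) -> str:
    """Format a signed exit code as an 8-digit, 32-bit two's-complement hex string."""
    # Masking with 0xFFFFFFFF maps negative values onto their unsigned 32-bit form.
    return f"0x{code & 0xFFFFFFFF:08X}"


# The three exit codes quoted in the post above.
for code in (1, -152, 207):
    print(code, to_hex32(code))
```

The three results match the pairs shown in the task logs: 0x00000001, 0xFFFFFF68 and 0x000000CF.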

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 925 Credit: 780,120,175 RAC: 130,998	Message 44906 - Posted: 7 May 2021, 18:49:02 UTC
I can imagine the backend crashed, with everyone getting 207 errors; I have had 350 since this morning.

Erich56 Joined: 18 Dec 15 Posts: 1976 Credit: 159,763,349 RAC: 51,850	Message 44907 - Posted: 7 May 2021, 19:31:15 UTC
Within the past hour, all newly started tasks errored out after about 8 minutes with:
-152 (0xFFFFFF68) ERR_NETOPEN
2021-05-07 20:26:56 (13276): VM Completion Message: Could not connect to Condor server on port 9618
Obviously the same problem as last weekend.

Erich56 Joined: 18 Dec 15 Posts: 1976 Credit: 159,763,349 RAC: 51,850	Message 44914 - Posted: 8 May 2021, 4:49:20 UTC - in response to Message 44907.
"Obviously the same problem as last weekend."
What I then saw this morning (and what also happened last weekend, also on other users' machines from what I remember): there were several tasks where suddenly, after a few hours, the task was no longer utilizing the CPU but continued running to the full time frame of 18 hours, as can be seen in this example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316164194
total runtime: 18 hours 7 minutes
CPU time: 6 hours 6 minutes
Does anyone have an explanation for this strange behaviour?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2726 Credit: 300,483,557 RAC: 53,076	Message 44915 - Posted: 8 May 2021, 7:14:46 UTC - in response to Message 44914.
ERR_NETOPEN points to network timing problems. An old VBox version (5.2.8) may be part of the problem: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10555784
"total runtime: 18 hours 7 minutes CPU time: 6 hours 6 minutes Does anyone have an explanation for this strange behaviour?"
Unlike ATLAS, CMS uploads intermediate results from within the VM (without using the BOINC client). While uploads are in progress, CPU usage is very low. The same happens while a VM does the setup for a fresh subtask. If either your own LAN (Wi-Fi based?) or the upload connection to CERN is overloaded, the uploads/downloads take very long.

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 44916 - Posted: 8 May 2021, 10:10:56 UTC - in response to Message 44907.
"Within the past hour, all newly started tasks errored out after about 8 minutes with: -152 (0xFFFFFF68) ERR_NETOPEN 2021-05-07 20:26:56 (13276): VM Completion Message: Could not connect to Condor server on port 9618"
I haven't seen the problem at all (on Ubuntu, if that matters). CPU usage is around 95% for CMS, so the CPU run time is normal, though the squid proxy helps a bit. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10687557&offset=0&show_names=0&state=4&appid=11
But I am about to shut that machine down for the summer, so I won't be getting much more data. Good luck.

Erich56 Joined: 18 Dec 15 Posts: 1976 Credit: 159,763,349 RAC: 51,850	Message 44917 - Posted: 8 May 2021, 12:16:22 UTC - in response to Message 44915.
"ERR_NETOPEN points to network timing problems. An old VBox version (5.2.8) may be part of the problem: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10555784 'total runtime: 18 hours 7 minutes CPU time: 6 hours 6 minutes Does anyone have an explanation for this strange behaviour?' Unlike ATLAS, CMS uploads intermediate results from within the VM (without using the BOINC client). While uploads are in progress, CPU usage is very low. The same happens while a VM does the setup for a fresh subtask. If either your own LAN (Wi-Fi based?) or the upload connection to CERN is overloaded, the uploads/downloads take very long."
The problem occurs with a very new VB version (6.1.18) as well - see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316167111
runtime 18 hours 7 minutes, CPU time 5 hours 5 minutes
3 PCs are connected via cable LAN, the notebook via WLAN. However, the connections are normally fine, and until last weekend this kind of problem did not happen at all.
Besides, it seems strange to me that there is a connection (to CERN) for a couple of hours upon start of a task, and then the task runs without connection for many hours, until the 18-hour time limit is reached (even with tasks which on a fast machine run for about 12 hours). Would one not assume that if there was any kind of connection problem, this would not take many hours? Well, maybe it does?
I am aware that while uploading interim results, CPU usage is low; but that's a matter of not even a minute (at least with my bandwidth providing an upload speed of 30 Mbit/s).

maeax Joined: 2 May 07 Posts: 2290 Credit: 178,872,776 RAC: 2,621	Message 44918 - Posted: 8 May 2021, 13:08:13 UTC - in response to Message 44917.
It's a Wi-Fi timeout after the first CMS job inside a CMS task. Tullio has the same problem. Why the Wi-Fi connection breaks after the first job, I have no idea.

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 44919 - Posted: 8 May 2021, 13:16:46 UTC - in response to Message 44917. Last modified: 8 May 2021, 13:17:08 UTC
"ERR_NETOPEN points to network timing problems."
On Windows, it could also be the anti-virus software. Even if you have the BOINC Data folder excluded, the "real time monitoring" often inspects the packets. It sometimes doesn't like something and shuts it down, or at least delays it for inspection.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2726 Credit: 300,483,557 RAC: 53,076	Message 44920 - Posted: 8 May 2021, 14:21:21 UTC - in response to Message 44917.
The task behind the link does not show the NETOPEN error.
Each task (CMS, ATLAS, Theory) opens/closes thousands of connections to transfer data over the network. As far as I understand, the error message appears when a fresh connection can't be established, of course after a couple of automatic retries that also fail. The cause can lie in the local network stack, a LAN network device, your internet router, or any other network device between the source and target system.
Since most of the NETOPEN errors are shown on just one of your computers, the cause is likely that computer or the way it's connected to your LAN.
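Whether a fresh TCP connection to a given port can be established is something that can be checked directly from the host. The Python sketch below tries to open a connection a few times before giving up, similar in spirit to the retry behaviour described above; it is not the code the VM actually runs. The function name, retry count and timeouts are assumptions, and the Condor server's host name is not given in this thread, so the caller has to supply it.

```python
import socket
import time


def can_open_tcp(host: str, port: int, attempts: int = 3,
                 timeout: float = 5.0, pause: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            # create_connection resolves the name and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:  # covers refusals, timeouts and DNS failures
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(pause)  # wait before the next try
    return False
```

A call such as `can_open_tcp(host, 9618)` with the real server name returns True only if the TCP handshake on that port completes, which is a different test from a plain ping (ICMP).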

Erich56 Joined: 18 Dec 15 Posts: 1976 Credit: 159,763,349 RAC: 51,850	Message 44921 - Posted: 9 May 2021, 5:18:25 UTC - in response to Message 44920.
"The task behind the link does not show the NETOPEN error. Each task (CMS, ATLAS, Theory) opens/closes thousands of connections to transfer data over the network. As far as I understand, the error message appears when a fresh connection can't be established, of course after a couple of automatic retries that also fail. The cause can lie in the local network stack, a LAN network device, your internet router, or any other network device between the source and target system. Since most of the NETOPEN errors are shown on just one of your computers, the cause is likely that computer or the way it's connected to your LAN."
Meanwhile, I am having the same problem on all machines running CMS - for example see this one: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10679599
The problem occurs with the following pattern:
- if the WU cannot connect to Condor right at the beginning, it errors out after some 8 minutes.
- if the WU "survives" this initial phase, the connection gets lost some time later during processing of the WU; this can be after a short time, or even after several hours. In such a case, the WU is not terminated, but runs until the 18-hour limit is reached and then finishes, even earning credit points. Any WU downloaded after that, though, does not get a connection to Condor to begin with and hence fails after some 8 minutes.
I recently ran ATLAS, no problem with that. Also, no problem with Theory, regardless of how many WUs I run on all of my machines.
From what I can see (unless I am mistaken): neither ATLAS nor Theory uses Condor.
So the problem seems to exist between here (local network stack, LAN device, internet router, ... ???) and Condor; only Condor. God knows why :-(

Erich56 Joined: 18 Dec 15 Posts: 1976 Credit: 159,763,349 RAC: 51,850	Message 44922 - Posted: 9 May 2021, 10:55:13 UTC - in response to Message 44921.
"- if the WU 'survives' this initial phase, the connection gets lost some time later during processing of the WU; this can be after a short time, or even after several hours. In such a case, the WU is not terminated, but runs until the 18-hour limit is reached and then finishes, even earning credit points."
Just to illustrate what I am talking about, a selection from 3 different machines:
This is a task from a very fast machine which normally finishes a task in less than 12 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316224656
total runtime 18 hours 6 minutes; CPU time 1 hour 15 minutes; 625.99 credits
This is a task from a machine which normally finishes a task within 15 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316221573
total runtime 18 hours 8 minutes; CPU time 5 hours 8 minutes; 506.70 credits
This is a task from a rather slow machine which normally finishes a task in close to 18 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=316221383
total runtime 18 hours 25 minutes; CPU time 9 hours 37 minutes; 468.11 credits
Just for testing purposes, I pinged Condor numerous times on all these machines - it always succeeded.
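The gap between run time and CPU time in the three results above can be put into numbers. The following Python sketch computes the fraction of wall-clock time each task spent on the CPU, using only the figures quoted in this post; the function names are illustrative.

```python
def minutes(hours: int, mins: int) -> int:
    """Convert an 'H hours M minutes' figure to whole minutes."""
    return hours * 60 + mins


def cpu_fraction(runtime_min: int, cpu_min: int) -> float:
    """Fraction of the wall-clock run time during which the task used the CPU."""
    return cpu_min / runtime_min


# (total runtime, CPU time) per result ID, as quoted in the post above.
tasks = {
    "316224656": (minutes(18, 6), minutes(1, 15)),
    "316221573": (minutes(18, 8), minutes(5, 8)),
    "316221383": (minutes(18, 25), minutes(9, 37)),
}

for result_id, (runtime, cpu) in tasks.items():
    print(f"result {result_id}: {cpu_fraction(runtime, cpu):.0%} of run time on the CPU")
```

This gives roughly 7%, 28% and 52%, compared with the CPU usage of around 95% that Jim1348 reports earlier in the thread for an unaffected CMS host.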