Thread 'No connection to servers for some time during past few days'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1980 Credit: 160,774,960 RAC: 39,648	Message 32936 - Posted: 30 Oct 2017, 16:53:09 UTC As other members may also have noticed, during the past few days there have been several cases where the LHC homepage could not be reached, finished tasks could not be uploaded, and new tasks could not be downloaded. Has the source of this problem been identified yet? ID: 32936

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 32939 - Posted: 31 Oct 2017, 6:57:34 UTC - in response to Message 32936. Been having numerous errors as my one server ramps up on Theory. Some of these VM failures from 10/28 to 10/30 may then be connected to the potential server-side issues (I had been hypothesizing that my bandwidth to my ISP was too narrow, even after staggering the VMs' startup with previous jobs from another project in the queue).
"Guest Log: SECMAN: no classad from server, failing"
"Guest Log: [ERROR] Could not ping HTCondor."
"Guest Log: ERROR: couldn't locate (null)!"
"Guest Log: [ERROR] Could not ping HTCondor."
"Guest Log: SECMAN:2006:Failed to establish a crypto key."
"Guest Log: [ERROR] Could not ping HTCondor."
"Guest Log: [ERROR] Condor exited after 1741s without running a job."
"Guest Log: [ERROR] Could not connect to Condor server on port 9618"
"Guest Log: [ERROR] Could not connect to cern.ch on port 80"
"Guest Log: [ERROR] Could not connect to lhchomeproxy.cern.ch on port 3125"
"Guest Log: [ERROR] Could not connect to vccs1.cern.ch on port 443"
ID: 32939
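
Editor's sketch (not part of the original posts): one way to tell a local or ISP problem from a server-side one is to try opening plain TCP connections to the hosts and ports named in the guest log above. The snippet below is a minimal, hedged example using only the Python standard library. The function names `can_connect` and `check_endpoints` are made up for illustration, the 5-second timeout is an assumption, and the Condor server is omitted because the log does not give its hostname. A successful TCP connect only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket

# Host/port pairs copied from the guest log above. The Condor server's
# hostname is not given in the log, so it is deliberately left out.
ENDPOINTS = [
    ("cern.ch", 80),
    ("lhchomeproxy.cern.ch", 3125),
    ("vccs1.cern.ch", 443),
]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def check_endpoints(endpoints, timeout=5.0):
    """Map each (host, port) pair to the result of can_connect (hypothetical helper)."""
    return {(host, port): can_connect(host, port, timeout) for host, port in endpoints}
```

Calling `check_endpoints(ENDPOINTS)` from the affected machine at the time of a failure, and again from a different network, would give a rough indication of where the connection breaks.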

Nils Volunteer moderator Project administrator Project developer Project tester Joined: 15 Jul 05 Posts: 254 Credit: 6,001,083 RAC: 0	Message 32942 - Posted: 31 Oct 2017, 7:21:54 UTC - in response to Message 32936. Last modified: 31 Oct 2017, 7:23:31 UTC According to our monitoring and also tests from home, the LHC@home servers are fully reachable. There might be intermittent issues with your Internet provider. For the Theory Condor issue, we will check our logs from the last few days. ID: 32942

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 32943 - Posted: 31 Oct 2017, 7:31:56 UTC - in response to Message 32942. Quote: "According to our monitoring and also tests from home, the LHC@home servers are fully reachable. there might be intermittent issues with your Internet provider." Trying to run this many Theory WUs on my home 30 Mbit/s connection, I expected this limit might be reached. Will keep an eye on the failure rates. Quote: "For the Theory Condor issue, we will check our logs from the last days." For clarification, is "Guest Log: [ERROR] Condor exited after 1741s without running a job." the Theory Condor issue you refer to? ID: 32943

gyllic Joined: 9 Dec 14 Posts: 202 Credit: 2,660,212 RAC: 10	Message 32946 - Posted: 31 Oct 2017, 8:51:58 UTC - in response to Message 32942. Last modified: 31 Oct 2017, 8:55:01 UTC Quote: "According to our monitoring and also tests from home, the LHC@home servers are fully reachable. there might be intermittent issues with your Internet provider." I have seen the same behaviour that Erich described. The homepage was reachable, but not with the normal UI (just one "page" that gave you some information). It was written that the database server was currently down or not reachable, or something like that (if I remember correctly). It was also not possible to get new work in BOINC during that time. That happened at least 2 times during the last week or so. ID: 32946

Nils Volunteer moderator Project administrator Project developer Project tester Joined: 15 Jul 05 Posts: 254 Credit: 6,001,083 RAC: 0	Message 32952 - Posted: 31 Oct 2017, 15:38:21 UTC - in response to Message 32946. Quote: "... was written that the database server is currently down or not reachable or something like that (if i remember it correctly). It was also not possible to get new work in BOINC during that time. That happened at least 2 times during the last week or so." Thanks, that is more helpful. Every day at 17:00 CET (16:00 UTC) there is a scheduled backup that stops the database. This has been in place for years, but it now takes longer than it used to. ID: 32952
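
Editor's sketch (not part of the original posts): to see whether an observed outage lines up with the daily backup, a timestamp can be compared against a window starting at 16:00 UTC. The thread does not say how long the backup runs, so the 60-minute default below is purely an assumption, and `in_backup_window` is a hypothetical helper name. A match only suggests a correlation, not a confirmed cause.

```python
from datetime import datetime, time, timedelta, timezone

# Backup start as stated in the post above: 17:00 CET, i.e. 16:00 UTC.
BACKUP_START_UTC = time(16, 0)


def in_backup_window(ts, assumed_minutes=60):
    """Return True if the timezone-aware timestamp ts falls in the daily backup window.

    assumed_minutes is a guess; the thread does not state the backup duration.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    start = datetime.combine(ts.date(), BACKUP_START_UTC, tzinfo=timezone.utc)
    return start <= ts < start + timedelta(minutes=assumed_minutes)


# Timestamp of the opening post in this thread (Message 32936).
first_report = datetime(2017, 10, 30, 16, 53, 9, tzinfo=timezone.utc)
first_report_in_window = in_backup_window(first_report)
```

Widening or narrowing `assumed_minutes` changes which reports count as falling inside the window, so the value is worth adjusting once the actual backup duration is known.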

Nils Volunteer moderator Project administrator Project developer Project tester Joined: 15 Jul 05 Posts: 254 Credit: 6,001,083 RAC: 0	Message 32958 - Posted: 1 Nov 2017, 14:55:08 UTC - in response to Message 32952. Our DB experts tell us that a lock held by one of the BOINC daemons is not being released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions. ID: 32958

gyllic Joined: 9 Dec 14 Posts: 202 Credit: 2,660,212 RAC: 10	Message 32960 - Posted: 1 Nov 2017, 22:11:22 UTC - in response to Message 32958. Quote: "Our DB experts tell us that there is a lock due to one of the BOINC daemons that is not released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions." Thanks for the information! ID: 32960

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 32961 - Posted: 1 Nov 2017, 23:46:43 UTC - in response to Message 32958. Last modified: 2 Nov 2017, 0:18:35 UTC Quote: "Our DB experts tell us that there is a lock due to one of the BOINC daemons that is not released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions." It was just before 1 Nov 2017, 14:33:00 UTC that all 60 WUs across 3 computers stopped processing; then a string of ~50 ended in the various errors listed above. The last 8 WUs to error out did so at 1 Nov 2017, 17:26:27 UTC. Thanks for the response. ID: 32961