Message boards : Number crunching : No connection to servers for some time during past few days

Erich56

Joined: 18 Dec 15
Posts: 780
Credit: 5,482,422
RAC: 8,425
Message 32936 - Posted: 30 Oct 2017, 16:53:09 UTC

As other members may also have noticed, there have been several occasions over the past few days when the LHC homepage could not be reached, finished tasks could not be uploaded, and new tasks could not be downloaded.

Has the cause of this problem been identified yet?
ID: 32936
marmot
Joined: 5 Nov 15
Posts: 119
Credit: 5,250,392
RAC: 0
Message 32939 - Posted: 31 Oct 2017, 6:57:34 UTC - in response to Message 32936.  

I've been seeing numerous errors as my one server ramps up on Theory.

Some of these VM failures from 10/28 to 10/30 might then be connected to the potential server-side issues (I had hypothesized that my bandwidth to my ISP was too narrow, even after staggering the VMs' startup with previous jobs from another project in the queue).
"Guest Log: SECMAN: no classad from server, failing"
"Guest Log: [ERROR] Could not ping HTCondor."

"Guest Log: ERROR: couldn't locate (null)!"
"Guest Log: [ERROR] Could not ping HTCondor."

"Guest Log: SECMAN:2006:Failed to establish a crypto key."
"Guest Log: [ERROR] Could not ping HTCondor."

"Guest Log: [ERROR] Condor exited after 1741s without running a job."

"Guest Log: [ERROR] Could not connect to Condor server on port 9618"

"Guest Log: [ERROR] Could not connect to cern.ch on port 80"

"Guest Log: [ERROR] Could not connect to lhchomeproxy.cern.ch on port 3125"

"Guest Log: [ERROR] Could not connect to vccs1.cern.ch on port 443"
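
To help tell a local network problem from a server-side one, a quick TCP check of the hosts and ports named in these errors could be run from the affected machine. This is only a minimal sketch, assuming Python 3 is available on the host; the function names and the 5-second timeout are my own choices, and the Condor server on port 9618 is left out because its hostname does not appear in the logs.

```python
import socket

# Host/port pairs copied from the "Could not connect" errors above.
# The Condor server (port 9618) is omitted: its hostname is not in the logs.
ENDPOINTS = [
    ("cern.ch", 80),
    ("lhchomeproxy.cern.ch", 3125),
    ("vccs1.cern.ch", 443),
]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def check_all(endpoints=ENDPOINTS, timeout=5.0):
    """Map each "host:port" string to True (reachable) or False."""
    return {f"{host}:{port}": can_connect(host, port, timeout)
            for host, port in endpoints}
```

Calling `check_all()` while the errors are occurring, and again when things work, gives a rough idea of whether the endpoints are unreachable from this machine at that moment. A successful TCP connection only shows that the port is open, not that the service behind it is healthy.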
ID: 32939
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 168
Credit: 2,140,906
RAC: 3,619
Message 32942 - Posted: 31 Oct 2017, 7:21:54 UTC - in response to Message 32936.  
Last modified: 31 Oct 2017, 7:23:31 UTC

According to our monitoring and also tests from home, the LHC@home servers are fully reachable. There might be intermittent issues with your Internet provider.

For the Theory Condor issue, we will check our logs from the last few days.
ID: 32942
marmot
Joined: 5 Nov 15
Posts: 119
Credit: 5,250,392
RAC: 0
Message 32943 - Posted: 31 Oct 2017, 7:31:56 UTC - in response to Message 32942.  

"According to our monitoring and also tests from home, the LHC@home servers are fully reachable. there might be intermittent issues with your Internet provider."


Trying to run this many Theory WUs on my home 30 Mbit/s connection, I expected this limit might be reached. I will keep an eye on the failure rates.

"For the Theory Condor issue, we will check our logs from the last days."


For clarification, is
"Guest Log: [ERROR] Condor exited after 1741s without running a job."
the Theory Condor issue you refer to?
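
For keeping an eye on the failure rates, a small tally of the error lines can make it easier to see which message dominates. A minimal sketch, assuming the task output has been saved as plain text and that each relevant line contains `Guest Log:` followed by the message, as in the lines quoted earlier in this thread. The function name is made up for illustration, and only messages containing `ERROR` are counted, so the SECMAN lines are skipped.

```python
import re
from collections import Counter

# Captures everything after "Guest Log:" up to the end of the line.
GUEST_LOG = re.compile(r"Guest Log:\s*(.+?)\s*$")


def tally_guest_errors(lines):
    """Count identical "Guest Log" messages that contain the word ERROR."""
    counts = Counter()
    for line in lines:
        # Drop surrounding whitespace and the quotation marks used when pasting.
        match = GUEST_LOG.search(line.strip().strip('"'))
        if match and "ERROR" in match.group(1):
            counts[match.group(1)] += 1
    return counts
```

Passing the lines of a saved log to `tally_guest_errors(...)` and then calling `.most_common()` on the result lists the most frequent error messages first.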
ID: 32943
gyllic

Joined: 9 Dec 14
Posts: 149
Credit: 1,660,888
RAC: 1,747
Message 32946 - Posted: 31 Oct 2017, 8:51:58 UTC - in response to Message 32942.  
Last modified: 31 Oct 2017, 8:55:01 UTC

"According to our monitoring and also tests from home, the LHC@home servers are fully reachable. there might be intermittent issues with your Internet provider."

I have seen the same behaviour that Erich described. The homepage was reachable, but not with the normal UI (just a single page that gave some information). It said that the database server was currently down or not reachable, or something like that (if I remember correctly). It was also not possible to get new work in BOINC during that time. That happened at least twice during the last week or so.
ID: 32946
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 168
Credit: 2,140,906
RAC: 3,619
Message 32952 - Posted: 31 Oct 2017, 15:38:21 UTC - in response to Message 32946.  

"[It] was written that the database server is currently down or not reachable or something like that (if i remember it correctly). It was also not possible to get new work in BOINC during that time. That happened at least 2 times during the last week or so."


Thanks, that is more helpful. Every day at 17:00 CET (16:00 UTC) there is a scheduled backup that stops the database. This has been in place for years, but it now takes longer than it used to.
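
To check whether a failure timestamp falls inside this daily backup, something like the following might help. It is a rough sketch in Python: the 16:00 UTC start comes from the time given above, but the backup's duration is not stated anywhere in this thread, so `ASSUMED_DURATION` is purely a placeholder. It is also an assumption that the job runs at a fixed UTC time rather than following Geneva local time across the daylight-saving switch.

```python
from datetime import datetime, time, timedelta, timezone

# Start of the daily backup: 17:00 CET, i.e. 16:00 UTC.
BACKUP_START_UTC = time(16, 0)

# ASSUMPTION: the thread does not say how long the backup runs.
# 60 minutes is a placeholder; adjust it to what is actually observed.
ASSUMED_DURATION = timedelta(minutes=60)


def in_backup_window(ts, duration=ASSUMED_DURATION):
    """Return True if the timezone-aware datetime `ts` lies in that day's window.

    Windows that would run past midnight UTC are not handled.
    """
    ts = ts.astimezone(timezone.utc)
    start = datetime.combine(ts.date(), BACKUP_START_UTC, tzinfo=timezone.utc)
    return start <= ts < start + duration
```

For example, `in_backup_window(datetime(2017, 10, 30, 16, 53, 9, tzinfo=timezone.utc))` tests the time of the first post in this thread against the assumed window.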
ID: 32952
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 168
Credit: 2,140,906
RAC: 3,619
Message 32958 - Posted: 1 Nov 2017, 14:55:08 UTC - in response to Message 32952.  

Our DB experts tell us that a lock held by one of the BOINC daemons is not being released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions.
ID: 32958
gyllic

Joined: 9 Dec 14
Posts: 149
Credit: 1,660,888
RAC: 1,747
Message 32960 - Posted: 1 Nov 2017, 22:11:22 UTC - in response to Message 32958.  

"Our DB experts tell us that there is a lock due to one of the BOINC daemons that is not released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions."

Thanks for the information!
ID: 32960
marmot
Joined: 5 Nov 15
Posts: 119
Credit: 5,250,392
RAC: 0
Message 32961 - Posted: 1 Nov 2017, 23:46:43 UTC - in response to Message 32958.  
Last modified: 2 Nov 2017, 0:18:35 UTC

"Our DB experts tell us that there is a lock due to one of the BOINC daemons that is not released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions."


It was just before 1 Nov 2017, 14:33:00 UTC that all 60 WUs across 3 computers stopped processing; then a string of ~50 ended in the various errors listed above. The last 8 WUs to error out did so at 1 Nov 2017, 17:26:27 UTC.

Thanks for the response.
ID: 32961


©2018 CERN