No connection to servers for some time during past few days
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1689 Credit: 103,910,248 RAC: 121,786 |
As other members may also have noticed, over the past few days there have been several periods when the LHC homepage could not be reached, finished tasks could not be uploaded, and new tasks could not be downloaded. Has the cause of this problem been identified yet? |
Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
I have been seeing numerous errors as my one server ramps up on Theory. Some of the VM failures from 10/28 to 10/30 may therefore be connected to the possible server-side issues. (I had assumed that my bandwidth to my ISP was not wide enough, even after staggering the VMs' startup with previously queued jobs from another project.) The guest log shows: "Guest Log: SECMAN: no classad from server, failing" "Guest Log: [ERROR] Could not ping HTCondor." "Guest Log: ERROR: couldn't locate (null)!" "Guest Log: [ERROR] Could not ping HTCondor." "Guest Log: SECMAN:2006:Failed to establish a crypto key." "Guest Log: [ERROR] Could not ping HTCondor." "Guest Log: [ERROR] Condor exited after 1741s without running a job." "Guest Log: [ERROR] Could not connect to Condor server on port 9618" "Guest Log: [ERROR] Could not connect to cern.ch on port 80" "Guest Log: [ERROR] Could not connect to lhchomeproxy.cern.ch on port 3125" "Guest Log: [ERROR] Could not connect to vccs1.cern.ch on port 443" |
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0 |
According to our monitoring and also tests from home, the LHC@home servers are fully reachable. There might be intermittent issues with your Internet provider. For the Theory Condor issue, we will check our logs from the last few days. |
Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
"According to our monitoring and also tests from home, the LHC@home servers are fully reachable. There might be intermittent issues with your Internet provider." Since I am trying to run this many Theory WUs on my 30 Mbit home connection, I expected that this limit might be reached. I will keep an eye on the failure rates. "For the Theory Condor issue, we will check our logs from the last few days." For clarification, is "Guest Log: [ERROR] Condor exited after 1741s without running a job." the Theory Condor issue you refer to? |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
"According to our monitoring and also tests from home, the LHC@home servers are fully reachable. There might be intermittent issues with your Internet provider." I have seen the same behaviour that Erich described. The homepage was reachable, but not with the normal UI (just a single "page" that gave you some information). It was written that the database server is currently down or not reachable, or something like that (if I remember correctly). It was also not possible to get new work in BOINC during that time. That happened at least twice during the last week or so. |
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0 |
"was written that the database server is currently down or not reachable or something like that (if I remember correctly). It was also not possible to get new work in BOINC during that time. That happened at least 2 times during the last week or so." Thanks, that is more helpful. Every day at 17:00 CET (16:00 UTC) there is a scheduled backup that stops the database. This has been in place for years, but it now takes longer than it used to. |
Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0 |
Our DB experts tell us that there is a lock due to one of the BOINC daemons that is not released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
"Our DB experts tell us that there is a lock due to one of the BOINC daemons that is not released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions." Thanks for the information! |
Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
"Our DB experts tell us that there is a lock due to one of the BOINC daemons that is not released, hence the long backup snapshot time. We are looking into how to get rid of these interruptions." It was just before 1 Nov 2017, 14:33:00 UTC that all 60 WUs across 3 computers simply stopped processing, and then a string of about 50 ended in the various errors listed above. The last 8 WUs to error out did so at 1 Nov 2017, 17:26:27 UTC. Thanks for the response. |
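
The guest log quoted earlier in the thread names three host/port pairs that the VM could not reach. A volunteer who wants to tell a local bandwidth problem apart from a server-side outage could probe those same endpoints from the host machine. The following is only a minimal sketch, not part of BOINC or the LHC@home tooling: the function name `check_endpoints`, the 5-second timeout, and the injectable `connect` parameter are assumptions made for illustration. The Condor server on port 9618 is left out because the log does not name its host.

```python
import socket

# Host/port pairs copied from the "Could not connect to ..." guest log lines.
ENDPOINTS = [
    ("cern.ch", 80),
    ("lhchomeproxy.cern.ch", 3125),
    ("vccs1.cern.ch", 443),
]


def check_endpoints(endpoints=ENDPOINTS, timeout=5.0,
                    connect=socket.create_connection):
    """Attempt a plain TCP connection to each endpoint.

    Returns a dict mapping (host, port) to True if the connection
    succeeded and False otherwise. `connect` is injectable (an
    assumption of this sketch) so the logic can be exercised
    without network access.
    """
    results = {}
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
            conn.close()
            results[(host, port)] = True
        except OSError:
            # Covers DNS failures, refused connections and timeouts.
            results[(host, port)] = False
    return results
```

Calling `check_endpoints()` with no arguments and printing the returned dict shows which endpoints accepted a connection. A successful TCP connection only shows that the port is open. It says nothing about whether the database behind the service is available, so a result taken during the daily backup at 16:00 UTC may still look healthy.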
©2024 CERN