Message boards : CMS Application : Condor Problems?
Joined: 15 Jun 08 Posts: 2563 Credit: 257,112,605 RAC: 112,922
My currently running WUs lost their Condor connection (ports 9623, 9818). Is it a server problem?
Joined: 27 Sep 08 Posts: 853 Credit: 696,387,302 RAC: 129,979
Looks like it:
12/14/16 20:09:40 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
12/14/16 20:10:01 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
12/14/16 20:10:01 ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9623 failed
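For anyone who wants to verify this from their own machine, here is a minimal reachability sketch. It only tests whether a TCP connection to the CCB/collector ports can be opened, which is enough to tell "connection refused" (a server-side outage, as in the log above) apart from a local network problem. The host and port numbers are taken from the log excerpt; this is not project-provided tooling.

```python
#!/usr/bin/env python3
"""Quick reachability probe for the Condor/CCB ports seen in the logs.

A sketch only: it checks whether a TCP connection can be opened, which
distinguishes a refused/closed port (errno 111, as in the log above)
from a local network problem.  Host and ports come from this thread.
"""
import socket

HOST = "lcggwms02.gridpp.rl.ac.uk"
PORTS = (9623, 9818)

for port in PORTS:
    try:
        with socket.create_connection((HOST, port), timeout=10):
            print(f"{HOST}:{port} reachable")
    except OSError as err:  # e.g. ConnectionRefusedError == errno 111
        print(f"{HOST}:{port} failed: {err}")
```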
Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0
In case it helps, me too. Logs from a moment ago:
2016-12-14 15:17:46 (21300): Guest Log: [INFO] CMS application starting. Check log files.
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0
Yep, looks like the Condor server at RAL.
Joined: 15 Jun 08 Posts: 2563 Credit: 257,112,605 RAC: 112,922
Some lines from stderr.txt of the currently running WU:
2016-12-15 09:45:55 (13764): Guest Log: [INFO] New Job Starting in slot1
Jobs that finish without an error normally upload a result via a PUT request to vc-cms-output.cs3.cern.ch. Since the Condor servers came back, I haven't seen this PUT in my logs. There is nothing, not even an aborted upload.
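To see whether any upload was attempted at all, the slot logs can be scanned for the output host name. The sketch below assumes the BOINC/slots/<n>/stderr.txt layout quoted in this thread and simply prints every line that mentions vc-cms-output.cs3.cern.ch; it is an illustration, not project tooling.

```python
#!/usr/bin/env python3
"""Scan BOINC slot logs for lines mentioning the CMS output host.

Assumptions: logs live under BOINC/slots/<n>/stderr.txt (as in the
excerpts quoted in this thread) and an attempted upload leaves a
guest-log line containing the output host name.
"""
import glob

PATTERN = "vc-cms-output.cs3.cern.ch"

for path in sorted(glob.glob("BOINC/slots/*/stderr.txt")):
    with open(path, errors="replace") as log:
        hits = [line.rstrip() for line in log if PATTERN in line]
    print(f"{path}: {len(hits)} matching line(s)")
    for line in hits:
        print("  " + line)
```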
Joined: 15 Jun 08 Posts: 2563 Credit: 257,112,605 RAC: 112,922
At least 1 successful job at 2016-12-15 16:24:06 UTC:
BOINC/slots/2/stderr.txt:2016-12-15 17:24:06 (13764): Guest Log: [INFO] Job finished in slot1 with 0.
Current error rate is 75%.
Joined: 29 Aug 05 Posts: 1065 Credit: 8,134,418 RAC: 13,358
> Some lines from stderr.txt of the currently running WU:
I'm afraid you were caught by the bunch of "echoes" from the Condor failure. Jobs completed and staged out during the problem, but Condor was unable to log this. So Condor timed them out and rescheduled them (I guess) when it was healthy again, still as attempt 0. When the rescheduled job tried to stage out, it found the original result file already there, so it deleted it and reported an error, upon which Condor rescheduled it as attempt 1. :-(
Joined: 15 Jun 08 Posts: 2563 Credit: 257,112,605 RAC: 112,922
So, the error 151 disappears after a couple of WUs and there's no need to inform the project admins? It's one of those things that corrects itself "automagically" when the faulty jobs leave the queue, right? Perhaps I reconnected too early. My currently running WUs show a 100% success rate:
BOINC/slots/4/stderr.txt:2016-12-16 01:34:04 (4480): Guest Log: [INFO] Job finished in slot1 with 0.
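The success and error rates quoted in the last two posts can be tallied automatically. The sketch below counts the "Job finished in slotN with <code>" guest-log lines across all slot directories and treats exit code 0 as success; the paths and log format are taken from the excerpts above, so adjust them for your own setup.

```python
#!/usr/bin/env python3
"""Tally job exit codes from the guest logs to estimate the success rate.

Based on the log format quoted in this thread:
    ... Guest Log: [INFO] Job finished in slot1 with 0.
Exit code 0 counts as success, anything else (e.g. 151) as an error.
"""
import glob
import re

FINISHED = re.compile(r"Job finished in slot\d+ with (\d+)")

codes = []
for path in glob.glob("BOINC/slots/*/stderr.txt"):
    with open(path, errors="replace") as log:
        for line in log:
            match = FINISHED.search(line)
            if match:
                codes.append(int(match.group(1)))

if codes:
    ok = sum(1 for code in codes if code == 0)
    print(f"{ok}/{len(codes)} jobs succeeded "
          f"({100.0 * ok / len(codes):.0f}% success rate)")
else:
    print("No finished jobs found in the logs.")
```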
Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0
More Condor problems? I just noticed several CMS jobs with similar output.
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0
> More Condor problems?
Looks like it.
Joined: 28 Sep 04 Posts: 737 Credit: 50,183,923 RAC: 25,081
During the early hours of 29 January, 5 CMS tasks were unable to load jobs from Condor. The ping was successful (0) but no jobs were sent to the host. Earlier (on the 28th) there was no problem and I had finished tasks. Here's one of the failures: https://lhcathome.cern.ch/lhcathome/result.php?resultid=117282122
Joined: 29 Aug 05 Posts: 1065 Credit: 8,134,418 RAC: 13,358
> During the early hours of 29 January, 5 CMS tasks were unable to load jobs from Condor. The ping was successful (0) but no jobs were sent to the host.
Yes, something's happened with jobs getting to the Condor server. WMStatus says there are plenty of jobs to be sent. Investigating (as well as I can from this far away...)
Joined: 29 Aug 05 Posts: 1065 Credit: 8,134,418 RAC: 13,358
I've submitted a new batch of jobs in case WMStatus is misreporting the jobs to be sent, but its report coincides with what I see on the Condor server status and is consistent with the behaviour we are seeing. I've notified Laurence.
[Edit] The new batch turned up on WMStatus, so I think that is working as expected. [/Edit]
Joined: 15 Jun 08 Posts: 2563 Credit: 257,112,605 RAC: 112,922
CMS urgently needs admin intervention.
Joined: 29 Aug 05 Posts: 1065 Credit: 8,134,418 RAC: 13,358
> CMS urgently needs admin intervention.
I know, but they're a bit thin on the ground on Sundays. As far as I can tell the WMAgent server is working, and the Condor server is definitely working (I can query it for job details, etc.). Communications between them and/or the WMStatus system also appear to be working. I'm now starting to suspect something like a full partition on the Condor server, but I don't think I can query that with condor_* commands -- and it's still serving jobs for the other apps. Sorry, all we can do is wait.
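For illustration, querying a Condor pool for schedd and queue status from the command line looks roughly like the sketch below. It assumes the standard HTCondor client tools are installed and uses the collector host/port seen in the earlier CCB log excerpts; a real query against the production pool would also need the appropriate credentials, which volunteers do not have, so treat it as a sketch of the kind of check described above rather than a recipe.

```python
#!/usr/bin/env python3
"""Illustrative pool query via the standard HTCondor command-line tools.

Assumptions: condor_status and condor_q are installed locally, and the
collector is the host:port seen in the CCB log earlier in this thread.
Without valid credentials the production pool will refuse the query.
"""
import subprocess

POOL = "lcggwms02.gridpp.rl.ac.uk:9623"

# List the schedd daemons the collector knows about.
subprocess.run(["condor_status", "-pool", POOL, "-schedd"], check=False)

# Show idle/running/held job totals across all schedds in the pool.
subprocess.run(["condor_q", "-pool", POOL, "-global", "-totals"], check=False)
```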