Condor Problems?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2681 Credit: 286,872,323 RAC: 58,502	Message 28116 - Posted: 14 Dec 2016, 18:39:44 UTC My currently running WUs lost their condor connection (ports 9623, 9818). Is it a server problem? ID: 28116 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 880 Credit: 746,625,445 RAC: 323,858	Message 28117 - Posted: 14 Dec 2016, 19:11:19 UTC looks like it: 12/14/16 20:09:40 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds. 12/14/16 20:10:01 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111). 12/14/16 20:10:01 ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9623 failed ID: 28117 · Reply Quote

ritterm Send message Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0	Message 28119 - Posted: 14 Dec 2016, 20:26:02 UTC In case it helps, me too. Logs from a moment ago: 2016-12-14 15:17:46 (21300): Guest Log: [INFO] CMS application starting. Check log files. 2016-12-14 15:17:46 (21300): Guest Log: [DEBUG] HTCondor ping 2016-12-14 15:17:47 (21300): Guest Log: [DEBUG] 1 2016-12-14 15:17:47 (21300): Guest Log: [DEBUG] 12/14/16 15:17:46 recognized DC_NOP as command name, using command 60011. 2016-12-14 15:17:47 (21300): Guest Log: 12/14/16 15:17:46 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111). 2016-12-14 15:17:47 (21300): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623> 2016-12-14 15:17:47 (21300): Guest Log: [ERROR] Could not ping HTCondor. 2016-12-14 15:17:47 (21300): Guest Log: [INFO] Shutting Down. ID: 28119 · Reply Quote
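The "Connection refused (connect errno = 111)" lines above mean the TCP connection to the collector port was rejected outright. A minimal sketch of an equivalent check from a volunteer host, assuming only the host name and port that appear in the logs; the function name `port_is_open` is made up for illustration and is not part of HTCondor or BOINC:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only tests TCP reachability and says nothing
    about whether the HTCondor daemons behind the port are healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (errno 111 on Linux), timeouts,
        # and name-resolution failures.
        return False
```

Calling `port_is_open("lcggwms02.gridpp.rl.ac.uk", 9623)` would be expected to return `False` while the server refuses connections as in the logs above. A `True` result only shows that the port accepts connections, not that jobs are being served.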

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 407 Credit: 238,712 RAC: 0	Message 28120 - Posted: 14 Dec 2016, 23:19:55 UTC - in response to Message 28119. Yep, looks like the Condor server at RAL. ID: 28120 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28121 - Posted: 15 Dec 2016, 0:00:25 UTC - in response to Message 28120. Yep, looks like the Condor server at RAL. More info here. ID: 28121 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2681 Credit: 286,872,323 RAC: 58,502	Message 28124 - Posted: 15 Dec 2016, 14:51:15 UTC Some lines from stderr.txt of the currently running WU: 2016-12-15 09:45:55 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 09:45:55 (13764): Guest Log: [INFO] Condor JobID: 1806162 in slot1 2016-12-15 09:46:10 (13764): Guest Log: [INFO] CRAB ID: 6604 in slot1 2016-12-15 11:37:34 (13764): Guest Log: [INFO] Job finished in slot1 with 151. 2016-12-15 11:37:41 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 11:37:41 (13764): Guest Log: [INFO] Condor JobID: 1805854 in slot1 2016-12-15 11:37:51 (13764): Guest Log: [INFO] CRAB ID: 6313 in slot1 2016-12-15 13:24:07 (13764): Guest Log: [INFO] Job finished in slot1 with 151. 2016-12-15 13:24:14 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 13:24:14 (13764): Guest Log: [INFO] Condor JobID: 1806819 in slot1 2016-12-15 13:24:23 (13764): Guest Log: [INFO] CRAB ID: 7284 in slot1 2016-12-15 15:15:13 (13764): Guest Log: [INFO] Job finished in slot1 with 151. 2016-12-15 15:15:18 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 15:15:18 (13764): Guest Log: [INFO] Condor JobID: 1808060 in slot1 2016-12-15 15:15:28 (13764): Guest Log: [INFO] CRAB ID: 472 in slot1 Jobs that finish without an error normally upload a result via a PUT request to vc-cms-output.cs3.cern.ch. Since the condor servers are back I miss this PUT in my logs. There is nothing, not even an aborted upload. ID: 28124 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2681 Credit: 286,872,323 RAC: 58,502	Message 28126 - Posted: 15 Dec 2016, 16:52:03 UTC - in response to Message 28124. At least 1 successful job at 2016-12-15 16:24:06 UTC. BOINC/slots/2/stderr.txt:2016-12-15 17:24:06 (13764): Guest Log: [INFO] Job finished in slot1 with 0. Current error rate is 75%. ID: 28126 · Reply Quote
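The 75% figure can be reproduced from the "Job finished in slotN with CODE." lines quoted in this thread (three jobs ending with 151, one with 0). A minimal sketch, assuming any non-zero code counts as a failure; the function names are hypothetical and the sample text is copied from the posts above:

```python
import re
from collections import Counter

# Matches e.g. "Job finished in slot1 with 151."
FINISH_RE = re.compile(r"Job finished in slot\d+ with (\d+)\.")


def exit_codes(log_text: str) -> Counter:
    """Count how often each job exit code appears in the log text."""
    return Counter(int(m.group(1)) for m in FINISH_RE.finditer(log_text))


def error_rate(log_text: str) -> float:
    """Fraction of finished jobs with a non-zero exit code (0.0 if none finished)."""
    codes = exit_codes(log_text)
    total = sum(codes.values())
    if total == 0:
        return 0.0
    return (total - codes.get(0, 0)) / total


SAMPLE = """\
2016-12-15 11:37:34 (13764): Guest Log: [INFO] Job finished in slot1 with 151.
2016-12-15 13:24:07 (13764): Guest Log: [INFO] Job finished in slot1 with 151.
2016-12-15 15:15:13 (13764): Guest Log: [INFO] Job finished in slot1 with 151.
2016-12-15 17:24:06 (13764): Guest Log: [INFO] Job finished in slot1 with 0.
"""

print(error_rate(SAMPLE))  # 0.75
```

To run it against a real file, read the contents of a slot's stderr.txt (for example BOINC/slots/2/stderr.txt) and pass the text to `error_rate`.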

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28129 - Posted: 16 Dec 2016, 10:29:13 UTC - in response to Message 28124. Some lines from stderr.txt of the currently running WU: 2016-12-15 09:45:55 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 09:45:55 (13764): Guest Log: [INFO] Condor JobID: 1806162 in slot1 2016-12-15 09:46:10 (13764): Guest Log: [INFO] CRAB ID: 6604 in slot1 2016-12-15 11:37:34 (13764): Guest Log: [INFO] Job finished in slot1 with 151. 2016-12-15 11:37:41 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 11:37:41 (13764): Guest Log: [INFO] Condor JobID: 1805854 in slot1 2016-12-15 11:37:51 (13764): Guest Log: [INFO] CRAB ID: 6313 in slot1 2016-12-15 13:24:07 (13764): Guest Log: [INFO] Job finished in slot1 with 151. 2016-12-15 13:24:14 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 13:24:14 (13764): Guest Log: [INFO] Condor JobID: 1806819 in slot1 2016-12-15 13:24:23 (13764): Guest Log: [INFO] CRAB ID: 7284 in slot1 2016-12-15 15:15:13 (13764): Guest Log: [INFO] Job finished in slot1 with 151. 2016-12-15 15:15:18 (13764): Guest Log: [INFO] New Job Starting in slot1 2016-12-15 15:15:18 (13764): Guest Log: [INFO] Condor JobID: 1808060 in slot1 2016-12-15 15:15:28 (13764): Guest Log: [INFO] CRAB ID: 472 in slot1 Jobs that finish without an error normally upload a result via a PUT request to vc-cms-output.cs3.cern.ch. Since the condor servers are back I miss this PUT in my logs. There is nothing, not even an aborted upload. I'm afraid you were caught by the bunch of "echoes" from the Condor failure. Jobs completed and staged out during the problem, but Condor was unable to log this. So Condor timed them out and rescheduled (I guess) when it was healthy again, still as attempt 0. 
When the rescheduled job tried to stage out it found the original result file already there, so it deleted it and reported an error, upon which Condor rescheduled it as attempt 1. :-( ID: 28129 · Reply Quote
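The sequence described above can be walked through with a toy model. This is only a sketch of the explanation as given (itself partly a guess), not HTCondor or CRAB code; the class, its fields and the return strings are all invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the scheduler plus the output storage."""
    storage: set = field(default_factory=set)      # result files already staged out
    logged_done: set = field(default_factory=set)  # jobs the scheduler knows finished
    scheduler_up: bool = True

    def run_job(self, job_id: str) -> str:
        result = f"{job_id}.result"
        if result in self.storage:
            # Stage-out finds an earlier result file: delete it, report an error.
            self.storage.discard(result)
            return "stage-out error"
        self.storage.add(result)
        if self.scheduler_up:
            self.logged_done.add(job_id)
            return "ok"
        # The job staged out, but the scheduler could not record it.
        return "ok, but not logged"


sched = ToyScheduler(scheduler_up=False)
first = sched.run_job("job-A")   # during the outage: result stored, not logged
sched.scheduler_up = True
second = sched.run_job("job-A")  # rescheduled as attempt 0: finds the old file
third = sched.run_job("job-A")   # rescheduled as attempt 1: succeeds and is logged
print(first, "|", second, "|", third)
```

In this model each job that completed during the outage costs one failed rerun before it succeeds, which matches the observation that the errors clear once the affected jobs have left the queue.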

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2681 Credit: 286,872,323 RAC: 58,502	Message 28130 - Posted: 16 Dec 2016, 10:58:15 UTC - in response to Message 28129. So, the error 151 disappears after a couple of WUs and there's no need to inform the project admins? It's one of those things that corrects itself "automagically" when the faulty jobs leave the queue, right? Perhaps I reconnected too early. My currently running WUs show 100% success rate: BOINC/slots/4/stderr.txt:2016-12-16 01:34:04 (4480): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/4/stderr.txt:2016-12-16 03:48:34 (4480): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/4/stderr.txt:2016-12-16 06:30:29 (4480): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/4/stderr.txt:2016-12-16 08:50:32 (4480): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/4/stderr.txt:2016-12-16 11:13:23 (4480): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/6/stderr.txt:2016-12-16 01:13:16 (15113): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/6/stderr.txt:2016-12-16 03:05:51 (15113): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/6/stderr.txt:2016-12-16 05:15:57 (15113): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/6/stderr.txt:2016-12-16 07:02:49 (15113): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/6/stderr.txt:2016-12-16 08:52:20 (15113): Guest Log: [INFO] Job finished in slot1 with 0. BOINC/slots/6/stderr.txt:2016-12-16 10:40:22 (15113): Guest Log: [INFO] Job finished in slot1 with 0. ID: 28130 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28134 - Posted: 16 Dec 2016, 15:21:20 UTC - in response to Message 28130. Yes, it's "self healing" to a large extent. I do monitor the graphs as best I can, and large spikes send me trawling through the log files until I think I understand them. ID: 28134 · Reply Quote

ritterm Send message Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0	Message 28170 - Posted: 21 Dec 2016, 4:15:13 UTC Last modified: 21 Dec 2016, 4:16:48 UTC More Condor problems? I just noticed several CMS jobs with similar output. 2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping 2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 1 2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 12/20/16 22:59:37 recognized DC_NOP as command name, using command 60011. 2016-12-20 22:59:37 (29181): Guest Log: 12/20/16 22:59:37 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111). 2016-12-20 22:59:37 (29181): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623> 2016-12-20 22:59:37 (29181): Guest Log: [ERROR] Could not ping HTCondor. 2016-12-20 22:59:37 (29181): Guest Log: [INFO] Shutting Down. 2016-12-20 22:59:37 (29181): VM Completion File Detected. 2016-12-20 22:59:37 (29181): VM Completion Message: Could not ping HTCondor. ID: 28170 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 407 Credit: 238,712 RAC: 0	Message 28171 - Posted: 21 Dec 2016, 9:06:59 UTC - in response to Message 28170. More Condor problems? Looks like it. ID: 28171 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28172 - Posted: 21 Dec 2016, 9:39:58 UTC - in response to Message 28170. I deleted a couple of large log files in /var/backup. I'll go through compressing some of the older stuff later, but I don't have permission on all of the files. ID: 28172 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28174 - Posted: 21 Dec 2016, 10:44:33 UTC - in response to Message 28172. Last modified: 21 Dec 2016, 11:00:55 UTC ~~Please set No New Tasks for CMS until Condor is working again.~~ Please ignore this message now, Condor is up again. ID: 28174 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 780 Credit: 59,956,920 RAC: 47,650	Message 28709 - Posted: 29 Jan 2017, 9:37:57 UTC During the early hours of 29 January, 5 CMS tasks were unable to load jobs from Condor. Ping was successful (0) but no jobs were sent to the host. Earlier (28th) there was no problem and I had finished tasks. Here's one of the failures: https://lhcathome.cern.ch/lhcathome/result.php?resultid=117282122 ID: 28709 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28710 - Posted: 29 Jan 2017, 11:11:53 UTC - in response to Message 28709. During the early hours of 29 January, 5 CMS tasks were unable to load jobs from Condor. Ping was successful (0) but no jobs were sent to the host. Earlier (28th) there was no problem and I had finished tasks. Here's one of the failures: https://lhcathome.cern.ch/lhcathome/result.php?resultid=117282122 Yes, something's happened with jobs getting to the Condor server. WMStatus says there are plenty of jobs to be sent. Investigating (as well as I can from this far away...) ID: 28710 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28711 - Posted: 29 Jan 2017, 11:25:47 UTC - in response to Message 28710. Last modified: 29 Jan 2017, 13:00:05 UTC I've submitted a new batch of jobs in case WMStatus is badly reporting the jobs to be sent, but its report coincides with what I see on the Condor server status and is consistent with the behaviour we are seeing. I've notified Laurence. [Edit] The new batch turned up on WMStatus, so I think that is working as expected. [/Edit] ID: 28711 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2681 Credit: 286,872,323 RAC: 58,502	Message 28718 - Posted: 29 Jan 2017, 18:06:15 UTC CMS urgently needs admin intervention. ID: 28718 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28721 - Posted: 29 Jan 2017, 18:57:02 UTC - in response to Message 28718. CMS urgently needs admin intervention. I know, but they're a bit thin on the ground on Sundays. As far as I can tell the WMAgent server is working, and the Condor server is definitely working (I can query it for job details, etc). Communications between them and/or the WMStatus system also appear to be working. I'm now starting to suspect something like a full partition on the Condor server, but I don't think I can query that with condor_* commands -- and it's still serving jobs for the other apps. Sorry, all we can do is wait. ID: 28721 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,434,225 RAC: 8,167	Message 28726 - Posted: 30 Jan 2017, 9:03:19 UTC - in response to Message 28718. Turns out the problem was with WMAgent. The error handler failed so it wasn't processing errors and job slots weren't cleared, leading to a bottleneck and no new jobs being sent. Seems to be largely cleared now. ID: 28726 · Reply Quote

LHC@home