Message boards : CMS Application : Condor Problems?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1449
Credit: 77,231,598
RAC: 94,239
Message 28116 - Posted: 14 Dec 2016, 18:39:44 UTC

My currently running WUs lost their condor connection (ports 9623, 9818).
Is it a server problem?

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 598
Credit: 373,514,378
RAC: 43,650
Message 28117 - Posted: 14 Dec 2016, 19:11:19 UTC

Looks like it:

12/14/16 20:09:40 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
12/14/16 20:10:01 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
12/14/16 20:10:01 ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9623 failed
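
For anyone who wants to check from their own host whether the collector port is reachable, here is a minimal sketch using only the Python standard library. The function name `probe_tcp` and its return strings are made up for illustration; the host and port in the usage note are simply the ones from the log above. A "refused" result matches the `errno = 111` (ECONNREFUSED on Linux) seen there.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a plain TCP connect and return a short status string.

    Hypothetical helper: this only tests reachability, it does not
    speak the HTCondor protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening on that port (errno 111 on Linux).
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout, e.g. a firewall drop.
        return "timeout"
    except OSError as exc:
        # Anything else: DNS failure, unreachable network, ...
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Usage would be something like `probe_tcp("lcggwms02.gridpp.rl.ac.uk", 9623)`. If it returns "refused" while other sites are reachable, the problem is probably on the server side rather than in the local network.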

ritterm
Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 28119 - Posted: 14 Dec 2016, 20:26:02 UTC

In case it helps, I'm seeing the same thing. Logs from a moment ago:

2016-12-14 15:17:46 (21300): Guest Log: [INFO] CMS application starting. Check log files.
2016-12-14 15:17:46 (21300): Guest Log: [DEBUG] HTCondor ping
2016-12-14 15:17:47 (21300): Guest Log: [DEBUG] 1
2016-12-14 15:17:47 (21300): Guest Log: [DEBUG] 12/14/16 15:17:46 recognized DC_NOP as command name, using command 60011.
2016-12-14 15:17:47 (21300): Guest Log: 12/14/16 15:17:46 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
2016-12-14 15:17:47 (21300): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623>
2016-12-14 15:17:47 (21300): Guest Log: [ERROR] Could not ping HTCondor.
2016-12-14 15:17:47 (21300): Guest Log: [INFO] Shutting Down.
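
When comparing reports from several hosts it can help to pull the failing endpoint and errno out of these lines automatically. The sketch below assumes the line format shown in the log above; the names `ConnectFailure` and `parse_connect_failure` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ConnectFailure(NamedTuple):
    host: str
    port: int
    errno: int


# Matches e.g. "attempt to connect to <1.2.3.4:9623> failed: ... (connect errno = 111)"
_FAILED = re.compile(
    r"attempt to connect to <(?P<host>[^:>]+):(?P<port>\d+)> failed: "
    r".*\(connect errno = (?P<errno>\d+)\)"
)


def parse_connect_failure(line: str) -> Optional[ConnectFailure]:
    """Return the failing endpoint and errno, or None if the line is unrelated."""
    m = _FAILED.search(line)
    if m is None:
        return None
    return ConnectFailure(m["host"], int(m["port"]), int(m["errno"]))
```

Feeding it every line of stderr.txt and keeping the non-None results gives a quick list of which endpoints refused connections and when.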

Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 336
Credit: 237,918
RAC: 0
Message 28120 - Posted: 14 Dec 2016, 23:19:55 UTC - in response to Message 28119.  

Yep, looks like the Condor server at RAL.

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28121 - Posted: 15 Dec 2016, 0:00:25 UTC - in response to Message 28120.  

More info here.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1449
Credit: 77,231,598
RAC: 94,239
Message 28124 - Posted: 15 Dec 2016, 14:51:15 UTC

Some lines from stderr.txt of the currently running WU:

2016-12-15 09:45:55 (13764): Guest Log: [INFO] New Job Starting in slot1
2016-12-15 09:45:55 (13764): Guest Log: [INFO] Condor JobID: 1806162 in slot1
2016-12-15 09:46:10 (13764): Guest Log: [INFO] CRAB ID: 6604 in slot1
2016-12-15 11:37:34 (13764): Guest Log: [INFO] Job finished in slot1 with 151.
2016-12-15 11:37:41 (13764): Guest Log: [INFO] New Job Starting in slot1
2016-12-15 11:37:41 (13764): Guest Log: [INFO] Condor JobID: 1805854 in slot1
2016-12-15 11:37:51 (13764): Guest Log: [INFO] CRAB ID: 6313 in slot1
2016-12-15 13:24:07 (13764): Guest Log: [INFO] Job finished in slot1 with 151.
2016-12-15 13:24:14 (13764): Guest Log: [INFO] New Job Starting in slot1
2016-12-15 13:24:14 (13764): Guest Log: [INFO] Condor JobID: 1806819 in slot1
2016-12-15 13:24:23 (13764): Guest Log: [INFO] CRAB ID: 7284 in slot1
2016-12-15 15:15:13 (13764): Guest Log: [INFO] Job finished in slot1 with 151.
2016-12-15 15:15:18 (13764): Guest Log: [INFO] New Job Starting in slot1
2016-12-15 15:15:18 (13764): Guest Log: [INFO] Condor JobID: 1808060 in slot1
2016-12-15 15:15:28 (13764): Guest Log: [INFO] CRAB ID: 472 in slot1

Jobs that finish without an error normally upload a result via a PUT request to vc-cms-output.cs3.cern.ch.

Since the Condor servers came back, this PUT has been missing from my logs.
There is nothing, not even an aborted upload.
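
To see at a glance which Condor job ended with which exit code, the "Condor JobID" and "Job finished" lines can be paired up per slot. This is a rough sketch that assumes the log format quoted above; `pair_jobs` is a hypothetical helper, not part of BOINC or the CMS app.

```python
import re
from typing import Dict, List, Optional, Tuple

_JOB_ID = re.compile(r"Condor JobID: (\d+) in (slot\d+)")
_FINISHED = re.compile(r"Job finished in (slot\d+) with (\d+)\.")


def pair_jobs(lines: List[str]) -> List[Tuple[int, Optional[int]]]:
    """Return (condor_job_id, exit_code) pairs in start order.

    exit_code is None for a job that has started but not yet finished.
    """
    running: Dict[str, int] = {}  # slot name -> index into results
    results: List[Tuple[int, Optional[int]]] = []
    for line in lines:
        m = _JOB_ID.search(line)
        if m:
            running[m.group(2)] = len(results)
            results.append((int(m.group(1)), None))
            continue
        m = _FINISHED.search(line)
        if m and m.group(1) in running:
            idx = running.pop(m.group(1))
            results[idx] = (results[idx][0], int(m.group(2)))
    return results
```

Run over the excerpt above, this would pair jobs 1806162, 1805854 and 1806819 with exit code 151 and leave 1808060 as still running.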

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1449
Credit: 77,231,598
RAC: 94,239
Message 28126 - Posted: 15 Dec 2016, 16:52:03 UTC - in response to Message 28124.  

At least 1 successful job at 2016-12-15 16:24:06 UTC.
BOINC/slots/2/stderr.txt:2016-12-15 17:24:06 (13764): Guest Log: [INFO] Job finished in slot1 with 0.

Current error rate is 75%.
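
The 75% is presumably the three jobs that ended with 151 in the earlier log plus this one success. A small sketch of that arithmetic, assuming the "Job finished in slotN with CODE." line format (with or without a grep-style file prefix); the function names are made up for illustration.

```python
import re
from typing import Iterable, List

_EXIT = re.compile(r"Job finished in slot\d+ with (\d+)\.")


def exit_codes(lines: Iterable[str]) -> List[int]:
    """Collect the exit code of every finished job found in the log lines."""
    codes = []
    for line in lines:
        m = _EXIT.search(line)
        if m:
            codes.append(int(m.group(1)))
    return codes


def error_rate(codes: List[int]) -> float:
    """Fraction of jobs with a non-zero exit code (0.0 if there are no jobs)."""
    if not codes:
        return 0.0
    failed = sum(1 for code in codes if code != 0)
    return failed / len(codes)
```

With the codes `[151, 151, 151, 0]` this gives 0.75.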

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28129 - Posted: 16 Dec 2016, 10:29:13 UTC - in response to Message 28124.  

I'm afraid you were caught by the bunch of "echoes" from the Condor failure. Jobs completed and staged out during the problem, but Condor was unable to log this. So Condor timed them out and rescheduled (I guess) when it was healthy again, still as attempt 0. When the rescheduled job tried to stage out it found the original result file already there, so it deleted it and reported an error, upon which Condor rescheduled it as attempt 1. :-(
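
The sequence described above can be replayed with a toy model. Everything here is an assumption for illustration: the real stage-out code is not shown in this thread, the link between exit code 151 and this particular cause is inferred from the logs, and attempt 1 succeeding is simply what the model predicts once the stale file is gone.

```python
from typing import Dict, List, Tuple

STAGEOUT_CONFLICT = 151  # exit code seen in the logs; mapping to this cause is assumed


def stage_out(store: Dict[int, str], crab_id: int, payload: str) -> int:
    """Toy stage-out: if a result already exists, delete it and fail."""
    if crab_id in store:
        del store[crab_id]        # the existing result file is removed...
        return STAGEOUT_CONFLICT  # ...and the job reports an error
    store[crab_id] = payload
    return 0


def replay(crab_id: int) -> List[Tuple[str, int]]:
    """Replay the three runs of one job around the Condor outage."""
    store: Dict[int, str] = {}
    history = []
    # 1. The job completes and stages out during the outage; Condor never logs it.
    history.append(("original (unlogged)", stage_out(store, crab_id, "result")))
    # 2. Condor times the job out and reschedules it, still as attempt 0.
    history.append(("attempt 0", stage_out(store, crab_id, "result")))
    # 3. The error leads to a reschedule as attempt 1; the file is gone by now.
    history.append(("attempt 1", stage_out(store, crab_id, "result")))
    return history
```

In this model `replay(6604)` yields exit codes 0, 151, 0 in that order, which is one way to read why the errors fade once the affected jobs have been through the queue again.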

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1449
Credit: 77,231,598
RAC: 94,239
Message 28130 - Posted: 16 Dec 2016, 10:58:15 UTC - in response to Message 28129.  

So, the error 151 disappears after a couple of WUs and there's no need to inform the project admins?
It's one of those things that corrects itself "automagically" when the faulty jobs leave the queue, right?

Perhaps I reconnected too early.
My currently running WUs show 100% success rate:
BOINC/slots/4/stderr.txt:2016-12-16 01:34:04 (4480): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/4/stderr.txt:2016-12-16 03:48:34 (4480): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/4/stderr.txt:2016-12-16 06:30:29 (4480): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/4/stderr.txt:2016-12-16 08:50:32 (4480): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/4/stderr.txt:2016-12-16 11:13:23 (4480): Guest Log: [INFO] Job finished in slot1 with 0.

BOINC/slots/6/stderr.txt:2016-12-16 01:13:16 (15113): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/6/stderr.txt:2016-12-16 03:05:51 (15113): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/6/stderr.txt:2016-12-16 05:15:57 (15113): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/6/stderr.txt:2016-12-16 07:02:49 (15113): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/6/stderr.txt:2016-12-16 08:52:20 (15113): Guest Log: [INFO] Job finished in slot1 with 0.
BOINC/slots/6/stderr.txt:2016-12-16 10:40:22 (15113): Guest Log: [INFO] Job finished in slot1 with 0.

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28134 - Posted: 16 Dec 2016, 15:21:20 UTC - in response to Message 28130.  

Yes, it's "self healing" to a large extent. I do monitor the graphs as best I can, and large spikes send me trawling through the log files until I think I understand them.

ritterm
Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 28170 - Posted: 21 Dec 2016, 4:15:13 UTC
Last modified: 21 Dec 2016, 4:16:48 UTC

More Condor problems? I just noticed several CMS jobs with similar output.

2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 1
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 12/20/16 22:59:37 recognized DC_NOP as command name, using command 60011.
2016-12-20 22:59:37 (29181): Guest Log: 12/20/16 22:59:37 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
2016-12-20 22:59:37 (29181): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623>
2016-12-20 22:59:37 (29181): Guest Log: [ERROR] Could not ping HTCondor.
2016-12-20 22:59:37 (29181): Guest Log: [INFO] Shutting Down.
2016-12-20 22:59:37 (29181): VM Completion File Detected.
2016-12-20 22:59:37 (29181): VM Completion Message: Could not ping HTCondor.

Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 336
Credit: 237,918
RAC: 0
Message 28171 - Posted: 21 Dec 2016, 9:06:59 UTC - in response to Message 28170.  

Looks like it.

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28172 - Posted: 21 Dec 2016, 9:39:58 UTC - in response to Message 28170.  

I deleted a couple of large log files in /var/backup. I'll go through compressing some of the older stuff later, but I don't have permission on all of the files.
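
For the "compress the older stuff" part, a minimal sketch of what such a cleanup could look like with the Python standard library. The age threshold, the `*.log` pattern and the function name are assumptions, not what was actually run on the server. Files that cannot be read or removed because of permissions are skipped.

```python
import gzip
import shutil
import time
from pathlib import Path
from typing import List


def compress_old_logs(directory: Path, min_age_days: float = 7.0,
                      pattern: str = "*.log") -> List[Path]:
    """Gzip files older than min_age_days and remove the originals.

    Returns the list of .gz files that were created.
    """
    cutoff = time.time() - min_age_days * 86400
    compressed = []
    for path in sorted(directory.glob(pattern)):
        if not path.is_file() or path.stat().st_mtime > cutoff:
            continue  # not a regular file, or too recent
        target = path.with_name(path.name + ".gz")
        try:
            with path.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()
        except PermissionError:
            # Not our file to touch: drop any partial copy and move on.
            if target.exists():
                target.unlink()
            continue
        compressed.append(target)
    return compressed
```

Something like `compress_old_logs(Path("/var/backup"), min_age_days=30)` would be the call for the directory mentioned above, run by a user with write access there.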

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28174 - Posted: 21 Dec 2016, 10:44:33 UTC - in response to Message 28172.  
Last modified: 21 Dec 2016, 11:00:55 UTC

Please set No New Tasks for CMS until Condor is working again.

Please ignore this message now, Condor is up again.

Harri Liljeroos
Joined: 28 Sep 04
Posts: 439
Credit: 23,169,801
RAC: 13,997
Message 28709 - Posted: 29 Jan 2017, 9:37:57 UTC

During the early hours of 29 January, five CMS tasks were unable to load jobs from Condor. The ping was successful (0), but no jobs were sent to the host. Earlier (on the 28th) there was no problem and my tasks finished normally. Here's one of the failures: https://lhcathome.cern.ch/lhcathome/result.php?resultid=117282122

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28710 - Posted: 29 Jan 2017, 11:11:53 UTC - in response to Message 28709.  

Yes, something's happened with jobs getting to the Condor server. WMStatus says there are plenty of jobs to be sent. Investigating (as well as I can from this far away...)

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28711 - Posted: 29 Jan 2017, 11:25:47 UTC - in response to Message 28710.  
Last modified: 29 Jan 2017, 13:00:05 UTC

I've submitted a new batch of jobs in case WMStatus is misreporting the jobs to be sent, but its report coincides with what I see on the Condor server status and is consistent with the behaviour we are seeing. I've notified Laurence.

[Edit] The new batch turned up on WMStatus, so I think that is working as expected. [/Edit]

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1449
Credit: 77,231,598
RAC: 94,239
Message 28718 - Posted: 29 Jan 2017, 18:06:15 UTC

CMS urgently needs admin intervention.

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28721 - Posted: 29 Jan 2017, 18:57:02 UTC - in response to Message 28718.  

I know, but they're a bit thin on the ground on Sundays. As far as I can tell the WMAgent server is working, and the Condor server is definitely working (I can query it for job details, etc). Communications between them and/or the WMStatus system also appear to be working. I'm now starting to suspect something like a full partition on the Condor server, but I don't think I can query that with condor_* commands -- and it's still serving jobs for the other apps.

Sorry, all we can do is wait.
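
For reference, the full-partition suspicion is easy to check for anyone who does have a shell on the machine in question. This is a minimal standard-library sketch; the 95% threshold and the function names are arbitrary choices for illustration, and it has to run on the server itself, which is exactly the access missing here.

```python
import shutil


def partition_usage(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the partition holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a zero size
    return usage.used / usage.total


def is_nearly_full(path: str = "/", threshold: float = 0.95) -> bool:
    """True if the partition holding path is at or above the threshold."""
    return partition_usage(path) >= threshold
```

Pointing `is_nearly_full` at the directory where the server keeps its spool or log files would answer the question directly.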

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 672
Credit: 5,398,315
RAC: 10,749
Message 28726 - Posted: 30 Jan 2017, 9:03:19 UTC - in response to Message 28718.  

Turns out the problem was with WMAgent. The error handler failed, so it wasn't processing errors and job slots weren't cleared, leading to a bottleneck and no new jobs being sent. Seems to be largely cleared now.


©2020 CERN