Thread 'CMS Tasks Failing'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 30755 - Posted: 12 Jun 2017, 9:48:57 UTC - in response to Message 30750. Problem understood. I'll leave it to those more closely involved to explain. :-0! ID: 30755 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1261 Credit: 92,451,605 RAC: 106,758	Message 31189 - Posted: 29 Jun 2017, 21:04:27 UTC Just took a look and I see I have 10 of these so far (5 Valids with the same PCs):
DC_NOP failed!
Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method
Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI :
Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:16) :
Guest Log: 06/29/17 19:45:43 recognized DC_NOP as command name, using command 60011.
Guest Log: 06/29/17 19:45:47 Condor GSI authentication failure :
Guest Log: GSS Major Status: Authentication Failed
Guest Log: GSS Minor Status Error Chain: :
Guest Log: globus_gss_assist: Error during context initialization
Guest Log: OpenSSL Error: s3_clnt.c:1178: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
Guest Log: globus_gsi_callback_module: Could not
Guest Log: globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash c2a48ab6
Guest Log: 06/29/17 19:45:48 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.
Guest Log: [ERROR] Could not ping HTCondor.
Guest Log: [INFO] Shutting Down.
VM Completion File Detected.
VM Completion Message: Could not ping HTCondor.
Might have to suspend these if I see it continue right now. ID: 31189 · Reply Quote
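For anyone who wants to check their own task output for the same failure, here is a minimal sketch that groups log lines by the markers quoted in the post above. The marker strings come from that log; the two category names and the `classify` helper are illustrative assumptions, not part of any BOINC or HTCondor tool.

```python
import re

# Marker strings are taken from the log quoted above. Splitting them into
# "auth" and "condor" groups is an assumption made for this sketch.
MARKERS = {
    "auth": re.compile(r"AUTHENTICATE:\d+:|GSI:\d+:|certificate verify failed"),
    "condor": re.compile(r"Could not ping HTCondor|DC_NOP failed"),
}


def classify(lines):
    """Return {category: [matching lines]} for an iterable of log lines."""
    hits = {name: [] for name in MARKERS}
    for line in lines:
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


# A few lines copied from the post above.
sample = [
    "DC_NOP failed!",
    "Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method",
    "Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:16)",
    "Guest Log: [ERROR] Could not ping HTCondor.",
    "Guest Log: [INFO] Shutting Down.",
]
result = classify(sample)
for category, matched in result.items():
    print(category, len(matched))
```

Tasks whose logs show hits in both groups would be the ones matching the pattern reported here: an authentication failure followed by the VM giving up on HTCondor.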

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31190 - Posted: 29 Jun 2017, 23:00:16 UTC - in response to Message 31189. Does not look good. Something must be amiss with the certification server. ID: 31190 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1261 Credit: 92,451,605 RAC: 106,758	Message 31211 - Posted: 30 Jun 2017, 22:31:22 UTC - in response to Message 31190. Does not look good. Something must be amiss with the certification server. Well Ivan, earlier today I still got a few of those but also 15 Valids. Not sure why, and I never had that problem with the thousands of Theory tasks I have run on these same computers here (and over at -dev). I did run a handful of the 2-core version there too with no problem. Volunteer Mad Scientist For Life unbelievable are you trying to promote linux again? ID: 31211 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31212 - Posted: 1 Jul 2017, 0:48:47 UTC - in response to Message 31211. I'll have to let Laurence comment on that, it's in his bailiwick. ID: 31212 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1261 Credit: 92,451,605 RAC: 106,758	Message 31269 - Posted: 3 Jul 2017, 22:26:24 UTC - in response to Message 31212. I'll have to let Laurence comment on that, it's in his bailiwick. I had this same problem today, with 17 Errors on my CMS stats but about 45 Valids at the same time (several are shorter than usual tasks). ID: 31269 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31270 - Posted: 3 Jul 2017, 23:23:20 UTC - in response to Message 31269. But about 45 Valids at the same time (several are shorter than usual tasks) Possibly the shorter tasks were ones that were curtailed when I interrupted the queue because I'd submitted too many tasks at once. ID: 31270 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1261 Credit: 92,451,605 RAC: 106,758	Message 31275 - Posted: 4 Jul 2017, 6:32:40 UTC - in response to Message 31270. But about 45 Valids at the same time (several are shorter than usual tasks) Possibly the shorter tasks were ones that were curtailed when I interrupted the queue because I'd submitted too many tasks at once. Yes, that makes sense Ivan. I just checked the time on those tasks and they were all finished within 5 minutes of each other. Still having that "DC_NOP failed! AUTHENTICATE:1002:Failure performing handshake" and "recognized DC_NOP as command name, using command 60011. ERROR: couldn't locate (null)! [ERROR] Could not ping HTCondor." problem. I am going to take a look at the times for those, since I am running these CMS tasks on three PCs that are all the same (one has 24GB RAM and the other 2 have 16GB). I noticed two have IPv4 and IPv6 and one still says just IPv4 for some reason, but that isn't the problem, and all 3 are running on the new satellite IP. Download is fast and upload is pretty good too, but you know how servers can get when the data has to travel as far as it is from my dish to the CERN servers. BUT I have turned in 150 Valids in the last 4 days with these three 8-cores. ID: 31275 · Reply Quote

rbpeake Send message Joined: 17 Sep 04 Posts: 106 Credit: 36,549,147 RAC: 44	Message 31288 - Posted: 4 Jul 2017, 15:42:06 UTC This said Condor exited after 11117 seconds without running a job. https://lhcathome.cern.ch/lhcathome/result.php?resultid=150399989 Regards, Bob P. ID: 31288 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31289 - Posted: 4 Jul 2017, 16:35:33 UTC - in response to Message 31288. This said Condor exited after 11117 seconds without running a job. https://lhcathome.cern.ch/lhcathome/result.php?resultid=150399989 The HTCondor update was around 1300 UTC, so it looks like you got caught across it. I'm not expert enough to speculate exactly what happened; I've been assured that any jobs caught up like that will be resubmitted, but it's a pity you didn't get credit for your CPU time. ID: 31289 · Reply Quote

rbpeake Send message Joined: 17 Sep 04 Posts: 106 Credit: 36,549,147 RAC: 44	Message 31290 - Posted: 4 Jul 2017, 17:08:03 UTC - in response to Message 31289. This said Condor exited after 11117 seconds without running a job. https://lhcathome.cern.ch/lhcathome/result.php?resultid=150399989 The HTCondor update was around 1300 UTC, so it looks like you got caught across it. I'm not expert enough to speculate exactly what happened; I've been assured that any jobs caught up like that will be resubmitted, but it's a pity you didn't get credit for your CPU time. No problem, thanks for the explanation! Regards, Bob P. ID: 31290 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 58,257	Message 31355 - Posted: 11 Jul 2017, 11:42:51 UTC @Ivan Are you aware of the red peaks shown on the 2 graphs at the bottom of the cms_job page? ID: 31355 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31356 - Posted: 11 Jul 2017, 15:04:05 UTC - in response to Message 31355. @Ivan Are you aware of the red peaks shown on the 2 graphs at the bottom of the cms_job page? Hadn't seen that, thanks for pointing it out. Possibly correlates with a small perturbation in the proxy graph, so I'll put it down to a network disturbance, unless it continues. ID: 31356 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31410 - Posted: 15 Jul 2017, 17:52:23 UTC - in response to Message 31356. Hmm, another spike just now that I don't see an immediate cause for. I'll poke around my monitors and the CERN computing mailing lists. ID: 31410 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31412 - Posted: 15 Jul 2017, 19:10:02 UTC - in response to Message 31410. Hmm, another spike just now that I don't see an immediate cause for. I'll poke around my monitors and the CERN computing mailing lists. No clues yet. All my completed tasks seem to have terminated normally. ID: 31412 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,355,067 RAC: 104,371	Message 31413 - Posted: 15 Jul 2017, 19:15:38 UTC - in response to Message 31412. There seems to have been an issue with CMS tasks all day: on all 3 of the PCs on which I am crunching CMS, I notice that a running task does not use the CPU for extended periods of time. After such periods, it continues to work "normally". Maybe this is in the nature of the current CMS work units, or there is some problem with Condor, WMAgent, the network, or ... ??? ID: 31413 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31415 - Posted: 15 Jul 2017, 19:39:43 UTC - in response to Message 31413. Last modified: 15 Jul 2017, 19:44:06 UTC There seems to have been an issue with CMS tasks all day: on all 3 of the PCs on which I am crunching CMS, I notice that a running task does not use the CPU for extended periods of time. After such periods, it continues to work "normally". Maybe this is in the nature of the current CMS work units, or there is some problem with Condor, WMAgent, the network, or ... ??? The work units have been the same since last Christmas. This sounds like a network issue, although our proxy is not showing any unusual signs. The main suspect is a problem accessing the cvmfs repository, although we do also rely heavily on the conditions database (frontier), which I suspect uses its own proxies. I'll see if there's anything suspicious in your task logs. [Edit] Nothing to see yet; all your completed tasks finished before the anomaly began [/Edit] ID: 31415 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,355,067 RAC: 104,371	Message 31416 - Posted: 15 Jul 2017, 19:59:10 UTC - in response to Message 31415. Last modified: 15 Jul 2017, 20:05:11 UTC I'll see if there's anything suspicious in your task logs. [Edit] Nothing to see yet; all your completed tasks finished before the anomaly began [/Edit] Ivan, this task definitely shows an anomaly: https://lhcathome.cern.ch/lhcathome/result.php?resultid=150956525 Total runtime is almost 2 hours longer than CPU time, whereas normally the difference is only between 30 and 45 minutes. ID: 31416 · Reply Quote
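The comparison described above (run time minus CPU time, checked against the usual 30 to 45 minutes) can be sketched as follows. The 45-minute limit is taken from the post; the run and CPU times in the example are hypothetical figures, since the post does not give exact values, and the helper names are made up for illustration.

```python
from datetime import timedelta

# Upper end of the "between 30 and 45 minutes" range quoted in the post.
NORMAL_MAX_GAP = timedelta(minutes=45)


def idle_gap(run_time, cpu_time):
    """Wall-clock time during which the task was not using the CPU."""
    return run_time - cpu_time


def is_anomalous(run_time, cpu_time, limit=NORMAL_MAX_GAP):
    """True when the gap between run time and CPU time exceeds the limit."""
    return idle_gap(run_time, cpu_time) > limit


# Hypothetical figures: a gap of 1 h 55 min ("almost 2 hours") and one of 40 min.
suspect = (timedelta(hours=12, minutes=5), timedelta(hours=10, minutes=10))
typical = (timedelta(hours=12), timedelta(hours=11, minutes=20))

print(idle_gap(*suspect), is_anomalous(*suspect))
print(idle_gap(*typical), is_anomalous(*typical))
```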

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,600,972 RAC: 12,131	Message 31417 - Posted: 15 Jul 2017, 20:09:56 UTC - in response to Message 31416. I'll see if there's anything suspicious in your task logs. [Edit] Nothing to see yet; all your completed tasks finished before the anomaly began [/Edit] Ivan, this task definitely shows an anomaly: https://lhcathome.cern.ch/lhcathome/result.php?resultid=150956525 Total runtime is almost 2 hours longer than CPU time, whereas normally the difference is only between 30 and 45 minutes. OK, no real smoking gun yet. In one instance it took around seven minutes for a new job to start after the previous one finished -- the rest are around 10-30 seconds, so that's a bit suspicious. Any slack to take up the rest of the excessive extra time must have come during the actual job runtime. Since the red peaks in the Job Activity graphs are actually for failed jobs, I'm still looking for jobs (not necessarily tasks) that failed after about 1500 UTC today. ID: 31417 · Reply Quote
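The hand-over check described above (time from one job finishing to the next one starting) could be sketched like this. The timestamps are hypothetical, the one-minute threshold is an assumption chosen to sit above the 10-30 seconds quoted as normal, and `slow_handovers` is an illustrative helper rather than anything from the project's tooling.

```python
from datetime import datetime, timedelta


def slow_handovers(jobs, limit=timedelta(minutes=1)):
    """Given (start, end) pairs sorted by start time, return a list of
    (previous_end, gap) for every hand-over longer than `limit`."""
    slow = []
    for (_, prev_end), (next_start, _) in zip(jobs, jobs[1:]):
        gap = next_start - prev_end
        if gap > limit:
            slow.append((prev_end, gap))
    return slow


# Hypothetical job times within one task: one 20-second and one 7-minute hand-over.
jobs = [
    (datetime(2017, 7, 15, 14, 0, 0), datetime(2017, 7, 15, 15, 0, 0)),
    (datetime(2017, 7, 15, 15, 0, 20), datetime(2017, 7, 15, 16, 0, 0)),
    (datetime(2017, 7, 15, 16, 7, 0), datetime(2017, 7, 15, 17, 0, 0)),
]
flagged = slow_handovers(jobs)
for finished_at, gap in flagged:
    print(finished_at, gap)
```

Any remaining difference between run time and CPU time that is not explained by flagged hand-overs would, as the post says, have to come from idle periods inside the jobs themselves.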

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 58,257	Message 31418 - Posted: 15 Jul 2017, 20:20:23 UTC Last modified: 15 Jul 2017, 20:30:05 UTC It seems that there is an error in the stage-out phase. I saved stderr.log and stdout.log of my currently running WU. Let me know if they are of interest. <edit> My WU decided to finish its break while I was typing the message above. So, at the moment it is running normally. </edit> ID: 31418 · Reply Quote