Message boards : CMS Application : CMS Tasks Failing
Send message Joined: 24 Oct 04 Posts: 1173 Credit: 54,834,089 RAC: 16,184

Just took a look and I see I have 10 of these so far (5 Valids with the same PCs):

DC_NOP failed!
Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method
Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI :
Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:16) :
Guest Log: 06/29/17 19:45:43 recognized DC_NOP as command name, using command 60011.
Guest Log: 06/29/17 19:45:47 Condor GSI authentication failure :
Guest Log: GSS Major Status: Authentication Failed
Guest Log: GSS Minor Status Error Chain:
Guest Log: globus_gss_assist: Error during context initialization
Guest Log: OpenSSL Error: s3_clnt.c:1178: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
Guest Log: globus_gsi_callback_module: Could not
Guest Log: globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash c2a48ab6
Guest Log: 06/29/17 19:45:48 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.
Guest Log: [ERROR] Could not ping HTCondor.
Guest Log: [INFO] Shutting Down. VM Completion File Detected. VM Completion Message: Could not ping HTCondor.

Might have to suspend these if I see it continue right now.
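The key lines here are "certificate verify failed" and "Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain", which point at the VM's trusted-CA store rather than at Condor itself. As a rough illustration of that class of check, here is a minimal Python sketch that validates a server's certificate chain against the local CA bundle; the host and port are hypothetical placeholders, and the real HTCondor/GSI handshake is more involved than a plain TLS probe:

```python
# Minimal sketch: ask Python's ssl module to verify a server's certificate
# chain against the locally trusted CA store, the same class of check that
# fails in the guest log above. HOST and PORT are hypothetical placeholders.
import socket
import ssl

HOST = "collector.example.cern.ch"  # hypothetical collector address
PORT = 9618                         # HTCondor's default collector port

context = ssl.create_default_context()  # verifies against the system CA store
try:
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            print("Chain verified; peer subject:", tls.getpeercert()["subject"])
except ssl.SSLCertVerificationError as err:
    # An untrusted self-signed certificate anywhere in the chain lands here,
    # mirroring OpenSSL's "certificate verify failed" in the log.
    print("Verification failed:", err.verify_message)
```

A mismatch between the CA certificates shipped inside the VM image and the certificate actually presented by the server would produce this kind of failure, which is consistent with the "certification server" suspicion below.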
Send message Joined: 24 Oct 04 Posts: 1173 Credit: 54,834,089 RAC: 16,184

> Does not look good. Something must be amiss with the certification server.

Well Ivan, earlier today I still got a few of those, but also 15 Valids. Not sure why; I never had that problem with the thousands of Theory tasks I have run on these same computers here (and over at -dev). I did run a handful of the 2-core version there too with no problem.

Volunteer Mad Scientist For Life
Send message Joined: 24 Oct 04 Posts: 1173 Credit: 54,834,089 RAC: 16,184

> I'll have to let Laurence comment on that, it's in his bailiwick.

I had this same problem today with 17 Errors in my CMS stats, but about 45 Valids at the same time (several are shorter-than-usual tasks).
Send message Joined: 24 Oct 04 Posts: 1173 Credit: 54,834,089 RAC: 16,184

> But about 45 Valids at the same time (several are shorter-than-usual tasks)

Yes, that makes sense, Ivan. I just checked the times on those tasks and they were all finished within 5 minutes of each other.

Still having that problem:

DC_NOP failed!
AUTHENTICATE:1002:Failure performing handshake
recognized DC_NOP as command name, using command 60011.
ERROR: couldn't locate (null)!
[ERROR] Could not ping HTCondor.

I am going to take a look at the times for those, since I am running these CMS tasks on three PCs that are all the same (one does have 24 GB RAM, the other two have 16 GB). I noticed two have IPv4 and IPv6 and one still says just IPv4 for some reason, but that isn't the problem; all three are running on the new satellite IP. Download is fast and upload is pretty good too, but you know how servers can get when they have to travel as far as it is from my dish to the CERN servers. BUT I have turned in 150 Valids in the last 4 days with these three 8-cores.

Volunteer Mad Scientist For Life
Send message Joined: 17 Sep 04 Posts: 105 Credit: 32,824,853 RAC: 389

This said "Condor exited after 11117 seconds without running a job":
https://lhcathome.cern.ch/lhcathome/result.php?resultid=150399989

Regards, Bob P.
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317

> This said "Condor exited after 11117 seconds without running a job."

The HTCondor update was around 1300 UTC, so it looks like you got caught across it. I'm not expert enough to speculate exactly what happened; I've been assured that any jobs caught up like that will be resubmitted, but it's a pity you didn't get credit for your CPU time.
Send message Joined: 17 Sep 04 Posts: 105 Credit: 32,824,853 RAC: 389

> This said "Condor exited after 11117 seconds without running a job."

No problem, thanks for the explanation!

Regards, Bob P.
Send message Joined: 15 Jun 08 Posts: 2534 Credit: 253,880,596 RAC: 39,051

@Ivan
Are you aware of the read peaks shown on the 2 graphs at the bottom of the cms_job page?
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317

> @Ivan
> Are you aware of the read peaks shown on the 2 graphs at the bottom of the cms_job page?

Hadn't seen that, thanks for pointing it out. It possibly correlates with a small perturbation in the proxy graph, so I'll put it down to a network disturbance unless it continues.
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,401 RAC: 31,506

There seems to be an issue with CMS tasks all day: on all 3 PCs on which I am crunching CMS, I notice that a running task does not use the CPU for extended periods of time. After such periods, it continues to work "normally". Maybe this is in the nature of the current CMS work units, or there is some problem with Condor, WMAgent, the network, or ... ???
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317

> There seems to be an issue with CMS tasks all day: on all 3 PCs on which I am crunching CMS, I notice that a running task does not use the CPU for extended periods of time.

The work units have been the same since last Christmas. This sounds like a network issue, although our proxy is not showing any unusual signs. The main suspect is a problem accessing the CVMFS repository, although we do also rely heavily on the conditions database (Frontier), which I suspect uses its own proxies. I'll see if there's anything suspicious in your task logs.

[Edit] Nothing to see yet; all your completed tasks finished before the anomaly began. [/Edit]
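For volunteers who want to quantify those idle stretches, a small host-side monitor along these lines could log them. This is a hedged sketch, not anything the project ships: it assumes the psutil package is installed and that the task's VM shows up as a VirtualBox "VBoxHeadless" process, both assumptions about the local setup.

```python
# Hedged sketch: sample the VM process's CPU usage from the host and log
# stretches where it sits below a threshold. PROC_NAME and IDLE_THRESHOLD
# are assumptions about a typical BOINC/VirtualBox setup.
import time
import psutil

PROC_NAME = "VBoxHeadless"  # hypothetical: the VirtualBox VM process name
IDLE_THRESHOLD = 5.0        # percent CPU treated as "not working"

candidates = [p for p in psutil.process_iter(["name"])
              if p.info["name"] == PROC_NAME]
if not candidates:
    raise SystemExit(f"no {PROC_NAME} process found")

proc = candidates[0]
idle_since = None
while proc.is_running():
    cpu = proc.cpu_percent(interval=30)  # average over a 30 s window
    if cpu < IDLE_THRESHOLD and idle_since is None:
        idle_since = time.time()  # an idle stretch begins
    elif cpu >= IDLE_THRESHOLD and idle_since is not None:
        print(f"idle stretch of {time.time() - idle_since:.0f} s ended")
        idle_since = None
```

Timestamped output like this would make it easier to correlate the stalls with the proxy, CVMFS, or Frontier graphs on the server side.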
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,509,401 RAC: 31,506

> I'll see if there's anything suspicious in your task logs.

Ivan, this task definitely shows an anomaly:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=150956525
Total runtime is almost 2 hours longer than CPU time, whereas normally the difference is only between 30 and 45 minutes.
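To put that gap in perspective, the runtime/CPU-time difference can be read as a CPU-efficiency figure. A quick sketch with illustrative placeholder numbers (not the exact values from the linked task):

```python
# Illustrative only: express the runtime/CPU-time gap as CPU efficiency.
# The figures below are placeholders shaped like the report above.
runtime_s = 8 * 3600                 # assumed ~8 h wall-clock
cpu_time_s = runtime_s - 2 * 3600    # CPU time almost 2 h less, as reported
gap_min = (runtime_s - cpu_time_s) / 60
print(f"efficiency {cpu_time_s / runtime_s:.0%}, gap {gap_min:.0f} min")
# A normal task here shows only a 30-45 min gap, i.e. much higher efficiency.
```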
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317

> I'll see if there's anything suspicious in your task logs.

OK, no real smoking gun yet. In one instance it took around seven minutes for a new job to start after the previous one finished; the rest are around 10-30 seconds, so that's a bit suspicious. Any slack to take up the rest of the excessive extra time must have come during the actual job runtime. Since the red peaks in the Job Activity graphs are actually for failed jobs, I'm still looking for jobs (not necessarily tasks) that failed after about 1500 UTC today.
Send message Joined: 15 Jun 08 Posts: 2534 Credit: 253,880,596 RAC: 39,051

It seems that there is an error in the stage-out phase. I saved stderr.log and stdout.log of my currently running WU. Let me know if they are of interest.

<edit> My WU decided to finish its break while I was typing the message above, so at the moment it is running normally. </edit>