Message boards : CMS Application : CMS Tasks Failing
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30755 - Posted: 12 Jun 2017, 9:48:57 UTC - in response to Message 30750.  

Problem understood. I'll leave it to those more closely involved to explain. :-0!
ID: 30755 · Report as offensive     Reply Quote
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,502,974
RAC: 4,007
Message 31189 - Posted: 29 Jun 2017, 21:04:27 UTC

Just took a look and I see I have 10 of these so far (5 Valids with the same PCs):

DC_NOP failed!
Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method
Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI
Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (655360:16)
Guest Log: 06/29/17 19:45:43 recognized DC_NOP as command name, using command 60011.
Guest Log: 06/29/17 19:45:47 Condor GSI authentication failure
Guest Log: GSS Major Status: Authentication Failed
Guest Log: GSS Minor Status Error Chain:
Guest Log: globus_gss_assist: Error during context initialization
Guest Log: OpenSSL Error: s3_clnt.c:1178: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
Guest Log: globus_gsi_callback_module: Could not Guest Log: globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash c2a48ab6
Guest Log: 06/29/17 19:45:48 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.
Guest Log: [ERROR] Could not ping HTCondor.
Guest Log: [INFO] Shutting Down.
VM Completion File Detected.
VM Completion Message: Could not ping HTCondor.

Might have to suspend these right now if I see it continue.
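For anyone who wants to sift their own task output for this failure, here is a minimal Python sketch. It assumes the log is available as plain text; the marker strings are copied from the excerpt above, while the function name `find_auth_failures` is purely illustrative and not part of BOINC or HTCondor.

```python
# Marker substrings copied from the log excerpt above. The function name and
# the idea of scanning the text this way are illustrative assumptions.
MARKERS = (
    "DC_NOP failed!",
    "Failed to authenticate",
    "certificate verify failed",
    "Could not ping HTCondor",
)


def find_auth_failures(log_text):
    """Return (line_number, line) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.strip()))
    return hits


# A few lines taken from the excerpt above, used as sample input.
sample = """DC_NOP failed!
Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method
Guest Log: [INFO] Shutting Down.
Guest Log: [ERROR] Could not ping HTCondor."""

for number, line in find_auth_failures(sample):
    print(number, line)
```

Counting the hits per task would make it easier to compare the failing tasks against the Valids from the same machines.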
ID: 31189 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31190 - Posted: 29 Jun 2017, 23:00:16 UTC - in response to Message 31189.  

Does not look good. Something must be amiss with the certification server.
ID: 31190 · Report as offensive     Reply Quote
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,502,974
RAC: 4,007
Message 31211 - Posted: 30 Jun 2017, 22:31:22 UTC - in response to Message 31190.  

Does not look good. Something must be amiss with the certification server.


Well Ivan, earlier today I still got a few of those but also 15 Valids

Not sure why and I never had that problem with the thousands of Theory tasks I have run on these same computers here (and over at -dev)

I did run a handful of the 2-core version there too with no problem.
Volunteer Mad Scientist For Life
ID: 31211 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31212 - Posted: 1 Jul 2017, 0:48:47 UTC - in response to Message 31211.  

I'll have to let Laurence comment on that, it's in his bailiwick.
ID: 31212 · Report as offensive     Reply Quote
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,502,974
RAC: 4,007
Message 31269 - Posted: 3 Jul 2017, 22:26:24 UTC - in response to Message 31212.  

I'll have to let Laurence comment on that, it's in his bailiwick.


I had this same problem today with 17 Errors on my CMS stats

But about 45 Valids at the same time (several are shorter than usual tasks)
ID: 31269 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31270 - Posted: 3 Jul 2017, 23:23:20 UTC - in response to Message 31269.  

But about 45 Valids at the same time (several are shorter than usual tasks)

Possibly the shorter tasks were ones that were curtailed when I interrupted the queue because I'd submitted too many tasks at once.
ID: 31270 · Report as offensive     Reply Quote
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,502,974
RAC: 4,007
Message 31275 - Posted: 4 Jul 2017, 6:32:40 UTC - in response to Message 31270.  

But about 45 Valids at the same time (several are shorter than usual tasks)

Possibly the shorter tasks were ones that were curtailed when I interrupted the queue because I'd submitted too many tasks at once.


Yes, that makes sense, Ivan.
I just checked the time on those tasks and they were all finished within 5 minutes of each other.

Still having that "DC_NOP failed!" problem, though:
AUTHENTICATE:1002:Failure performing handshake and recognized DC_NOP as command name, using command 60011.
ERROR: couldn't locate (null)!
[ERROR] Could not ping HTCondor.

I am going to take a look at the times for those, since I am running these CMS tasks on three PCs that are otherwise identical (one has 24 GB of RAM and the other two have 16 GB).

I noticed two have IPv4 and IPv6 and one still says just IPv4 for some reason, but that isn't the problem, and all 3 are running on the new satellite IP.

Download is fast and upload is pretty good too, but you know how things can get when the data has to travel as far as it does from my dish to the CERN servers.

BUT I have turned in 150 Valids in the last 4 days with these three 8-cores
Volunteer Mad Scientist For Life
ID: 31275 · Report as offensive     Reply Quote
rbpeake
Joined: 17 Sep 04
Posts: 99
Credit: 30,618,118
RAC: 3,938
Message 31288 - Posted: 4 Jul 2017, 15:42:06 UTC

This said Condor exited after 11117 seconds without running a job.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=150399989
Regards,
Bob P.
ID: 31288 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31289 - Posted: 4 Jul 2017, 16:35:33 UTC - in response to Message 31288.  

This said Condor exited after 11117 seconds without running a job.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=150399989

The HTCondor update was around 1300 UTC, so it looks like you got caught across it. I'm not expert enough to speculate exactly what happened; I've been assured that any jobs caught up like that will be resubmitted, but it's a pity you didn't get credit for your CPU time.
ID: 31289 · Report as offensive     Reply Quote
rbpeake
Joined: 17 Sep 04
Posts: 99
Credit: 30,618,118
RAC: 3,938
Message 31290 - Posted: 4 Jul 2017, 17:08:03 UTC - in response to Message 31289.  

This said Condor exited after 11117 seconds without running a job.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=150399989

The HTCondor update was around 1300 UTC, so it looks like you got caught across it. I'm not expert enough to speculate exactly what happened; I've been assured that any jobs caught up like that will be resubmitted, but it's a pity you didn't get credit for your CPU time.

No problem, thanks for the explanation!
Regards,
Bob P.
ID: 31290 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,950,574
RAC: 137,250
Message 31355 - Posted: 11 Jul 2017, 11:42:51 UTC

@Ivan
Are you aware of the red peaks shown on the 2 graphs at the bottom of the cms_job page?
ID: 31355 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31356 - Posted: 11 Jul 2017, 15:04:05 UTC - in response to Message 31355.  

@Ivan
Are you aware of the red peaks shown on the 2 graphs at the bottom of the cms_job page?

Hadn't seen that, thanks for pointing it out. Possibly correlates with a small perturbation in the proxy graph, so I'll put it down to a network disturbance, unless it continues.
ID: 31356 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31410 - Posted: 15 Jul 2017, 17:52:23 UTC - in response to Message 31356.  

Hmm, another spike just now that I don't see an immediate cause for. I'll poke around my monitors and the CERN computing mailing lists.
ID: 31410 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31412 - Posted: 15 Jul 2017, 19:10:02 UTC - in response to Message 31410.  

Hmm, another spike just now that I don't see an immediate cause for. I'll poke around my monitors and the CERN computing mailing lists.

No clues yet. All my completed tasks seem to have terminated normally.
ID: 31412 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,390,089
RAC: 102,176
Message 31413 - Posted: 15 Jul 2017, 19:15:38 UTC - in response to Message 31412.  

There seems to have been an issue with CMS tasks all day: on all 3 PCs on which I am crunching CMS, I notice that a running task does not use the CPU for extended periods of time.
After such periods, it continues to work "normally".

Maybe this is in the nature of the current CMS work units, or there is some problem with Condor, WMAgent, the network, or ... ???
ID: 31413 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31415 - Posted: 15 Jul 2017, 19:39:43 UTC - in response to Message 31413.  
Last modified: 15 Jul 2017, 19:44:06 UTC

There seems to have been an issue with CMS tasks all day: on all 3 PCs on which I am crunching CMS, I notice that a running task does not use the CPU for extended periods of time.
After such periods, it continues to work "normally".

Maybe this is in the nature of the current CMS work units, or there is some problem with Condor, WMAgent, the network, or ... ???

The work units have been the same since last Christmas. This sounds like a network issue, although our proxy is not showing any unusual signs. Main suspect is a problem accessing the cvmfs repository, although we do also rely heavily on the conditions database (frontier) which I suspect uses its own proxies. I'll see if there's anything suspicious in your task logs.
[Edit] Nothing to see yet; all your completed tasks finished before the anomaly began [/Edit]
ID: 31415 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,390,089
RAC: 102,176
Message 31416 - Posted: 15 Jul 2017, 19:59:10 UTC - in response to Message 31415.  
Last modified: 15 Jul 2017, 20:05:11 UTC

I'll see if there's anything suspicious in your task logs.
[Edit] Nothing to see yet; all your completed tasks finished before the anomaly began [/Edit]

Ivan, this task definitely shows an anomaly: https://lhcathome.cern.ch/lhcathome/result.php?resultid=150956525

Total runtime is almost 2 hours longer than CPU time, whereas normally the difference is only between 30 and 45 minutes.
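To put a number on that kind of comparison, here is a rough Python sketch. It takes a task's run time and CPU time in seconds and flags the task when the difference exceeds the upper end of the usual 30 to 45 minutes. It assumes a single-core task (so CPU time never exceeds run time), and the example figures are made up for illustration rather than taken from the linked result.

```python
# Upper end of the "normal" 30-45 minute gap quoted above, in seconds.
NORMAL_GAP_MAX_S = 45 * 60


def idle_gap_seconds(run_time_s, cpu_time_s):
    """Wall-clock time during which the task was not using the CPU."""
    return run_time_s - cpu_time_s


def looks_anomalous(run_time_s, cpu_time_s, limit_s=NORMAL_GAP_MAX_S):
    """True if the run time exceeds the CPU time by more than the limit."""
    return idle_gap_seconds(run_time_s, cpu_time_s) > limit_s


# Hypothetical figures: a 12-hour run with 10 hours 5 minutes of CPU time.
run_time = 12 * 3600
cpu_time = 10 * 3600 + 5 * 60
print(idle_gap_seconds(run_time, cpu_time) / 60, "minutes idle")
print("anomalous:", looks_anomalous(run_time, cpu_time))
```

The 45-minute limit is only the rule of thumb from this post; a different host or network may need a different threshold.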
ID: 31416 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 31417 - Posted: 15 Jul 2017, 20:09:56 UTC - in response to Message 31416.  

I'll see if there's anything suspicious in your task logs.
[Edit] Nothing to see yet; all your completed tasks finished before the anomaly began [/Edit]

Ivan, this task definitely shows an anomaly: https://lhcathome.cern.ch/lhcathome/result.php?resultid=150956525

Total runtime is almost 2 hours longer than CPU time, whereas normally the difference is only between 30 and 45 minutes.

OK, no real smoking gun yet. In one instance it took around seven minutes for a new job to start after the previous one finished -- the rest are around 10-30 seconds, so that's a bit suspicious. Any slack to take up the rest of the excessive extra time must have come during the actual job runtime. Since the red peaks in the Job Activity graphs are actually for failed jobs, I'm still looking for jobs (not necessarily tasks) that failed after about 1500 UTC today.
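A check like the one described here can be sketched in Python. The sketch assumes each job's start and end times have already been pulled out of the task log as `datetime` pairs; the function name, the timestamps, and the 60-second threshold are all invented for illustration.

```python
from datetime import datetime, timedelta


def slow_handovers(jobs, threshold=timedelta(seconds=60)):
    """Return (index, gap) for each job that started more than
    `threshold` after the previous job ended.

    `jobs` is a list of (start, end) datetime pairs in run order.
    """
    slow = []
    for index in range(1, len(jobs)):
        gap = jobs[index][0] - jobs[index - 1][1]
        if gap > threshold:
            slow.append((index, gap))
    return slow


# Hypothetical timestamps: the first handover takes 20 seconds,
# the second takes seven minutes.
day = datetime(2017, 7, 15)
jobs = [
    (day.replace(hour=14, minute=0), day.replace(hour=15, minute=0)),
    (day.replace(hour=15, minute=0, second=20), day.replace(hour=16, minute=0)),
    (day.replace(hour=16, minute=7), day.replace(hour=17, minute=0)),
]

for index, gap in slow_handovers(jobs):
    print(f"job {index} started {gap} after the previous one finished")
```

Summing the flagged gaps shows how much of the extra wall-clock time is spent between jobs; whatever is left over must have been lost during the job runtime itself.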
ID: 31417 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,950,574
RAC: 137,250
Message 31418 - Posted: 15 Jul 2017, 20:20:23 UTC
Last modified: 15 Jul 2017, 20:30:05 UTC

It seems that there is an error in the stage-out phase.
I saved stderr.log and stdout.log of my currently running WU.
Let me know if they are of interest.

<edit>
My WU decided to finish its break while I was typing the message above.
So, at the moment it is running normally.
</edit>
ID: 31418 · Report as offensive     Reply Quote