Message boards : CMS Application : CMS Tasks Failing

Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 103,078,694
RAC: 126,782
Message 31490 - Posted: 18 Jul 2017, 17:26:32 UTC - in response to Message 31483.  

I have started one task each on two of my PCs. So far, everything seems fine.
Yet I'll wait for a while before I start CMS on my main PC.

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 31491 - Posted: 18 Jul 2017, 17:53:43 UTC - in response to Message 31490.  

I've had a couple of jobs finish, but it will be 12-18 hours before any tasks finish. That's the point when I'll be able to stop holding my breath. I haven't heard yet exactly what the problem was; I'll try to pass it on if I do.

ivan
Message 31492 - Posted: 18 Jul 2017, 18:58:37 UTC - in response to Message 31491.  

> I've had a couple of jobs finish, but it will be 12-18 hours before any tasks finish. That's the point when I'll be able to stop holding my breath. I haven't heard yet exactly what the problem was; I'll try to pass it on if I do.

I ... do not believe it! The problem apparently was that some[one|thing] closed the HTTP and HTTPS ports to the Data Bridge in the firewall!!! Why would anyone do that over a weekend? Investigations are continuing.

Erich56
Message 31493 - Posted: 19 Jul 2017, 5:10:30 UTC

But the WallClock Consumption Dashboard again looks strange, doesn't it?

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,492,250
RAC: 124,690
Message 31494 - Posted: 19 Jul 2017, 6:09:35 UTC

This morning it doesn't look like a typical firewall problem.
The data bridge is vc-cms-output.cs3.cern.ch.

dig vc-cms-output.cs3.cern.ch. +short
cs3.cern.ch.
188.185.79.228
188.184.86.186
188.184.81.67
188.184.83.125
188.185.70.64
188.184.87.159
188.184.94.97
188.184.95.16
188.185.79.29
188.184.95.79

dig cs3.cern.ch. +short |xargs -n1 -I {} nc -z -v -w 5 {} 80
Connection to 188.185.79.228 80 port [tcp/http] succeeded!
Connection to 188.184.83.125 80 port [tcp/http] succeeded!
Connection to 188.185.79.29 80 port [tcp/http] succeeded!
Connection to 188.184.87.159 80 port [tcp/http] succeeded!
Connection to 188.184.94.97 80 port [tcp/http] succeeded!
Connection to 188.184.95.79 80 port [tcp/http] succeeded!
Connection to 188.184.95.16 80 port [tcp/http] succeeded!
Connection to 188.185.70.64 80 port [tcp/http] succeeded!
Connection to 188.184.86.186 80 port [tcp/http] succeeded!
Connection to 188.184.81.67 80 port [tcp/http] succeeded!

dig cs3.cern.ch. +short |xargs -n1 -I {} nc -z -v -w 5 {} 443
Connection to 188.184.95.16 443 port [tcp/https] succeeded!
Connection to 188.185.79.228 443 port [tcp/https] succeeded!
Connection to 188.184.87.159 443 port [tcp/https] succeeded!
Connection to 188.184.94.97 443 port [tcp/https] succeeded!
Connection to 188.184.86.186 443 port [tcp/https] succeeded!
Connection to 188.184.95.79 443 port [tcp/https] succeeded!
Connection to 188.184.83.125 443 port [tcp/https] succeeded!
Connection to 188.184.81.67 443 port [tcp/https] succeeded!
Connection to 188.185.70.64 443 port [tcp/https] succeeded!
Connection to 188.185.79.29 443 port [tcp/https] succeeded!
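
For anyone repeating these checks, the per-address lines can be collapsed into a simple tally. This is only a sketch: `summarize_nc` is a made-up helper name, and the failure patterns assume OpenBSD-style `nc` messages like the ones shown above. Other netcat variants word their output differently, so the patterns may need adjusting.

```shell
# Tally `nc -z -v` results read from stdin.
# Assumption: successes contain "succeeded!" (as in the output above) and
# failures contain "failed", "timed out" or "refused".
summarize_nc() {
    awk '
        /succeeded!/               { ok++; next }
        /failed|timed out|refused/ { bad++ }
        END { printf "ok=%d failed=%d\n", ok, bad }
    '
}

# Usage (needs network access); 2>&1 merges stderr into the pipe in case
# this nc prints its verbose messages there:
#   dig cs3.cern.ch. +short | xargs -I {} nc -z -v -w 5 {} 443 2>&1 | summarize_nc
```

With all ten addresses reachable, as in the runs above, the tally would show zero failures for each port.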



At least this line (which occurs several times) in stderr.log/stdout.log indicates an error in the script:
/var/lib/condor/execute/dir_6776/startup_environment.sh: line 52: syntax error near unexpected token `('
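
The log line alone does not say what is on line 52, and the script itself is not shown in this thread. One generic way to reproduce this class of error without executing anything is bash's `-n` (noexec) option, which only parses the file. A minimal sketch follows; `check_syntax` is a hypothetical helper name, not part of any CMS or HTCondor tooling.

```shell
# Parse a script without running it; print bash's diagnostics on failure.
check_syntax() {
    script=$1
    if err=$(bash -n "$script" 2>&1); then
        echo "syntax ok: $script"
    else
        echo "$err"
        return 1
    fi
}

# Usage: check_syntax /path/to/startup_environment.sh
```

As an illustration only (not the confirmed cause here): a line such as `export FOO=bar(baz)` makes `bash -n` report a syntax error near the unexpected token `(`, whereas the quoted form `export FOO="bar(baz)"` parses cleanly.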



In addition there was a very unusual transient DNS error regarding a Theory WU last night.
See:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=151633754
2017-07-18 22:10:04 (14553): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2017-07-18 22:10:04 (14553): Guest Log: [DEBUG] nc: getaddrinfo: Temporary failure in name resolution
2017-07-18 22:10:04 (14553): Guest Log: [DEBUG] 1
2017-07-18 22:10:04 (14553): Guest Log: [ERROR] Could not connect to lhchomeproxy.cern.ch on port 3125
2017-07-18 22:10:04 (14553): Guest Log: [INFO] Shutting Down.
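
The log shows the connection test giving up after a single failed lookup. Whether the real bootstrap script could or should retry is not something the log tells us; the following merely sketches how a transient failure of this kind might be ridden out. `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage (needs network access):
#   retry 3 5 nc -z -w 5 lhchomeproxy.cern.ch 3125
```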

ivan
Message 31496 - Posted: 19 Jul 2017, 9:50:33 UTC - in response to Message 31494.  
Last modified: 19 Jul 2017, 9:50:44 UTC

> At least this line (which occurs several times) in stderr.log/stdout.log indicates an error in the script:
> /var/lib/condor/execute/dir_6776/startup_environment.sh: line 52: syntax error near unexpected token `('

I mentioned that to Laurence in an email on Saturday, but I suspect he's on holiday or otherwise discommoded.

ivan
Message 31497 - Posted: 19 Jul 2017, 9:55:15 UTC - in response to Message 31494.  

I'm not sure what's going on. It looks like stage-out problems again but LHC@Home assures me that the firewall hasn't been tampered with again. All my tasks are running OK, so unless and until I get a log-file of a failed stage-out I'm just running around waving my finger in the air and guessing.
Excuse me for a little while, I have a cryo-magnet to fill with liquid nitrogen...

ivan
Message 31498 - Posted: 19 Jul 2017, 12:08:11 UTC - in response to Message 31497.  

Sorted, I think. My fault: Laurence's cluster uses my credentials, and I misjudged when I needed to create a new proxy (the warning messages start about four days out and give the remaining lifetime in seconds...), so the old one expired. I just copied across a new 9-day proxy, so his jobs should stage out OK from now on.
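
A lifetime quoted in raw seconds is easy to misjudge, so a tiny converter can help. This is a sketch only: `human_lifetime` is a hypothetical helper, and the source of the number is an assumption (for example `voms-proxy-info --timeleft`, if that happens to be the tool in use; the thread does not say).

```shell
# Convert a remaining lifetime in seconds into days, hours and minutes.
human_lifetime() {
    s=$1
    printf '%dd %dh %dm\n' $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60))
}

human_lifetime 345600   # 4d 0h 0m  (roughly when the warnings start)
human_lifetime 777600   # 9d 0h 0m  (a fresh 9-day proxy)
```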

Erich56
Message 31507 - Posted: 19 Jul 2017, 19:26:27 UTC - in response to Message 31498.  

Thanks, Ivan; everything seems to run okay now :-)

ivan
Message 31509 - Posted: 19 Jul 2017, 20:02:54 UTC - in response to Message 31507.  

> Thanks, Ivan; everything seems to run okay now :-)

Looks better, yes. It would have only affected Laurence's cluster (and my lone CERN Openstack VM), not any of youse guys' results.

computezrmle
Message 31511 - Posted: 19 Jul 2017, 20:19:33 UTC - in response to Message 31509.  

> It would have only affected Laurence's cluster

The ratio red:green during the last few hours was roughly 2:1.
If the red part shows the contribution of Laurence's cluster and the green part that of all other users ... very impressive.
:-)

ivan
Message 31512 - Posted: 19 Jul 2017, 20:55:58 UTC - in response to Message 31511.  

> > It would have only affected Laurence's cluster
>
> The ratio red:green during the last few hours was roughly 2:1.
> If the red part shows the contribution of Laurence's cluster and the green part that of all other users ... very impressive.
> :-)

Last I heard, he had activated 500 VMs... CERN VMs, nominally assigned to CMS, otherwise unused.

ivan
Message 31532 - Posted: 21 Jul 2017, 8:24:49 UTC

My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so...

ivan
Message 31535 - Posted: 21 Jul 2017, 9:00:39 UTC - in response to Message 31532.  

> My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so...

Looks like a network disturbance. The number of running jobs has fallen, but stabilised. I'll keep my eye on it.

ivan
Message 31544 - Posted: 21 Jul 2017, 22:35:56 UTC - in response to Message 31535.  

> > My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so...
>
> Looks like a network disturbance. The number of running jobs has fallen, but stabilised. I'll keep my eye on it.

I just noticed that the WMAgent developed a problem at 10:28 UTC, but that was long after the disturbances I saw above, so it's not something that affects our operation unduly. The usual people have been notified; it's unlikely to get fixed tonight, and given the holiday situation in Europe I may have to chase it up through other channels on Monday.

ivan
Message 31552 - Posted: 22 Jul 2017, 10:17:48 UTC - in response to Message 31535.  

> > My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so...
>
> Looks like a network disturbance. The number of running jobs has fallen, but stabilised. I'll keep my eye on it.

It looks like some sort of 24-hour back-off: we've recovered the number of jobs we lost yesterday. Perhaps machines had their quotas cut after too many "bad" tasks?

ivan
Message 31818 - Posted: 5 Aug 2017, 4:29:12 UTC

Warning: WMAgent appears to have had a component fail very recently, and the queue seems to be exhausted. Please set No New Tasks or switch to a backup app while I try to raise someone at CERN to fix it. That could be a problem, given that this is expected to be the heaviest weekend of the year for holiday travel in Europe...

Erich56
Message 31820 - Posted: 5 Aug 2017, 6:56:23 UTC - in response to Message 31818.  

> Warning: WMAgent appears to have had a component fail very recently.

Hm, obviously the new version of the WMAgent fails just as often as the previous version :-(

ivan
Message 31822 - Posted: 5 Aug 2017, 7:10:22 UTC - in response to Message 31820.  

> > Warning: WMAgent appears to have had a component fail very recently.
>
> Hm, obviously the new version of the WMAgent fails just as often as the previous version :-(

Well. I can't comment, as I'm not on the development team, but fortuitously not only was I up early enough to notice the failure as it happened, but Alan was also awake and able to fix the problem within minutes! I'd give us 10/10 for responsiveness, YMMV.

computezrmle
Message 31824 - Posted: 5 Aug 2017, 8:20:19 UTC - in response to Message 31822.  

> I'd give us 10/10 for responsiveness ...

+1
... and 10/10 for communication and user information.
:-)