Message boards : CMS Application : CMS Tasks Failing
Joined: 18 Dec 15 · Posts: 1783 · Credit: 116,948,638 · RAC: 67,863

I have started one task each on two of my PCs. So far, everything seems fine. Still, I'll wait for a while before I start CMS on my main PC.
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

I've had a couple of jobs finish, but it will be 12-18 hours before any tasks finish. That's the point when I'll be able to stop holding my breath. I haven't heard yet exactly what the problem was; I'll try to pass it on if I do.

I... do not believe it! The problem apparently was that some[one|thing] closed the http and https ports to the Data Bridge in the firewall!!! Why would anyone do that over a weekend? Investigations are continuing.
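The triage the posts above walk through (check DNS first, then the ports) can be sketched as a small shell helper. Everything here is illustrative rather than LHC@Home tooling; in practice the two inputs would come from the results of `dig vc-cms-output.cs3.cern.ch +short` and a port probe such as `nc -z -w5 vc-cms-output.cs3.cern.ch 443`:

```shell
#!/bin/sh
# Illustrative triage: given whether DNS resolved ("yes"/"no") and
# whether the http/https port answered ("yes"/"no"), name the likely
# failure. The classify function and its inputs are hypothetical.
classify() {
  dns_ok=$1
  port_ok=$2
  if [ "$dns_ok" != yes ]; then
    echo "DNS problem"
  elif [ "$port_ok" != yes ]; then
    echo "port blocked (firewall?)"
  else
    echo "connectivity OK"
  fi
}

# The situation in this thread: DNS was fine, the ports were closed.
classify yes no
```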
Joined: 18 Dec 15 · Posts: 1783 · Credit: 116,948,638 · RAC: 67,863

But the WallClock Consumption Dashboard again looks strange, doesn't it?
Joined: 15 Jun 08 · Posts: 2520 · Credit: 251,306,838 · RAC: 120,310

This morning it doesn't look like a typical firewall problem. The data bridge is vc-cms-output.cs3.cern.ch:
dig vc-cms-output.cs3.cern.ch +short

At least this line (it occurs several times) in stderr.log/stdout.log points to an error in the script:
/var/lib/condor/execute/dir_6776/startup_environment.sh: line 52: syntax error near unexpected token `('

In addition, there was a very unusual transient DNS error regarding a Theory WU last night. See:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=151633754
2017-07-18 22:10:04 (14553): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

"At least this line (it occurs several times) in stderr.log/stdout.log points to an error in the script"
I mentioned that to Laurence in an email on Saturday, but I suspect he's on holiday or otherwise discommoded.
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

I'm not sure what's going on. It looks like stage-out problems again, but LHC@Home assures me that the firewall hasn't been tampered with again. All my tasks are running OK, so unless and until I get a log file of a failed stage-out I'm just running around waving my finger in the air and guessing. Excuse me for a little while, I have a cryo-magnet to fill with liquid nitrogen...
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

Sorted, I think. My fault: Laurence's cluster uses my credentials, and I misjudged when I needed to create a new proxy (the warning messages start about four days out and give the remaining lifetime in seconds...), so the old one expired. I've just copied across a new 9-day proxy, so his jobs should stage out OK from now on.
Joined: 18 Dec 15 · Posts: 1783 · Credit: 116,948,638 · RAC: 67,863

Thanks, Ivan; everything seems to run okay now :-)
Joined: 15 Jun 08 · Posts: 2520 · Credit: 251,306,838 · RAC: 120,310

"It would have only affected Laurence's cluster"
The ratio red:green during the last few hours was roughly 2:1. If the red part shows the contribution of Laurence's cluster and the green part that of all other users... very impressive. :-)
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

"It would have only affected Laurence's cluster"
Last I heard, he had activated 500 VMs... CERN VMs, nominally assigned to CMS, otherwise unused.
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so...

I just noticed that the WMAgent developed a problem at 10:28 UTC, but that was long after the disturbances I saw above, so it's not something that affects our operation unduly. The usual people have been notified; it's unlikely to get fixed tonight, and given the holiday situation in Europe I may have to chase it up through other channels on Monday.
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

"My monitors are showing what might be just a glitch, but might also be the start of another disturbance."
Some sort of 24-hour back-off; we've recovered the number of jobs we lost yesterday. Machines having their quotas cut due to too many "bad" tasks?
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

Warning: WMAgent appears to have a failed component very recently. The queue seems to be exhausted. Please set No New Tasks or switch to a backup app while I try to raise someone at CERN to fix it. This could be a problem, given that this is expected to be the heaviest weekend of the year for holiday travel in Europe...
Joined: 18 Dec 15 · Posts: 1783 · Credit: 116,948,638 · RAC: 67,863

"Warning: WMAgent appears to have a failed component very recently."
Hm; obviously the new version of the WMAgent fails just as often as the previous version :-(
Joined: 29 Aug 05 · Posts: 1054 · Credit: 7,633,643 · RAC: 6,574

"Warning: WMAgent appears to have a failed component very recently."
Well. I can't comment, as I'm not on the development team, but fortuitously not only was I up early enough to notice the failure as it happened, but Alan was also awake and ready to fix the problem within minutes! I'd give us 10/10 for responsiveness; YMMV.
Joined: 15 Jun 08 · Posts: 2520 · Credit: 251,306,838 · RAC: 120,310

"I'd give us 10/10 for responsiveness..."
+1... and 10/10 for communication and user information. :-)
©2024 CERN