CMS Tasks Failing

Author	Message
Erich56 Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,419,920 RAC: 75,423	Message 31490 - Posted: 18 Jul 2017, 17:26:32 UTC - in response to Message 31483. I have startet 1 task ea. on two of my PCs. So far, everything seems fine. Yet I'll wait for a while before I start CMS on my main PC. ID: 31490 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31491 - Posted: 18 Jul 2017, 17:53:43 UTC - in response to Message 31490. I've had a couple of jobs finish, but it will be 12-18 hours before any tasks finish. That's the point when I'll be able to stop holding my breath. I haven't heard yet exactly what the problem was; I'll try to pass it on if I do. ID: 31491 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31492 - Posted: 18 Jul 2017, 18:58:37 UTC - in response to Message 31491. I've had a couple of jobs finish, but it will be 12-18 hours before any tasks finish. That's the point when I'll be able to stop holding my breath. I haven't heard yet exactly what the problem was; I'll try to pass it on if I do. I ... do not believe it! The problem apparently was that some[one\|thing] closed the http and https ports to Data Bridge in the firewall!!! Why would anyone do that over a weekend? Investigations are continuing. ID: 31492 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,419,920 RAC: 75,423	Message 31493 - Posted: 19 Jul 2017, 5:10:30 UTC But the WallClock Consumption Dashboard again looks strange, doesn't it? ID: 31493 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2679 Credit: 286,766,654 RAC: 80,431	Message 31494 - Posted: 19 Jul 2017, 6:09:35 UTC This morning it doesn't look like a typical firewall problem. The data bridge is vc-cms-output.cs3.cern.ch. dig vc-cms-output.cs3.cern.ch. +short cs3.cern.ch. 188.185.79.228 188.184.86.186 188.184.81.67 188.184.83.125 188.185.70.64 188.184.87.159 188.184.94.97 188.184.95.16 188.185.79.29 188.184.95.79 dig cs3.cern.ch. +short \|xargs -n1 -I {} nc -z -v -w 5 {} 80 Connection to 188.185.79.228 80 port [tcp/http] succeeded! Connection to 188.184.83.125 80 port [tcp/http] succeeded! Connection to 188.185.79.29 80 port [tcp/http] succeeded! Connection to 188.184.87.159 80 port [tcp/http] succeeded! Connection to 188.184.94.97 80 port [tcp/http] succeeded! Connection to 188.184.95.79 80 port [tcp/http] succeeded! Connection to 188.184.95.16 80 port [tcp/http] succeeded! Connection to 188.185.70.64 80 port [tcp/http] succeeded! Connection to 188.184.86.186 80 port [tcp/http] succeeded! Connection to 188.184.81.67 80 port [tcp/http] succeeded! dig cs3.cern.ch. +short \|xargs -n1 -I {} nc -z -v -w 5 {} 443 Connection to 188.184.95.16 443 port [tcp/https] succeeded! Connection to 188.185.79.228 443 port [tcp/https] succeeded! Connection to 188.184.87.159 443 port [tcp/https] succeeded! Connection to 188.184.94.97 443 port [tcp/https] succeeded! Connection to 188.184.86.186 443 port [tcp/https] succeeded! Connection to 188.184.95.79 443 port [tcp/https] succeeded! Connection to 188.184.83.125 443 port [tcp/https] succeeded! Connection to 188.184.81.67 443 port [tcp/https] succeeded! Connection to 188.185.70.64 443 port [tcp/https] succeeded! Connection to 188.185.79.29 443 port [tcp/https] succeeded! At least this line (occures several times) from the stderr.log/stdout.log points out that there is an error in the script: /var/lib/condor/execute/dir_6776/startup_environment.sh: line 52: syntax error near unexpected token `(' In addition there was a very unusual transient DNS error regarding a Theory WU last night. See: https://lhcathome.cern.ch/lhcathome/result.php?resultid=151633754 2017-07-18 22:10:04 (14553): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125 2017-07-18 22:10:04 (14553): Guest Log: [DEBUG] nc: getaddrinfo: Temporary failure in name resolution 2017-07-18 22:10:04 (14553): Guest Log: [DEBUG] 1 2017-07-18 22:10:04 (14553): Guest Log: [ERROR] Could not connect to lhchomeproxy.cern.ch on port 3125 2017-07-18 22:10:04 (14553): Guest Log: [INFO] Shutting Down. ID: 31494 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31496 - Posted: 19 Jul 2017, 9:50:33 UTC - in response to Message 31494. Last modified: 19 Jul 2017, 9:50:44 UTC At least this line (occures several times) from the stderr.log/stdout.log points out that there is an error in the script: /var/lib/condor/execute/dir_6776/startup_environment.sh: line 52: syntax error near unexpected token `(' I mentioned that to Laurence in an email on Saturday, but I suspect he's on holiday or otherwise discommoded. ID: 31496 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31497 - Posted: 19 Jul 2017, 9:55:15 UTC - in response to Message 31494. I'm not sure what's going on. It looks like stage-out problems again but LHC@Home assures me that the firewall hasn't been tampered with again. All my tasks are running OK, so unless and until I get a log-file of a failed stage-out I'm just running around waving my finger in the air and guessing. Excuse me for a little while, I have a cryo-magnet to fill with liquid nitrogen... ID: 31497 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31498 - Posted: 19 Jul 2017, 12:08:11 UTC - in response to Message 31497. Sorted, I think. My fault -- Laurence's cluster uses my credentials and I misjudged when I needed to create a new one (the warning messages start about four days out and give the remaining lifetime in seconds...), so the old one ran out. I just copied across a new 9-day proxy, so his jobs should stage-out OK from now on. ID: 31498 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,419,920 RAC: 75,423	Message 31507 - Posted: 19 Jul 2017, 19:26:27 UTC - in response to Message 31498. Thanks, Ivan; everything seems to run okay now :-) ID: 31507 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31509 - Posted: 19 Jul 2017, 20:02:54 UTC - in response to Message 31507. Thanks, Ivan; everything seems to run okay now :-) Looks better, yes. It would have only affected Laurence's cluster (and my lone CERN Openstack VM), not any of youse guys' results. ID: 31509 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2679 Credit: 286,766,654 RAC: 80,431	Message 31511 - Posted: 19 Jul 2017, 20:19:33 UTC - in response to Message 31509. It would have only affected Laurence's cluster The ratio red:green during the last few hours was roughly 2:1. If the red part shows the contribution of Laurence's cluster and the green part that of all other users ... very impressive. :-) ID: 31511 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31512 - Posted: 19 Jul 2017, 20:55:58 UTC - in response to Message 31511. It would have only affected Laurence's cluster The ratio red:green during the last few hours was roughly 2:1. If the red part shows the contribution of Laurence's cluster and the green part that of all other users ... very impressive. :-) Last I heard, he had activated 500 VMs... CERN VMs, nominally assigned to CMS, otherwise unused. ID: 31512 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31532 - Posted: 21 Jul 2017, 8:24:49 UTC My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so... ID: 31532 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31535 - Posted: 21 Jul 2017, 9:00:39 UTC - in response to Message 31532. My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so... Looks like a network disturbance. The number of running jobs has fallen, but stabilised. I'll keep my eye on it. ID: 31535 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31544 - Posted: 21 Jul 2017, 22:35:56 UTC - in response to Message 31535. My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so... Looks like a network disturbance. The number of running jobs has fallen, but stabilised. I'll keep my eye on it. I just noticed that the WMAgent developed a problem at 1028 UTC, but that was long after the disturbances I saw above so it's not something that affects our operation unduly. The usual people have been notified, it's unlikely to get fixed tonight and given the holiday situation in Europe I may have to chase it up through other channels on Monday. ID: 31544 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31552 - Posted: 22 Jul 2017, 10:17:48 UTC - in response to Message 31535. My monitors are showing what might be just a glitch, but might also be the start of another disturbance. I'll know more when I'm in my office in 20 minutes or so... Looks like a network disturbance. The number of running jobs has fallen, but stabilised. I'll keep my eye on it. Some sort of 24-hour back-off; we've recovered the number of jobs we lost yesterday. Machines having their quotas cut due to too many "bad" tasks? ID: 31552 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31818 - Posted: 5 Aug 2017, 4:29:12 UTC Warning: WMAgent appears to have a failed component very recently. Queue seems to be exhausted. Please set No New Tasks or change to a backup app while I try to raise someone at CERN to fix it. Could be a problem given that this is expected to be the heaviest weekend of the year for holiday travel in Europe... ID: 31818 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,419,920 RAC: 75,423	Message 31820 - Posted: 5 Aug 2017, 6:56:23 UTC - in response to Message 31818. Warning: WMAgent appears to have a failed component very recently. hm, obviously, the new version of the WMAgent fails quite as often as the previous version :-( ID: 31820 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1110 Credit: 9,354,632 RAC: 3,035	Message 31822 - Posted: 5 Aug 2017, 7:10:22 UTC - in response to Message 31820. Warning: WMAgent appears to have a failed component very recently. hm, obviously, the new version of the WMAgent fails quite as often as the previous version :-( Tja. Can't comment as I'm not on the development team, but fortuitously not only was I up early enough to notice the failure as it happened, but also Alan was awake and aware to fix the problem within minutes! I'd give us 10/10 for responsiveness, YMMV. ID: 31822 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2679 Credit: 286,766,654 RAC: 80,431	Message 31824 - Posted: 5 Aug 2017, 8:20:19 UTC - in response to Message 31822. I'd give us 10/10 for responsiveness ... +1 ... and 10/10 for communication and user information. :-) ID: 31824 · Reply Quote

LHC@home