Message boards : CMS Application : CMS Tasks Failing

Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33614 - Posted: 1 Jan 2018, 14:28:01 UTC - in response to Message 33613.  

Yes, the agent has died.
The agent appears to have just been restarted.
It's really too bad that the agent fails that often. Seems to be rather unstable :-(

Recently, I talked to someone who is knowledgeable about WMAgent - obviously a well-functioning WMAgent depends on precise alignment and calibration. Maybe that's where there are still deficits; that's just my guess.

Ivan, is it the new WMAgent that has been in operation lately, or was it still the old one? I am asking because I have the impression that since the change to the new release, the number of failures has gone up markedly.
ID: 33614
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33617 - Posted: 1 Jan 2018, 15:51:35 UTC - in response to Message 33614.  

Yes, the agent has died.
The agent appears to have just been restarted.
It's really too bad that the agent fails that often. Seems to be rather unstable :-(

Recently, I talked to someone who is knowledgeable about WMAgent - obviously a well-functioning WMAgent depends on precise alignment and calibration. Maybe that's where there are still deficits; that's just my guess.

Ivan, is it the new WMAgent that has been in operation lately, or was it still the old one? I am asking because I have the impression that since the change to the new release, the number of failures has gone up markedly.

It's the new one, on CentOS7. The failures over the past several days have been to do with the DB, "It seem it was MySQL(MariaDB) connection problem", so some parameter tweaking may be needed. We were pushing it hard when the other projects had no tasks to distribute, and it was noticeable that the queue was not being maintained as the number of running jobs increased (https://batch-carbon.cern.ch/grafana/dashboard/db/cluster-batch-jobs?var-cluster=vcpool&from=now-24h&to=now-5m if you happen to have appropriate CMS or CERN credentials). Well, all the experts should be back on board in the next few days so they might be able to give it a thought or two. I'm hoping to press on with other developments very soon as well, but I seem to have to fight every step of the way.
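
(Purely for illustration, and not a statement of what is actually configured on the agent's DB node: the usual suspects for dropped MySQL/MariaDB connections are the connection and timeout limits, which can at least be inspected, and temporarily raised, along these lines; the values shown are placeholders.)

# Sketch only: inspect and temporarily raise MariaDB connection/timeout limits.
# The values are illustrative placeholders, not the agent's real settings.
mysql -u root -p -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'; SHOW GLOBAL VARIABLES LIKE 'wait_timeout';"
mysql -u root -p -e "SET GLOBAL max_connections = 500; SET GLOBAL wait_timeout = 28800;"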
ID: 33617
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33621 - Posted: 2 Jan 2018, 6:06:48 UTC - in response to Message 33617.  

... I'm hoping to press on with other developments very soon as well, but I seem to have to fight every step of the way.
Ivan, as always, thanks for your help - please continue to fight :-)
ID: 33621
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33628 - Posted: 2 Jan 2018, 15:23:18 UTC
Last modified: 2 Jan 2018, 15:56:24 UTC

This morning, I have had two terminations of tasks:

1) Status "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT" after 13 hours 17 minutes. I definitely did NOT abort the task.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171847629

2) Status "206 (0x000000CE) EXIT_INIT_FAILURE" after 1 hours 19 minutes. STDERR says: Guest Log: [ERROR] Condor exited after 52997s without running a job.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171841499

I have never had these kinds of errors before (normally, when there were EXIT_INIT_FAILUREs, the tasks failed after 10-15 minutes).

Anyone any idea what's going on?
ID: 33628
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33633 - Posted: 2 Jan 2018, 18:38:10 UTC - in response to Message 33628.  

This morning, I have had two terminations of tasks:

1) Status "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT" after 13 hours 17 minutes. I definitely did NOT abort the task.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171847629

2) Status "206 (0x000000CE) EXIT_INIT_FAILURE" after 1 hours 19 minutes. STDERR says: Guest Log: [ERROR] Condor exited after 52997s without running a job.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171841499

I have never had these kinds of errors before (normally, when there were EXIT_INIT_FAILUREs, the tasks failed after 10-15 minutes).

Anyone any idea what's going on?

There was some sort of glitch on the CMS Jobs graphs around 0800-0900 UTC, but I can't correlate that to the times in your job logs. The first one ran several jobs, but then appeared not to get a new job from the Condor server and shut down. As far as I know it's supposed to report success in that case. The second one appears to have been stopped for ~17 hours overnight, and when it was restarted it thought Condor hadn't been running any jobs for too long a period and so stopped. Hard to say why it didn't restart OK. In my experience it doesn't always continue with the saved image but pulls another job instead, but I don't think I've ever had a pause as long as this one -- perhaps a timeout was exceeded; I'm not au fait with the several timeout limits you can set with Condor.
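
(For reference only, and assuming a node with the HTCondor command-line tools installed: individual settings can at least be inspected with condor_config_val; the parameter names below are merely examples of timeout-related knobs, not a claim about what vccondor01 has configured.)

# Sketch: inspect a couple of timeout-related HTCondor settings on a node
# where the condor tools are available (parameter names are examples only).
condor_config_val JobLeaseDuration
condor_config_val NOT_RESPONDING_TIMEOUT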
ID: 33633
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,493,530
RAC: 2,126
Message 33635 - Posted: 2 Jan 2018, 22:00:10 UTC - in response to Message 33628.  

Anyone any idea what's going on?
That is caused by the rare BOINC condition: "finish file present too long".
I have seen that on several projects, but it's not clear to me what causes the finish file to be present for so long.
In any case, BOINC aborts the task rather than waiting any longer.
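
(Background, as a rough sketch of the mechanism rather than a quote of the client source: when the science application finishes it writes a small marker file, boinc_finish_called, into the task's slot directory; the client then expects the process to exit promptly, and if the marker sits there past an internal limit the task is errored out as EXIT_ABORTED_BY_CLIENT. On a Linux host, something like the following, run from the BOINC data directory, shows whether such a marker is lingering.)

# Rough sketch (assumes Linux and GNU stat), run from the BOINC data directory:
# list any lingering finish markers in the slot directories and their age.
for f in slots/*/boinc_finish_called; do
    [ -e "$f" ] || continue
    age=$(( $(date +%s) - $(stat -c %Y "$f") ))
    echo "$f written ${age}s ago"   # if this grows past the client's limit, the task is aborted
done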
ID: 33635
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33640 - Posted: 3 Jan 2018, 10:08:03 UTC

Thanks, guys, for your explanations!
ID: 33640
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33757 - Posted: 9 Jan 2018, 21:09:16 UTC

Sorry to say, the WMAgent has decided to take a rest again and the job queue is draining fast; we'll be running dry in an hour or so. I've notified CERN, hope it's not too late for someone to kick it back to life in time.
ID: 33757
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33758 - Posted: 9 Jan 2018, 22:03:55 UTC - in response to Message 33757.  

Ah, our Fermilab contact had it up and running again within about five minutes. Good work! The job queue is full again now.
ID: 33758
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34029 - Posted: 22 Jan 2018, 12:16:20 UTC

Within about an hour, two CMS tasks failed, after 2,838 and 459 seconds respectively.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=174137107
206 (0x000000CE) EXIT_INIT_FAILURE

https://lhcathome.cern.ch/lhcathome/result.php?resultid=174139196
-152 (0xFFFFFF68) ERR_NETOPEN ([ERROR] Could not connect to Condor server on port 9618)

In addition, quite a number of Theory tasks failed at about the same time with 207 (0x000000CF) EXIT_NO_SUB_TASKS.

What could be the reason for these problems?
ID: 34029
Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 34032 - Posted: 22 Jan 2018, 13:13:27 UTC - in response to Message 34029.  

Thanks for the heads-up. I just asked our system manager (Nils) about it and he replied:

"This is probably because the Condor node handling jobs for Theory is being updated for Spectre and Meltdown today.

https://cern.service-now.com/service-portal/view-outage.do?n=OTG0041682

Should be back again soon.

Cheers, Nils"

Probably the CMS problems have the same cause.
ID: 34032
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 34035 - Posted: 22 Jan 2018, 14:41:27 UTC - in response to Message 34032.  

We had a hypervisor reboot for the same reason this morning, affecting our WMAgent. It's up again now.
ID: 34035
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34078 - Posted: 25 Jan 2018, 6:09:18 UTC - in response to Message 30718.  

CMS tasks are still failing once in a while, with the following error message:

2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] 0
2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2018-01-25 06:46:29 (8188): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-25 06:46:29 (8188): Guest Log: [DEBUG] 1
2018-01-25 06:46:29 (8188): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2018-01-25 06:46:29 (8188): Guest Log: [INFO] Shutting Down.


As so often, the problem is the connection to the Condor server. Why is that? Has this ever been looked into thoroughly?
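
(For what it's worth, the connectivity test the VM runs can be repeated by hand from any machine, assuming the nc/netcat utility is installed; if this times out, the task fails in exactly this way.)

# Sketch: repeat the VM's check of the Condor server port by hand (needs nc/netcat).
nc -z -w 30 vccondor01.cern.ch 9618 && echo "port 9618 reachable" || echo "connection failed or timed out"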
ID: 34078
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1127
Credit: 49,750,941
RAC: 8,813
Message 34120 - Posted: 28 Jan 2018, 0:43:12 UTC

ID: 34120
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34126 - Posted: 28 Jan 2018, 12:16:51 UTC - in response to Message 33605.  

Ivan wrote on 12/31/2017:
No, I don't know why, but I have seen it myself on my University servers so I guess there is some general network problem.
I'll try to look into it next week; don't feel shy about reminding me if you don't hear anything from me!
Hello Ivan, as the problems persist and every day some tasks fail with

2018-01-28 12:26:46 (7296): Guest Log: 01/28/18 12:28:29 attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110).
2018-01-28 12:26:46 (7296): Guest Log: ERROR: failed to make connection to <128.142.142.167:9618>
2018-01-28 12:26:47 (7296): Guest Log: [ERROR] Could not ping HTCondor.


I'd like to remind you, as you offered above. Many thanks.
ID: 34126
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 34130 - Posted: 28 Jan 2018, 17:47:11 UTC - in response to Message 34126.  
Last modified: 28 Jan 2018, 18:03:07 UTC

Yes, I did start to look into it a bit, but it only happens very occasionally to me. I wrote a script to parse log files on the server, but soon realised that even testing 5 or 6 a second it was going to take ages to scan all the logs -- there doesn't seem to be a way of picking out only CMS tasks. I'll mention it to Laurence again, but he's a bit preoccupied with other matters at the moment.

$ cat geterrors.sh
#!/bin/bash
# Scan a range of result IDs: fetch each result page and pass it to the
# awk filter that looks for CMS tasks reporting Condor connection errors.
for ((i=171834459; i<172242548; i++))
do
    rm -f log
    echo $i
    wget -q -O log "https://lhcathome.cern.ch/lhcathome/result.php?resultid=$i"
    gawk -f getlogs.awk log
done

$ cat getlogs.awk
# Skip anything that is not a CMS task: the line after "Name" must contain "CMS_".
/Name/ { if ((getline tmp) > 0) {
             if (index(tmp, "CMS_") < 1) { exit; }
         } }
# Pull the computer ID out of the line following "Computer ID".
/Computer ID/ { if ((getline tmp) > 0) {
                    if (index(tmp, "style") < 1) { exit; }
                    else { split(tmp, a, ">"); split(a[3], b, "<"); id = b[1]; }
                } }
# Report the host ID plus the matching error line, then stop reading this result.
/Could not connect to Condor/ { print id, $0; exit; }
ID: 34130
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34131 - Posted: 28 Jan 2018, 17:53:05 UTC - in response to Message 34130.  

Thanks for your efforts, anyway, Ivan :-)
ID: 34131
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 807
Credit: 652,451,754
RAC: 279,689
Message 34132 - Posted: 28 Jan 2018, 18:58:10 UTC

For me, about 2.6% of CMS tasks fail; that's OK for me, as normally they only waste 10 minutes.
ID: 34132
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 34133 - Posted: 28 Jan 2018, 20:12:52 UTC - in response to Message 34132.  

For me, about 2.6% of CMS tasks fail; that's OK for me, as normally they only waste 10 minutes.

Is it a consistent reason, Toby, like the failure to contact Condor that we've been discussing?
2.6% is "reasonable" given the overall problems and the complexity of the workflow chain. I'd prefer it to be much less, of course!
ID: 34133
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34150 - Posted: 29 Jan 2018, 19:24:21 UTC - in response to Message 34126.  

Hello Ivan, as the problems persist and every day some tasks fail with

2018-01-28 12:26:46 (7296): Guest Log: 01/28/18 12:28:29 attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110).
2018-01-28 12:26:46 (7296): Guest Log: ERROR: failed to make connection to <128.142.142.167:9618>
2018-01-28 12:26:47 (7296): Guest Log: [ERROR] Could not ping HTCondor.
This afternoon, I had 4 tasks fail in a row :-(
ID: 34150