CMS Tasks Failing
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
"Yes, the agent has died." "The agent appears to have just been restarted." It's really too bad that the agent fails that often. Seems to be rather unstable :-( Recently, I talked to someone who is knowledgeable about WMAgent - obviously a well-functioning WMAgent depends on precise alignment and calibration. Maybe that's where there are still deficits; that's just my guess. Ivan, is it the new WMAgent that has been in operation lately, or was it still the old one? I am asking because I have the impression that since the change to the new release, the number of failures has gone up markedly. |
Joined: 29 Aug 05 Posts: 1065 Credit: 8,104,119 RAC: 15,472 |
"Yes, the agent has died." "The agent appears to have just been restarted." "It's really too bad that the agent fails that often. Seems to be rather unstable :-(" It's the new one, on CentOS7. The failures over the past several days have been to do with the DB ("It seem it was MySQL(MariaDB) connection problem"), so some parameter tweaking might be needed. We were pushing it hard when the other projects had no tasks to distribute, and it was noticeable that the queue was not being maintained as the number of running jobs increased (https://batch-carbon.cern.ch/grafana/dashboard/db/cluster-batch-jobs?var-cluster=vcpool&from=now-24h&to=now-5m if you happen to have appropriate CMS or CERN credentials). Well, all the experts should be back on board in the next few days so they might be able to give it a thought or two. I'm hoping to press on with other developments very soon as well, but I seem to have to fight every step of the way. |
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
"... I'm hoping to press on with other developments very soon as well, but I seem to have to fight every step of the way." Ivan, as always, thanks for your help - please continue to fight :-) |
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
This morning, I have had two terminations of tasks: 1) Status "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT" after 13 hours 17 minutes. I definitely did NOT abort the task. https://lhcathome.cern.ch/lhcathome/result.php?resultid=171847629 2) Status "206 (0x000000CE) EXIT_INIT_FAILURE" after 1 hour 19 minutes. STDERR says: Guest Log: [ERROR] Condor exited after 52997s without running a job. https://lhcathome.cern.ch/lhcathome/result.php?resultid=171841499 I have never had this kind of error before (normally, when there were EXIT_INIT_FAILUREs, the tasks failed after 10-15 minutes). Anyone any idea what's going on? |
Joined: 29 Aug 05 Posts: 1065 Credit: 8,104,119 RAC: 15,472 |
"This morning, I have had two terminations of tasks:" There was some sort of glitch on the CMS Jobs graphs around 0800-0900 UTC, but I can't correlate that to the times in your job logs. The first one ran several jobs, but then appeared not to get a new job from the Condor server and shut down. As far as I know it's supposed to report success in that case. The second one appears to have been stopped for ~17 hours overnight, and when it was restarted it thought Condor hadn't been running any jobs for too long a period and so stopped. Hard to say why it didn't restart OK. In my experience it doesn't always continue with the saved image, and pulls another job, but I don't think I've ever had a pause as long as this one -- perhaps a timeout was exceeded; I'm not au fait with the several timeout limits you can set with Condor. |
Joined: 14 Jan 10 Posts: 1432 Credit: 9,595,867 RAC: 4,807 |
"Anyone any idea what's going on?" That is caused by the rare BOINC condition "finish file present too long". I have seen that on several projects, but it's not clear to me what causes the finish file to be present for so long. Anyway, BOINC aborts the task and will not wait any longer. |
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
thanks, guys, for your explanations ! |
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
Within about 1 hour, two CMS tasks failed after 2,838 and 459 seconds. https://lhcathome.cern.ch/lhcathome/result.php?resultid=174137107 206 (0x000000CE) EXIT_INIT_FAILURE https://lhcathome.cern.ch/lhcathome/result.php?resultid=174139196 -152 (0xFFFFFF68) ERR_NETOPEN ([ERROR] Could not connect to Condor server on port 9618) Besides, quite a number of Theory tasks also failed at about the same time with: 207 (0x000000CF) EXIT_NO_SUB_TASKS What could be the reason for these problems? |
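A side note on the numbers in these reports: the hexadecimal value in parentheses is the same exit code written as an unsigned 32-bit integer, which is why -152 shows up as 0xFFFFFF68. A minimal Python sketch of that conversion (the helper name `to_hex32` is made up for illustration and is not part of BOINC):

```python
def to_hex32(code: int) -> str:
    """Format a signed 32-bit exit code as 8 hex digits, as on the result pages."""
    # Masking with 0xFFFFFFFF maps a negative value to its two's-complement form.
    return f"0x{code & 0xFFFFFFFF:08X}"


# The four codes quoted in this thread.
for code in (194, 206, 207, -152):
    print(code, to_hex32(code))
```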
Joined: 1 Sep 04 Posts: 140 Credit: 2,579 RAC: 0 |
Thanks for the heads-up. I just asked our system manager (Nils) about it and he replied: "This is probably because the Condor node handling jobs for Theory is being updated for Spectre and Meltdown today. https://cern.service-now.com/service-portal/view-outage.do?n=OTG0041682 Should be back again soon. Cheers, Nils" Probably the CMS problems have the same cause. |
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
CMS tasks are still failing once in a while, with the following error message:

    2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
    2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] 0
    2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
    2018-01-25 06:46:29 (8188): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
    2018-01-25 06:46:29 (8188): Guest Log: [DEBUG] 1
    2018-01-25 06:46:29 (8188): Guest Log: [ERROR] Could not connect to Condor server on port 9618
    2018-01-25 06:46:29 (8188): Guest Log: [INFO] Shutting Down.

As so often, the problem is the connection to the Condor server. Why so? Has this ever been looked into thoroughly? |
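The log shows the VM probing the Condor server with `nc` and giving up after about 30 seconds. For anyone who wants to repeat that check from the host, here is a rough Python equivalent. It's only a sketch: the function name `can_connect` is my own, and the 30-second default simply mirrors the gap between the timestamps in the log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS lookup failures.
        return False
```

For example, `can_connect("vccondor01.cern.ch", 9618)` would repeat the failing test from the log. A `False` result only says the TCP connection could not be opened from where you ran it; it does not tell you which side is at fault.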
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
Ivan wrote on 12/31/2017: "No, I don't know why, but I have seen it myself on my University servers so I guess there is some general network problem." hello Ivan, as the problems persist and everyday some tasks fail with

    2018-01-28 12:26:46 (7296): Guest Log: 01/28/18 12:28:29 attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110).
    2018-01-28 12:26:46 (7296): Guest Log: ERROR: failed to make connection to <128.142.142.167:9618>
    2018-01-28 12:26:47 (7296): Guest Log: [ERROR] Could not ping HTCondor.

I'd like to remind you of this, as you offered above. Many thanks. |
Joined: 29 Aug 05 Posts: 1065 Credit: 8,104,119 RAC: 15,472 |
Yes, I did start to look into it a bit, but it only happens very occasionally to me. I wrote a script to parse log files on the server, but soon realised that even testing 5 or 6 a second it was going to take ages to scan all the logs -- there doesn't seem to be a way of picking out only CMS tasks. I'll mention it to Laurence again, but he's a bit preoccupied with other matters at the moment.

    $ cat geterrors.sh
    #!/bin/bash
    for ((i=171834459; i<172242548; i++))
    do
      rm -f log
      echo $i
      wget -q -O log "https://lhcathome.cern.ch/lhcathome/result.php?resultid=$i"
      gawk -f getlogs.awk log
    done
    $ cat getlogs.awk
    /Name/ {
      if ((getline tmp) > 0) {
        if (index(tmp, "CMS_") < 1) { exit; }
      }
    }
    /Computer ID/ {
      if ((getline tmp) > 0) {
        if (index(tmp, "style") < 1) { exit; }
        else { split(tmp, a, ">"); split(a[3], b, "<"); id = b[1]; }
      }
    }
    /Could not connect to Condor/ { print id, $0; exit; }
|
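For readers who don't speak awk, the filter in getlogs.awk can be sketched in Python as below. This is a loose port of the logic only: it doesn't fetch anything, it assumes the page text is already in a string, and the function name `scan_result_page` is invented for the example. Like the awk version, it relies on the task name and the computer ID sitting on the line after their labels, which is an assumption about the page layout.

```python
from typing import Optional


def scan_result_page(text: str) -> Optional[str]:
    """Return '<computer id> <error line>' for a CMS task that failed to
    reach Condor, or None if the page does not match."""
    lines = iter(text.splitlines())
    host_id = ""
    for line in lines:
        if "Name" in line:
            # Like awk's getline: consume the following line and inspect it.
            nxt = next(lines, None)
            if nxt is not None and "CMS_" not in nxt:
                return None  # not a CMS task, stop scanning
        if "Computer ID" in line:
            nxt = next(lines, None)
            if nxt is not None:
                if "style" not in nxt:
                    return None
                # awk's a[3] is the third '>'-separated field (index 2 here).
                parts = nxt.split(">")
                field = parts[2] if len(parts) > 2 else ""
                host_id = field.split("<")[0]
        if "Could not connect to Condor" in line:
            return f"{host_id} {line}"
    return None
```

You would call it once per downloaded result page and print anything that isn't `None`. It does nothing about the real bottleneck, which is still having to fetch every result ID in the range.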
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
Thanks for your efforts, anyway, Ivan :-) |
Joined: 27 Sep 08 Posts: 853 Credit: 696,081,431 RAC: 149,803 |
For me I have 2.6% of CMS fail, it's OK for me normally they only waste 10min. |
Joined: 29 Aug 05 Posts: 1065 Credit: 8,104,119 RAC: 15,472 |
"For me I have 2.6% of CMS fail, it's OK for me normally they only waste 10min." Is it a consistent reason, Toby, like the failure to contact Condor that we've been discussing? 2.6% is "reasonable" given the overall problems and the complexity of the workflow chain. I'd prefer it to be much less, of course! |
Joined: 18 Dec 15 Posts: 1835 Credit: 120,493,308 RAC: 74,341 |
"hello Ivan, as the problems persist and everyday some tasks fail with" this afternoon, I had 4 tasks failing in a row :-( |