Message boards : CMS Application : CMS Tasks Failing

Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33614 - Posted: 1 Jan 2018, 14:28:01 UTC - in response to Message 33613.  

Yes, the agent has died.
The agent appears to have just been restarted.
It's really too bad that the agent fails that often. Seems to be rather unstable :-(

Recently, I talked to someone who is knowledgeable about WMAgent - obviously a well-functioning WMAgent depends on precise alignment and calibration. Maybe that's where there are still deficits; that's just my guess.

Ivan, is it the new WMAgent that has been in operation lately, or was it still the old one? I am asking because I have the impression that since the change to the new release, the number of failures has gone up markedly.
ID: 33614
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33617 - Posted: 1 Jan 2018, 15:51:35 UTC - in response to Message 33614.  

Yes, the agent has died.
The agent appears to have just been restarted.
It's really too bad that the agent fails that often. Seems to be rather unstable :-(

Recently, I talked to someone who is knowledgeable about WMAgent - obviously a well-functioning WMAgent depends on precise alignment and calibration. Maybe that's where there are still deficits; that's just my guess.

Ivan, is it the new WMAgent that has been in operation lately, or was it still the old one? I am asking because I have the impression that since the change to the new release, the number of failures has gone up markedly.

It's the new one, on CentOS7. The failures over the past several days have been to do with the DB, "It seem it was MySQL(MariaDB) connection problem", so some parameter tweaking may be needed. We were pushing it hard when the other projects had no tasks to distribute, and it was noticeable that the queue was not being maintained as the number of running jobs increased (https://batch-carbon.cern.ch/grafana/dashboard/db/cluster-batch-jobs?var-cluster=vcpool&from=now-24h&to=now-5m if you happen to have appropriate CMS or CERN credentials). Well, all the experts should be back on board in the next few days so they might be able to give it a thought or two. I'm hoping to press on with other developments very soon as well, but I seem to have to fight every step of the way.
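
(Purely for illustration, and not a statement of what is actually configured on the agent's DB node: the usual suspects for dropped MySQL/MariaDB connections are the connection and timeout limits, which can at least be inspected, and temporarily raised, along these lines; the values shown are placeholders.)

# Sketch only: inspect and temporarily raise MariaDB connection/timeout limits.
# The values are illustrative placeholders, not the agent's real settings.
mysql -u root -p -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'; SHOW GLOBAL VARIABLES LIKE 'wait_timeout';"
mysql -u root -p -e "SET GLOBAL max_connections = 500; SET GLOBAL wait_timeout = 28800;"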
ID: 33617
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33621 - Posted: 2 Jan 2018, 6:06:48 UTC - in response to Message 33617.  

... I'm hoping to press on with other developments very soon as well, but I seem to have to fight every step of the way.
Ivan, as always, thanks for your help - please continue to fight :-)
ID: 33621
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33628 - Posted: 2 Jan 2018, 15:23:18 UTC
Last modified: 2 Jan 2018, 15:56:24 UTC

This morning, I have had two terminations of tasks:

1) Status "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT" after 13 hours 17 minutes. I definitely did NOT abort the task.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171847629

2) Status "206 (0x000000CE) EXIT_INIT_FAILURE" after 1 hours 19 minutes. STDERR says: Guest Log: [ERROR] Condor exited after 52997s without running a job.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171841499

I have never had these kinds of errors before (normally, when there were EXIT_INIT_FAILUREs, the tasks failed after 10-15 minutes).

Anyone any idea what's going on?
ID: 33628
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33633 - Posted: 2 Jan 2018, 18:38:10 UTC - in response to Message 33628.  

This morning, I have had two terminations of tasks:

1) Status "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT" after 13 hours 17 minutes. I definitely did NOT abort the task.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171847629

2) Status "206 (0x000000CE) EXIT_INIT_FAILURE" after 1 hours 19 minutes. STDERR says: Guest Log: [ERROR] Condor exited after 52997s without running a job.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=171841499

I have never had these kinds of errors before (normally, when there were EXIT_INIT_FAILUREs, the tasks failed after 10-15 minutes).

Anyone any idea what's going on?

There was some sort of glitch on the CMS Jobs graphs around 0800-0900 UTC, but I can't correlate that to the times in your job logs. The first one ran several jobs, but then appeared not to get a new job from the Condor server and shut down. As far as I know it's supposed to report success in that case. The second one appears to have been stopped for ~17 hours overnight, and when it was restarted it thought Condor hadn't been running any jobs for too long a period and so stopped. Hard to say why it didn't restart OK. In my experience it doesn't always continue with the saved image but pulls another job instead, but I don't think I've ever had a pause as long as this one -- perhaps a timeout was exceeded; I'm not au fait with the several timeout limits you can set with Condor.
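
(For reference only, and assuming a node with the HTCondor command-line tools installed: individual settings can at least be inspected with condor_config_val; the parameter names below are merely examples of timeout-related knobs, not a claim about what vccondor01 has configured.)

# Sketch: inspect a couple of timeout-related HTCondor settings on a node
# where the condor tools are available (parameter names are examples only).
condor_config_val JobLeaseDuration
condor_config_val NOT_RESPONDING_TIMEOUT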
ID: 33633
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,493,530
RAC: 2,126
Message 33635 - Posted: 2 Jan 2018, 22:00:10 UTC - in response to Message 33628.  

Anyone any idea what's going on?
That is caused by the rare BOINC condition: "finish file present too long".
I have seen that on several projects, but it's not clear to me what causes the finish file to be present for so long.
In any case, BOINC aborts the task rather than waiting any longer.
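
(Background, as a rough sketch of the mechanism rather than a quote of the client source: when the science application finishes it writes a small marker file, boinc_finish_called, into the task's slot directory; the client then expects the process to exit promptly, and if the marker sits there past an internal limit the task is errored out as EXIT_ABORTED_BY_CLIENT. On a Linux host, something like the following, run from the BOINC data directory, shows whether such a marker is lingering.)

# Rough sketch (assumes Linux and GNU stat), run from the BOINC data directory:
# list any lingering finish markers in the slot directories and their age.
for f in slots/*/boinc_finish_called; do
    [ -e "$f" ] || continue
    age=$(( $(date +%s) - $(stat -c %Y "$f") ))
    echo "$f written ${age}s ago"   # if this grows past the client's limit, the task is aborted
done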
ID: 33635
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 33640 - Posted: 3 Jan 2018, 10:08:03 UTC

Thanks, guys, for your explanations!
ID: 33640
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33757 - Posted: 9 Jan 2018, 21:09:16 UTC

Sorry to say, the WMAgent has decided to take a rest again and the job queue is draining fast; we'll be running dry in an hour or so. I've notified CERN, hope it's not too late for someone to kick it back to life in time.
ID: 33757
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 33758 - Posted: 9 Jan 2018, 22:03:55 UTC - in response to Message 33757.  

Ah, our Fermilab contact had it up and running again within about five minutes. Good work! The job queue is full again now.
ID: 33758
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34029 - Posted: 22 Jan 2018, 12:16:20 UTC

Within about an hour, two CMS tasks failed, after 2,838 and 459 seconds respectively.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=174137107
206 (0x000000CE) EXIT_INIT_FAILURE

https://lhcathome.cern.ch/lhcathome/result.php?resultid=174139196
-152 (0xFFFFFF68) ERR_NETOPEN ([ERROR] Could not connect to Condor server on port 9618)

In addition, quite a number of Theory tasks failed at about the same time with 207 (0x000000CF) EXIT_NO_SUB_TASKS.

What could be the reason for these problems?
ID: 34029
Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 34032 - Posted: 22 Jan 2018, 13:13:27 UTC - in response to Message 34029.  

Thanks for the heads-up. I just asked our system manager (Nils) about it and he replied:

"This is probably because the Condor node handling jobs for Theory is being updated for Spectre and Meltdown today.

https://cern.service-now.com/service-portal/view-outage.do?n=OTG0041682

Should be back again soon.

Cheers, Nils"

Probably the CMS problems have the same cause.
ID: 34032
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 34035 - Posted: 22 Jan 2018, 14:41:27 UTC - in response to Message 34032.  

We had a hypervisor reboot for the same reason this morning, affecting our WMAgent. It's up again now.
ID: 34035
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34078 - Posted: 25 Jan 2018, 6:09:18 UTC - in response to Message 30718.  

CMS tasks are still failing once in a while, with the following error message:

2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] 0
2018-01-25 06:45:59 (8188): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2018-01-25 06:46:29 (8188): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-25 06:46:29 (8188): Guest Log: [DEBUG] 1
2018-01-25 06:46:29 (8188): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2018-01-25 06:46:29 (8188): Guest Log: [INFO] Shutting Down.


As so often, the problem is the connection to the Condor server. Why is that? Has this ever been looked into thoroughly?
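
(For what it's worth, the connectivity test the VM runs can be repeated by hand from any machine, assuming the nc/netcat utility is installed; if this times out, the task fails in exactly this way.)

# Sketch: repeat the VM's check of the Condor server port by hand (needs nc/netcat).
nc -z -w 30 vccondor01.cern.ch 9618 && echo "port 9618 reachable" || echo "connection failed or timed out"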
ID: 34078
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1127
Credit: 49,750,941
RAC: 8,813
Message 34120 - Posted: 28 Jan 2018, 0:43:12 UTC

ID: 34120
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34126 - Posted: 28 Jan 2018, 12:16:51 UTC - in response to Message 33605.  

Ivan wrote on 12/31/2017:
No, I don't know why, but I have seen it myself on my University servers so I guess there is some general network problem.
I'll try to look into it next week; don't feel shy about reminding me if you don't hear anything from me!
Hello Ivan, as the problems persist and every day some tasks fail with

2018-01-28 12:26:46 (7296): Guest Log: 01/28/18 12:28:29 attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110).
2018-01-28 12:26:46 (7296): Guest Log: ERROR: failed to make connection to <128.142.142.167:9618>
2018-01-28 12:26:47 (7296): Guest Log: [ERROR] Could not ping HTCondor.


I'd like to remind you, as you offered above. Many thanks.
ID: 34126
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 34130 - Posted: 28 Jan 2018, 17:47:11 UTC - in response to Message 34126.  
Last modified: 28 Jan 2018, 18:03:07 UTC

Yes, I did start to look into it a bit, but it only happens very occasionally to me. I wrote a script to parse log files on the server, but soon realised that even testing 5 or 6 a second it was going to take ages to scan all the logs -- there doesn't seem to be a way of picking out only CMS tasks. I'll mention it to Laurence again, but he's a bit preoccupied with other matters at the moment.

$ cat geterrors.sh
#!/bin/bash
# Scan a range of result IDs: fetch each result page and pass it to the
# awk filter that looks for CMS tasks reporting Condor connection errors.
for ((i=171834459; i<172242548; i++))
do
    rm -f log
    echo $i
    wget -q -O log "https://lhcathome.cern.ch/lhcathome/result.php?resultid=$i"
    gawk -f getlogs.awk log
done

$ cat getlogs.awk
# Skip anything that is not a CMS task: the line after "Name" must contain "CMS_".
/Name/ { if ((getline tmp) > 0) {
             if (index(tmp, "CMS_") < 1) { exit; }
         } }
# Pull the computer ID out of the line following "Computer ID".
/Computer ID/ { if ((getline tmp) > 0) {
                    if (index(tmp, "style") < 1) { exit; }
                    else { split(tmp, a, ">"); split(a[3], b, "<"); id = b[1]; }
                } }
# Report the host ID plus the matching error line, then stop reading this result.
/Could not connect to Condor/ { print id, $0; exit; }
ID: 34130
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34131 - Posted: 28 Jan 2018, 17:53:05 UTC - in response to Message 34130.  

Thanks for your efforts, anyway, Ivan :-)
ID: 34131
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 807
Credit: 652,451,754
RAC: 279,689
Message 34132 - Posted: 28 Jan 2018, 18:58:10 UTC

For me, about 2.6% of CMS tasks fail; that's OK for me, as normally they only waste 10 minutes.
ID: 34132
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,228
RAC: 352
Message 34133 - Posted: 28 Jan 2018, 20:12:52 UTC - in response to Message 34132.  

For me, about 2.6% of CMS tasks fail; that's OK for me, as normally they only waste 10 minutes.

Is it a consistent reason, Toby, like the failure to contact Condor that we've been discussing?
2.6% is "reasonable" given the overall problems and the complexity of the workflow chain. I'd prefer it to be much less, of course!
ID: 34133
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,862,314
RAC: 121,711
Message 34150 - Posted: 29 Jan 2018, 19:24:21 UTC - in response to Message 34126.  

Hello Ivan, as the problems persist and every day some tasks fail with

2018-01-28 12:26:46 (7296): Guest Log: 01/28/18 12:28:29 attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110).
2018-01-28 12:26:46 (7296): Guest Log: ERROR: failed to make connection to <128.142.142.167:9618>
2018-01-28 12:26:47 (7296): Guest Log: [ERROR] Could not ping HTCondor.
This afternoon, I had 4 tasks fail in a row :-(
ID: 34150