Message boards : CMS Application : CMS Tasks Failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32727 - Posted: 9 Oct 2017, 18:15:13 UTC - in response to Message 32720.  

Unfortunately, we still have problems. I'll update this thread as soon as I have any more news.
ID: 32727
Erich56

Joined: 18 Dec 15
Posts: 1811
Credit: 118,336,097
RAC: 25,785
Message 32728 - Posted: 9 Oct 2017, 18:25:12 UTC - in response to Message 32727.  

Thanks, Ivan, for keeping us posted :-)
ID: 32728
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32733 - Posted: 9 Oct 2017, 20:33:46 UTC - in response to Message 32728.  

Cheers, Erich. Sorry it's not good news yet. On top of that I've been having broadband problems at home tonight -- "Up and down like a whore's drawers!" to put it crudely.
ID: 32733
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 847
Credit: 691,214,017
RAC: 104,702
Message 32734 - Posted: 9 Oct 2017, 20:52:54 UTC

Looks like all the projects fell over, not just CMS.
ID: 32734
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 380
Credit: 238,712
RAC: 0
Message 32736 - Posted: 9 Oct 2017, 21:14:19 UTC - in response to Message 32734.  

Hopefully back now.
ID: 32736
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 847
Credit: 691,214,017
RAC: 104,702
Message 32737 - Posted: 9 Oct 2017, 21:23:24 UTC

Thanks Laurence, got to 12min so should be good.

I took the opportunity to upgrade VBox, so not bad :)
ID: 32737
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32738 - Posted: 9 Oct 2017, 21:27:18 UTC - in response to Message 32733.  

Ah, something's happening and it looks like it may be good news. Various logs are ticking upwards. Fingers crossed...
ID: 32738
Erich56

Joined: 18 Dec 15
Posts: 1811
Credit: 118,336,097
RAC: 25,785
Message 32744 - Posted: 10 Oct 2017, 4:54:32 UTC - in response to Message 32738.  

ivan wrote:
> Ah, something's happening and it looks like it may be good news. Various logs are ticking upwards. Fingers crossed...

As far as I can see on my 3 PCs, everything works well again :-)
ID: 32744
Erich56

Joined: 18 Dec 15
Posts: 1811
Credit: 118,336,097
RAC: 25,785
Message 32805 - Posted: 12 Oct 2017, 18:19:59 UTC

Any idea why the number of running CMS jobs has been falling so drastically in the past few hours, as seen in this chart?

https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php

Yesterday the number almost reached 1200; now it's at 600.
ID: 32805
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32821 - Posted: 13 Oct 2017, 9:04:18 UTC - in response to Message 32805.  

Erich56 wrote:
> Any idea why the number of running CMS jobs has been falling so drastically in the past few hours, as seen in this chart?
>
> https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php
>
> Yesterday the number almost reached 1200; now it's at 600.

I haven't found a reason for that. It's going back up again now. It looks like a large section of machines weren't getting new jobs, and that has cleared now. There is a big spike in the squid proxy traffic from when jobs started running again. There was an increase in Test4Theory jobs at the same time (0930 CERN time), so I surmise that something on a CERN machine needed a tweak.
ID: 32821
Erich56

Joined: 18 Dec 15
Posts: 1811
Credit: 118,336,097
RAC: 25,785
Message 32839 - Posted: 16 Oct 2017, 5:55:07 UTC

Last night and the night before I had cases where a task errored out after 2 minutes.
Stderr shows the following:

2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] 0
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-16 05:36:55 (4664): VM Completion File Detected.
2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect to Condor server on port 9618


What's going wrong? Any problems with the Condor server?
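
For anyone who wants to spot this kind of failure across many tasks, here is a minimal sketch in Python that scans stderr text for the completion message. It assumes the line layout shown in the excerpt above (timestamp, process ID in parentheses, then "VM Completion Message:"). The function name find_completion_messages and the pattern are illustrative only; they are not part of BOINC or the CMS application.

```python
import re

# Matches lines such as:
# 2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect ...
# The layout is an assumption based on the excerpt quoted in this thread.
COMPLETION_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<pid>\d+)\): VM Completion Message: (?P<message>.*)$"
)


def find_completion_messages(stderr_text):
    """Return (timestamp, message) pairs for every completion message found."""
    results = []
    for line in stderr_text.splitlines():
        match = COMPLETION_RE.match(line.strip())
        if match:
            results.append((match.group("timestamp"), match.group("message")))
    return results


# Three lines copied from the stderr excerpt above.
SAMPLE = """\
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-16 05:36:55 (4664): VM Completion File Detected.
2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect to Condor server on port 9618
"""

if __name__ == "__main__":
    for timestamp, message in find_completion_messages(SAMPLE):
        print(timestamp, "->", message)
```

Running it over saved stderr output from several tasks would show at a glance when the Condor connection failures happened.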
ID: 32839
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32842 - Posted: 16 Oct 2017, 9:51:25 UTC - in response to Message 32839.  

Erich56 wrote:
> Last night and the night before I had cases where a task errored out after 2 minutes.
> Stderr shows the following:
>
> 2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
> 2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
> 2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] 0
> 2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
> 2017-10-16 05:36:55 (4664): VM Completion File Detected.
> 2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect to Condor server on port 9618
>
> What's going wrong? Any problems with the Condor server?

Not that I'm aware of. I don't have any failed tasks, and my monitors show nothing amiss -- except that Theory ran out of jobs and ~400 machines switched to running CMS jobs instead. Which is nice... I'd suggest it's a "local" problem; check whether you or your ISP have made any changes to firewall rules, etc., lately.
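
A quick way to test for a local block is to try opening a TCP connection from the affected PC. The sketch below is only an illustration: can_connect is a made-up helper, vccs.cern.ch and ports 443 and 9618 come from the log above, and the Condor hostname is a placeholder because this thread does not name the server.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


if __name__ == "__main__":
    # Hypothetical placeholder: replace with the actual Condor server name,
    # which is not given in this thread.
    CONDOR_HOST = "condor.example.org"
    print("vccs.cern.ch:443 ->", can_connect("vccs.cern.ch", 443))
    print(CONDOR_HOST + ":9618 ->", can_connect(CONDOR_HOST, 9618))
```

If port 443 succeeds while port 9618 fails repeatedly, a firewall or router rule on the outgoing path is a plausible suspect; a single failure among many successes points more towards something transient.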
ID: 32842
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32846 - Posted: 16 Oct 2017, 20:07:05 UTC - in response to Message 32842.  

We now appear to be losing machines back to Theory as CMS tasks reach their 12-18 hour life limit.
ID: 32846
maeax

Joined: 2 May 07
Posts: 2242
Credit: 173,902,375
RAC: 2,798
Message 32851 - Posted: 17 Oct 2017, 9:37:12 UTC
Last modified: 17 Oct 2017, 9:39:30 UTC

I have both CMS and Theory enabled in the LHCatHome preferences on a server.

Each time a task finishes, the work that gets downloaded alternates between CMS and Theory.
The timing is good and needs no further watching. It is working well.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10474793
ID: 32851
Erich56

Joined: 18 Dec 15
Posts: 1811
Credit: 118,336,097
RAC: 25,785
Message 32856 - Posted: 19 Oct 2017, 13:52:44 UTC - in response to Message 32842.  

Erich56 wrote:
> What's going wrong? Any problems with the Condor server?

ivan wrote:
> Not that I'm aware of. I don't have any failed tasks, and my monitors show nothing amiss -- except that Theory ran out of jobs and ~400 machines switched to running CMS jobs instead. Which is nice... I'd suggest it's a "local" problem; check whether you or your ISP have made any changes to firewall rules, etc., lately.

Hm, yesterday evening the same thing happened again, though only once; all other jobs ran okay.
I checked with my ISP, and no changes were made there.
ID: 32856
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32857 - Posted: 19 Oct 2017, 14:03:35 UTC

We seem to be having a high rate of stage-out errors at the moment. I'll let CERN know, and cross my fingers that it's something transient.
ID: 32857
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32859 - Posted: 20 Oct 2017, 9:50:21 UTC - in response to Message 32857.  
Last modified: 20 Oct 2017, 9:51:06 UTC

ivan wrote:
> We seem to be having a high rate of stage-out errors at the moment. I'll let CERN know, and cross my fingers that it's something transient.

This appears to have been a CEPH file-store problem (Data Bridge uses CEPH).
We have had one major issue this morning around 11:00 - the CEPH gateways pretty much all crashed within one hour, due to running out of file descriptors (this is a configuration issue - we balance the memory needs of both Xrootd and CEPH against the expected concurrency, and got it wrong). They all were promptly restarted, but it looks like in this case CASTOR "forgets" to release the transfer slots assigned to xrootd. Which means that the pool was scheduling new transfers only very slowly.
I have cleaned up these stale transfers at around 18:56, and see that throughput has gone up.
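
As a rough illustration of the file-descriptor limit mentioned above, the sketch below shows how a process on Linux can inspect its own limit and current usage. It is not the configuration used on the CEPH gateways; the helper names fd_limits, open_fd_count and headroom are made up for this example, and the count relies on /proc/self/fd, which is Linux-specific.

```python
import os
import resource  # standard library, Unix only


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open; assumes Linux, where /proc/self/fd exists."""
    return len(os.listdir("/proc/self/fd"))


def headroom(soft_limit, in_use):
    """Descriptors left before the soft limit is hit, or None if there is no limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(soft_limit - in_use, 0)


if __name__ == "__main__":
    soft, hard = fd_limits()
    used = open_fd_count()
    print(f"soft={soft} hard={hard} in_use={used} headroom={headroom(soft, used)}")
```

A service that opens one descriptor per concurrent transfer runs out once the number of transfers reaches the soft limit, which is why the limit has to be sized against the expected concurrency.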

ID: 32859
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 32860 - Posted: 20 Oct 2017, 15:08:41 UTC - in response to Message 32859.  

Just for information:

I had this error on this task:
2017-10-19 19:41:07 (3932): VM Completion Message: Could not connect to Condor server on port 9618

It occurred just after a reboot that followed a big Windows update.
I have the Windows Home edition, but perhaps other editions are affected too.
My new Windows image is now:
Microsoft Windows 10 Core x64 Edition, (10.00.16299.00)
ID: 32860
Erich56

Joined: 18 Dec 15
Posts: 1811
Credit: 118,336,097
RAC: 25,785
Message 32861 - Posted: 20 Oct 2017, 16:58:53 UTC

Hm, so it seems that there may indeed be some kind of problem with the Condor server, one which occurs not too often, but once in a while.
ID: 32861
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,452
RAC: 1,957
Message 32864 - Posted: 20 Oct 2017, 22:35:16 UTC - in response to Message 32861.  

Erich56 wrote:
> Hm, so it seems that there may indeed be some kind of problem with the Condor server, one which occurs not too often, but once in a while.

There is, still, a very big [authentication] problem with the Condor server. However, Volunteer jobs should not be communicating with it.
tl;dr: what communicates with Condor is the log-merge processes, and these should only run on CMS resources. If they try to run on Volunteer hosts, we really need to look into it.
We are trying to solve these remaining problems, but the scattered and disparate nature of the people who need to be involved is a drawback. Northern hemisphere summer was a problem, due to holidays. I'd like it to be fixed soon but, you know, winter and Christmas...
ID: 32864


©2024 CERN