Thread 'CMS Tasks Failing'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32727 - Posted: 9 Oct 2017, 18:15:13 UTC - in response to Message 32720. Unfortunately, we still have problems. I'll update this thread as soon as I have any more news.

Erich56 Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 32728 - Posted: 9 Oct 2017, 18:25:12 UTC - in response to Message 32727. Thanks, Ivan, for keeping us posted :-)

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32733 - Posted: 9 Oct 2017, 20:33:46 UTC - in response to Message 32728. Cheers, Erich. Sorry it's not good news yet. On top of that I've been having broadband problems at home tonight -- "Up and down like a whore's drawers!" to put it crudely.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 946 Credit: 784,373,554 RAC: 158,973	Message 32734 - Posted: 9 Oct 2017, 20:52:54 UTC Looks like all the projects fell over, not just CMS.

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 431 Credit: 256,248 RAC: 59	Message 32736 - Posted: 9 Oct 2017, 21:14:19 UTC - in response to Message 32734. Hopefully back now.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 946 Credit: 784,373,554 RAC: 158,973	Message 32737 - Posted: 9 Oct 2017, 21:23:24 UTC Thanks Laurence, got to 12min so should be good. I took the opportunity to upgrade VBox so not bad :)

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32738 - Posted: 9 Oct 2017, 21:27:18 UTC - in response to Message 32733. Ah, something's happening and it looks like it may be good news. Various logs are ticking upwards. Fingers crossed...

Erich56 Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 32744 - Posted: 10 Oct 2017, 4:54:32 UTC - in response to Message 32738. [Quote: Ah, something's happening and it looks like it may be good news. Various logs are ticking upwards. Fingers crossed...] As far as I can see on my 3 PCs, everything works well again :-)

Erich56 Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 32805 - Posted: 12 Oct 2017, 18:19:59 UTC Any idea why the number of running CMS jobs has been falling so drastically in the past few hours, as seen in this chart: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php ? Yesterday the number almost reached 1200; now it's at 600.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32821 - Posted: 13 Oct 2017, 9:04:18 UTC - in response to Message 32805. [Quote: Any idea why the number of running CMS jobs has been falling so drastically in the past few hours, as seen in this chart: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php ? Yesterday the number almost reached 1200; now it's at 600.] I haven't found a reason for that. It's going back up again now. Looks like a large section of machines weren't getting new jobs and that's cleared now. There is a big spike in the squid proxy traffic from when jobs started running again. There was an increase in Test4Theory jobs at the same time (0930 CERN time), so I'm surmising it was something on a CERN machine that needed a tweak.

Erich56 Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 32839 - Posted: 16 Oct 2017, 5:55:07 UTC Last night and the night before I had cases where a task errored out after 2 minutes. Stderr shows the following:
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] 0
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-16 05:36:55 (4664): VM Completion File Detected.
2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect to Condor server on port 9618
What's going wrong? Any problems with Condor server?
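
The stderr excerpt above follows a regular shape: a timestamp, a process ID in parentheses, then the message text. As a rough illustration only (this is not part of the BOINC client or the CMS VM tooling, and the function name `find_completion_messages` is made up for this sketch), the "VM Completion Message" entries could be pulled out of a saved stderr file like this, assuming one log entry per line:

```python
import re

# Line shape assumed from the excerpt in this thread:
#   "YYYY-MM-DD HH:MM:SS (pid): message text"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")
MARKER = "VM Completion Message: "


def find_completion_messages(stderr_text):
    """Return (timestamp, message) pairs for each VM completion message found."""
    results = []
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a log entry
        timestamp, _pid, text = match.groups()
        if text.startswith(MARKER):
            results.append((timestamp, text[len(MARKER):]))
    return results


# The six log lines quoted in the post, one entry per line.
SAMPLE = """\
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] 0
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-16 05:36:55 (4664): VM Completion File Detected.
2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect to Condor server on port 9618
"""
```

Calling `find_completion_messages(SAMPLE)` yields a single pair: the 05:36:55 timestamp and the "Could not connect to Condor server on port 9618" text. Run over several tasks' stderr, this makes it easier to see how often and at what times the failure occurs.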

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32842 - Posted: 16 Oct 2017, 9:51:25 UTC - in response to Message 32839. [Quote: Last night and the night before I had cases where a task errored out after 2 minutes. Stderr shows the following:
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] 0
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-16 05:36:55 (4664): VM Completion File Detected.
2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect to Condor server on port 9618
What's going wrong? Any problems with Condor server?]
Not that I'm aware of. I don't have any failed tasks, and my monitors show nothing amiss -- except that Theory ran out of jobs and ~400 machines switched to running CMS jobs instead. Which is nice... I'd suggest it's a "local" problem; check if you or your ISP have made any changes to firewall rules, etc., lately.
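
One way to follow up on the firewall suggestion is to try a plain TCP connection from the affected machine. The sketch below is an assumption-laden illustration, not what the CMS VM itself runs: the helper name `can_connect` is invented here, and the Condor server's host name is not given anywhere in this thread, so it has to be filled in from elsewhere. Only `vccs.cern.ch` on port 443 and the port number 9618 come from the log.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (left as comments so nothing touches the network on import):
#   can_connect("vccs.cern.ch", 443)         # host and port as seen in the log
#   can_connect(CONDOR_HOST, 9618)           # CONDOR_HOST is a placeholder; the
#                                            # real host name is not in this thread
```

A successful check from the host does not prove the VM's own network path is fine, since the test in the log runs inside the guest, but a failure on port 9618 while port 443 works would point toward a local firewall or ISP filter.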

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32846 - Posted: 16 Oct 2017, 20:07:05 UTC - in response to Message 32842. We now appear to be losing machines back to Theory as CMS tasks reach their 12-18 hour life limit.

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 17,509	Message 32851 - Posted: 17 Oct 2017, 9:37:12 UTC Last modified: 17 Oct 2017, 9:39:30 UTC I have CMS and Theory active in the LHCatHome preferences on a server. The work that is downloaded changes every time a task finishes, from CMS to Theory or from Theory to CMS. The timing is good and it needs no further watching. It is working well. https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10474793

Erich56 Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 32856 - Posted: 19 Oct 2017, 13:52:44 UTC - in response to Message 32842. [Quote: [Quote: What's going wrong? Any problems with Condor server?] Not that I'm aware of. I don't have any failed tasks, and my monitors show nothing amiss -- except that Theory ran out of jobs and ~400 machines switched to running CMS jobs instead. Which is nice... I'd suggest it's a "local" problem; check if you or your ISP have made any changes to firewall rules, etc., lately.] Hm, last evening the same thing happened again. Only once though; all other jobs ran okay. A check with my ISP showed that no changes were made there.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32857 - Posted: 19 Oct 2017, 14:03:35 UTC We seem to be having a high rate of stage-out errors at the moment. I'll let CERN know, and cross my fingers that it's something transient.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32859 - Posted: 20 Oct 2017, 9:50:21 UTC - in response to Message 32857. Last modified: 20 Oct 2017, 9:51:06 UTC [Quote: We seem to be having a high rate of stage-out errors at the moment. I'll let CERN know, and cross my fingers that it's something transient.] This appears to have been a CEPH file-store problem (Data Bridge uses CEPH). We have had one major issue this morning around 11:00 - the CEPH gateways pretty much all crashed within one hour, due to running out of file descriptors (this is a configuration issue - we balance the memory needs of both Xrootd and CEPH against the expected concurrency, and got it wrong). They all were promptly restarted, but it looks like in this case CASTOR "forgets" to release the transfer slots assigned to xrootd. Which means that the pool was scheduling new transfers only very slowly. I have cleaned up these stale transfers at around 18:56, and see that throughput has gone up.

PHILIPPE Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0	Message 32860 - Posted: 20 Oct 2017, 15:08:41 UTC - in response to Message 32859. Just for information: I had this error on this task: 2017-10-19 19:41:07 (3932): VM Completion Message: Could not connect to Condor server on port 9618 It occurred just after a reboot following a big Windows update. I have the Windows Home edition, but perhaps other editions are affected too. My new Windows build is now: Microsoft Windows 10 Core x64 Edition, (10.00.16299.00)

Erich56 Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 32861 - Posted: 20 Oct 2017, 16:58:53 UTC Hm, so it seems that there may indeed be some kind of problem with the Condor server, one that occurs not too often, but once in a while.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 32864 - Posted: 20 Oct 2017, 22:35:16 UTC - in response to Message 32861. [Quote: Hm, so it seems that there may indeed be some kind of problem with the Condor server, one that occurs not too often, but once in a while.] There is, still, a very big [authentication] problem with the Condor server. However, Volunteer jobs should not be communicating with it. tl;dr: what communicates with Condor is the log-merge processes, and these should only run on CMS resources. If they try to run on Volunteer hosts, we really need to look into it. We are trying to solve these remaining problems, but the scattered and disparate nature of the people who need to be involved is a drawback. Northern hemisphere summer was a problem, due to holidays. I'd like it to be fixed soon but, you know, winter and Christmas...