CMS Tasks Failing
Author | Message |
---|---|
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
|
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,336,097 RAC: 25,785 |
Thanks, Ivan, for keeping us posted :-) |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
|
Send message Joined: 27 Sep 08 Posts: 847 Credit: 691,214,017 RAC: 104,702 |
Looks like all the projects fell over, not just CMS. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Hopefully back now. |
Send message Joined: 27 Sep 08 Posts: 847 Credit: 691,214,017 RAC: 104,702 |
Thanks Laurence, got to 12 min so should be good. I took the opportunity to upgrade VBox, so not bad :) |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
|
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,336,097 RAC: 25,785 |
"Ah, something's happening and it looks like it may be good news. Various logs are ticking upwards. Fingers crossed..." As far as I can see on my 3 PCs, everything works well again :-) |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,336,097 RAC: 25,785 |
Any idea why the number of running CMS jobs has been falling that drastically in the past few hours, as seen from this chart: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php Yesterday, the number almost reached 1200, now it's at 600. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
"Any idea why the number of running CMS jobs has been falling that drastically in the past few hours, as seen from this chart:" I haven't found a reason for that. It's going back up again now. Looks like a large section of machines weren't getting new jobs and that's cleared now. There is a big spike in the squid proxy traffic from when jobs started running again. There was an increase in Test4Theory jobs at the same time (0930 CERN time) so I'm surmising something in a CERN machine that needed a tweak. |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,336,097 RAC: 25,785 |
Last night and the night before I had cases where a task errored out after 2 minutes. Stderr shows the following:
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] 0
2017-10-16 05:36:35 (4664): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-16 05:36:55 (4664): VM Completion File Detected.
2017-10-16 05:36:55 (4664): VM Completion Message: Could not connect to Condor server on port 9618
What's going wrong? Any problems with the Condor server? |
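To spot this failure across many stderr files, a small scan for the completion message may help. This is only a sketch: the entry layout (timestamp, process id in parentheses, then the text) is inferred from the excerpt in this post, and `completion_messages` is a hypothetical helper name, not part of BOINC or the VM wrapper.

```python
import re

# Zero-width marker for the start of a log entry, e.g. "2017-10-16 05:36:55 (4664):".
# Splitting on it works whether entries are on separate lines or run together.
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \(\d+\):)")

# One entry that carries a VM completion message (layout assumed from the excerpt).
COMPLETION = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"VM Completion Message: (?P<message>.*)$",
    re.DOTALL,
)


def completion_messages(stderr_text):
    """Return (timestamp, message) pairs for every VM completion message found."""
    found = []
    for entry in ENTRY_START.split(stderr_text):
        match = COMPLETION.match(entry.strip())
        if match:
            found.append((match.group("stamp"), match.group("message").strip()))
    return found
```

Feeding it the stderr text of a failed task returns the time and wording of each completion message, which makes it easier to see whether the Condor connection error clusters at particular times.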
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
"Last night and the night before I had cases where a task errored out after 2 minutes." Not that I'm aware of. I don't have any failed tasks, and my monitors show nothing amiss -- except that Theory ran out of jobs and ~400 machines switched to running CMS jobs instead. Which is nice... I'd suggest it's a "local" problem, check if you or your ISP have made any changes to firewall rules, etc., lately. |
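One way to test for a local firewall problem is to try the same kind of TCP connection from the host itself. The sketch below is an assumption-laden illustration: `can_connect` is a hypothetical helper, and the 20-second default simply mirrors the gap between the two timestamps in the quoted log rather than any documented setting.

```python
import socket


def can_connect(host, port, timeout=20.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_connect("vccs.cern.ch", 443)` checks the VCCS endpoint named in the log. The Condor server's hostname does not appear in this thread, so it would have to be filled in before testing port 9618.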
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
|
Send message Joined: 2 May 07 Posts: 2242 Credit: 173,902,375 RAC: 2,798 |
I have CMS and Theory both active in the LHC@home preferences on a server. The work being downloaded alternates every time a task finishes, from CMS to Theory or from Theory to CMS. The timing is good and needs no further watching. It is working well. https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10474793 |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,336,097 RAC: 25,785 |
"What's going wrong? Any problems with the Condor server?" "Not that I'm aware of. I don't have any failed tasks, and my monitors show nothing amiss -- except that Theory ran out of jobs and ~400 machines switched to running CMS jobs instead. Which is nice... I'd suggest it's a 'local' problem, check if you or your ISP have made any changes to firewall rules, etc., lately." Hm, last evening the same thing happened again. Only once though, all other jobs ran okay. A check with my ISP confirmed that no changes were made there. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
|
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
"We seem to be having a high rate of stage-out errors at the moment. I'll let CERN know, and cross my fingers that it's something transient." This appears to have been a CEPH file-store problem (Data Bridge uses CEPH). We have had one major issue this morning around 11:00 - the CEPH gateways pretty much all crashed within one hour, due to running out of file descriptors (this is a configuration issue - we balance the memory needs of both Xrootd and CEPH against the expected concurrency, and got it wrong). They all were promptly restarted, but it looks like in this case CASTOR "forgets" to release the transfer slots assigned to xrootd. Which means that the pool was scheduling new transfers only very slowly. |
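As a rough illustration of the file-descriptor side of that balance, the sketch below compares a process's open-file limit with what a given concurrency would need. Everything numeric here is hypothetical: `descriptor_budget`, the two descriptors per transfer, and the reserve of 64 are illustrative assumptions, not the CERN gateway configuration. It relies on Python's Unix-only `resource` module.

```python
import resource


def descriptor_budget(expected_concurrency, fds_per_transfer=2, reserve=64):
    """Compare the open-file limit with a rough estimate of descriptors needed.

    fds_per_transfer and reserve are illustrative guesses, not measured values.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = expected_concurrency * fds_per_transfer + reserve
    unlimited = soft == resource.RLIM_INFINITY
    return {
        "soft_limit": soft,
        "hard_limit": hard,
        "needed": needed,
        # The estimate fits if there is no limit or it stays within the soft limit.
        "fits": unlimited or needed <= soft,
    }
```

A result with `"fits": False` would suggest that either the limit needs raising or the permitted concurrency needs lowering, which is the kind of trade-off described in the post.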
Send message Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0 |
Just for information: I had this error on this task: 2017-10-19 19:41:07 (3932): VM Completion Message: Could not connect to Condor server on port 9618 It occurred just after a reboot following a big Windows update. I have the Windows Home edition, but perhaps other editions are affected too. My new Windows image is now: Microsoft Windows 10 |
Send message Joined: 18 Dec 15 Posts: 1811 Credit: 118,336,097 RAC: 25,785 |
hm, so it seems that there may indeed be some kind of problem with the Condor Server - which occurs not too often, but once in a while |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,452 RAC: 1,957 |
"hm, so it seems that there may indeed be some kind of problem with the Condor Server - which occurs not too often, but once in a while" There is, still, a very big [authentication] problem with the Condor server. However, Volunteer jobs should not be communicating with it. tl;dr: what communicates with Condor is the log-merge processes, and these should only run on CMS resources. If they try to run on Volunteer hosts, we really need to look into it. We are trying to solve these remaining problems, but the scattered and disparate nature of the people who need to be involved is a drawback. Northern hemisphere summer was a problem, due to holidays. I'd like it to be fixed soon but, you know, winter and Christmas... |
©2024 CERN