CMS Tasks Failing
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1814 Credit: 118,470,302 RAC: 30,049 |
"The CMS WMAgent is down" What catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration? |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
"The CMS WMAgent is down" I really don't know; it's not my area at all. I do know (I think) that what I call "WMAgent" is a VM running some sort of master server and a number of "components" that each have a job to do: job creation, job scheduling, accounting, etc. Usually it's one of these components that fails, and sometimes the failure has no direct effect on job queuing and retrieval (an accounting component failed a while back; jobs still flowed but the graphs didn't update). Often the component can be restarted without affecting the rest. Last night it seemed to be a general failure. That said, it does seem to happen too often after some part of the system has been upgraded, and that does seem to happen too often on a Friday night, or sometimes at weekends. I'll try to remember to give someone a gentle nudge next week; maybe I can work it into a talk I'm giving to the Collaboration as a whole next Thursday. :-) |
Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 2,013 |
"what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?" In IT you can minimize errors and interruptions, but you will never be 100% free of them. This is my experience from more than 30 years. CERN IT is doing a great job. |
Joined: 28 Sep 04 Posts: 728 Credit: 49,034,391 RAC: 27,290 |
It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,470,302 RAC: 30,049 |
"It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS." I fully agree, because what is shown on the server status page now is actually misleading information :-( |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,470,302 RAC: 30,049 |
"what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?" I fully agree, but the WMAgent still seems to fail quite often, so I am just curious what's behind this. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,470,302 RAC: 30,049 |
Since last night, all tasks have been failing after 10-12 minutes. What's the problem? |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,470,302 RAC: 30,049 |
"No CMS jobs available" Any idea when new jobs will be available again? I guess not before next week? |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
Sorry, just woke up to find this. Can't see what the problem is yet; it doesn't appear to be the WMAgent this time. It could take a while to fix: Thursday was a holiday at CERN, and many people turn that into a four-day weekend. (It's a long weekend here in the UK, but the important people are at CERN.) |
Joined: 15 Jun 08 Posts: 2534 Credit: 253,832,787 RAC: 37,309 |
Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES? According to the server status page, the number of WUs is still high. According to Laurence's post, it should be possible to stop it. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
"Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES?" We're running again; we had a backlog of merge jobs that filled the queue. Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority." |
Joined: 18 Sep 04 Posts: 30 Credit: 5,100,929 RAC: 0 |
Two CMS tasks failed after 10 min, each on a different host (one Windows, one Linux): https://lhcathome.cern.ch/lhcathome/result.php?resultid=144126716 https://lhcathome.cern.ch/lhcathome/result.php?resultid=144113647 Michael. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,470,302 RAC: 30,049 |
Also here: "[ERROR] Condor exited after 609s without running a Job". The complete report can be seen at: https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129 |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
"also here: '[ERROR] Condor exited after 609s without running a Job'" Unfortunately that URL was incomplete; it looks like it should have been https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129213 . Things look OK at our end, though there is a large oscillation in the number of jobs after an outage such as this, which we have yet to find an explanation for. It damps down eventually. Other tasks on that machine seem to be getting jobs, so unless it continues to happen I'd suggest regarding it as a one-off glitch. Looking at your machine details, I'd also suggest that running ten 2-GB VMs on a 32 GB machine might be pushing the limits. Check your memory usage with Task Manager. I'd also strongly suggest a Windows upgrade if you can; XP is beyond its useful life and is supported only in exceptional security circumstances (such as the ransomware earlier this month). Unless you've hacked the registry to make Microsoft think it's a PoS terminal... |
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
"Message from Laurence, 'Next week I will implement the automatic shutdown of tasks as a priority.'" An automated kill switch has been implemented. The sending of tasks should stop when the queue is too low. We will find out whether this works the next time there is a problem. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes). |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
"I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes)." No change yet: agent: vocms0159.cern.ch (1.1.2.patch2); agent last updated: 2017/5/29 (Mon) 19:20:19 UTC : 0 h 0 m; data last updated: N/A; status: Components or Thread down; team: testbed-vocms0159 |
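
For anyone wondering what the kill switch mentioned earlier in the thread ("the sending of tasks should be stopped when the queue is too low") might look like, here is a minimal sketch in Python. It is only an illustration of the idea, not the project's actual code: the class name `KillSwitch`, both threshold values, and the separate higher mark for resuming are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class KillSwitch:
    """Hypothetical kill switch: pause task sending while the job queue is low."""

    min_queued_jobs: int     # assumed threshold below which sending stops
    resume_queued_jobs: int  # assumed higher mark at which sending resumes
    sending: bool = True     # current state: are tasks being sent out?

    def update(self, queued_jobs: int) -> bool:
        """Take the latest queue length and return whether to keep sending tasks."""
        if self.sending and queued_jobs < self.min_queued_jobs:
            # Queue too low: stop handing out tasks that would find no job.
            self.sending = False
        elif not self.sending and queued_jobs >= self.resume_queued_jobs:
            # Queue has refilled well past the stop mark: start sending again.
            self.sending = True
        return self.sending


# Example run with made-up queue lengths from successive monitoring checks.
switch = KillSwitch(min_queued_jobs=50, resume_queued_jobs=200)
decisions = [switch.update(q) for q in (500, 40, 120, 250)]
print(decisions)
```

The resume mark is set above the stop mark so that the switch does not flip back and forth while the queue hovers near a single threshold. Whether the real implementation does anything like this is not stated in the thread.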
©2024 CERN