Thread 'CMS Tasks Failing'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30319 - Posted: 12 May 2017, 23:56:00 UTC - in response to Message 30318. LHCb is back too Great. As far as I know they're not that connected, so let's put it down to coincidence for now. ID: 30319 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,853,170 RAC: 92,130	Message 30321 - Posted: 13 May 2017, 4:29:14 UTC - in response to Message 30315. The CMS WMAgent is down what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration? ID: 30321 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30323 - Posted: 13 May 2017, 7:14:07 UTC - in response to Message 30321. The CMS WMAgent is down what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration? I really don't know, it's not my area at all. I do know (I think) that what I call "WMAgent" is a VM running some sort of master server and a number of "components" that each have a job to do -- job creation, job scheduling, accounting, etc. Usually it's one of these components that fails, and sometimes the failure has no direct effect on job queuing and retrieval (an accounting component failed a while back; jobs still flowed but the graphs didn't update). Often the component can be restarted without affecting the rest. Last night it seemed to be a general failure. That said, it does seem to happen too often after some part of the system has been upgraded, and that does seem to happen too often on a Friday night, or sometimes at weekends. I'll try to remember to give someone a gentle nudge next week, maybe I can work it into a talk I'm giving to the Collaboration as a whole next Thursday. :-) ID: 30323 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 8,570	Message 30324 - Posted: 13 May 2017, 7:45:56 UTC - in response to Message 30321. Last modified: 13 May 2017, 7:47:46 UTC what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration? In IT you can minimize the errors and interrupts, but never reach 100%. This is my experience from more than 30 years. Cern-IT is doing a great job. ID: 30324 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 810 Credit: 66,183,029 RAC: 22,468	Message 30325 - Posted: 13 May 2017, 8:28:59 UTC It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS. ID: 30325 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,853,170 RAC: 92,130	Message 30326 - Posted: 13 May 2017, 12:41:51 UTC - in response to Message 30325. It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS. I fully agree - because what is shown on the server status page now is actually misleading information :-( ID: 30326 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,853,170 RAC: 92,130	Message 30327 - Posted: 13 May 2017, 12:43:45 UTC - in response to Message 30324. what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration? In IT you can minimize the errors and interrupts, but never reach 100%. I fully agree - but yet the WMAgent seems to fail quite often. So I am just curious what's behind this. ID: 30327 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,853,170 RAC: 92,130	Message 30508 - Posted: 27 May 2017, 4:30:19 UTC Since last night, all tasks have been failing after 10-12 minutes. What's the problem? ID: 30508 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1559 Credit: 10,102,496 RAC: 691	Message 30509 - Posted: 27 May 2017, 5:15:53 UTC - in response to Message 30508. Since last night, all tasks failing after 10-12 minutes. What's the Problem? No CMS jobs available ID: 30509 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,853,170 RAC: 92,130	Message 30511 - Posted: 27 May 2017, 5:30:24 UTC - in response to Message 30509. No CMS jobs available Any idea when new jobs will be available again? I guess NOT before next week? ID: 30511 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30515 - Posted: 27 May 2017, 7:05:46 UTC - in response to Message 30511. Sorry, just woke up to find this. Can't see what the problem is yet, it doesn't appear to be the WMAgent this time. Could take a while to fix, it was a holiday at CERN on Thursday, so many people turn that into a four-day weekend. (It's a long weekend here in the UK, but the important people are at CERN.) ID: 30515 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,087,738 RAC: 114,432	Message 30520 - Posted: 27 May 2017, 11:01:44 UTC Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES? According to the server status page the #WUs is still high. According to Laurence's post it should be possible to stop it. ID: 30520 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30521 - Posted: 27 May 2017, 12:11:01 UTC - in response to Message 30520. Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES? According to the server status page the #WUs is still high. According to Laurence's post it should be possible to stop it. We're running again; we had a backlog of merge jobs that filled the queue. Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority." ID: 30521 · Reply Quote

Michael H.W. Weber Send message Joined: 18 Sep 04 Posts: 32 Credit: 5,101,128 RAC: 0	Message 30523 - Posted: 27 May 2017, 13:24:59 UTC Last modified: 27 May 2017, 13:25:51 UTC Two CMS tasks failing after 10 min; each on a different host, once Windows, once Linux: https://lhcathome.cern.ch/lhcathome/result.php?resultid=144126716 https://lhcathome.cern.ch/lhcathome/result.php?resultid=144113647 Michael. Rechenkraft.net ID: 30523 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1989 Credit: 162,853,170 RAC: 92,130	Message 30526 - Posted: 27 May 2017, 16:57:53 UTC - in response to Message 30523. also here: "[ERROR] Condor exited after 609s without running a Job" complete report can be seen: https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129 ID: 30526 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30527 - Posted: 27 May 2017, 19:58:03 UTC - in response to Message 30526. Last modified: 27 May 2017, 19:59:14 UTC also here: "[ERROR] Condor exited after 609s without running a Job" complete report can be seen: https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129 Unfortunately that URL was incomplete; looks like it should have been https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129213 . Things look OK our end, though there is a large oscillation in the number of jobs after an outage such as this, one that we have yet to find an explanation for. It damps down eventually. Other tasks on that machine seem to be getting jobs, so unless it continues to happen I'd suggest regarding it as a one-off glitch. Looking at your machine details, I'd also suggest that running ten 2-GB VMs on a 32 GB machine might be pushing the limits. Check your memory usage with Task Manager. I'd also strongly suggest a Windows upgrade if you can; XP is beyond its useful life and supported only in exceptional security circumstances (such as the ransomware earlier this month). Unless you've hacked the registry to make Microsoft think it's a PoS terminal... ID: 30527 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 256,248 RAC: 28	Message 30543 - Posted: 29 May 2017, 12:58:23 UTC - in response to Message 30521. Last modified: 29 May 2017, 14:29:20 UTC Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority." An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem. ID: 30543 · Reply Quote
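
For readers wondering what "stop sending tasks when the queue is too low" could look like, here is a minimal sketch in Python. It is not the project's actual implementation, which is not described in this thread. The class name, the threshold values and the second, higher "resume" threshold are all assumptions made for illustration; the resume threshold is added only so the switch does not flip on and off while the queue hovers around a single limit.

```python
from dataclasses import dataclass


@dataclass
class KillSwitch:
    """Hypothetical queue-based switch; names and numbers are illustrative."""
    low_water: int = 50     # assumed: stop sending tasks below this many queued jobs
    high_water: int = 200   # assumed: resume sending once the queue has refilled to this
    sending: bool = True    # current state: are tasks being handed out?

    def update(self, queued_jobs: int) -> bool:
        """Feed in the latest queue length; return whether tasks should be sent."""
        if self.sending and queued_jobs < self.low_water:
            # Queue is nearly empty: new tasks would start and find no job.
            self.sending = False
        elif not self.sending and queued_jobs >= self.high_water:
            # Queue has recovered well past the stop level: resume.
            self.sending = True
        return self.sending


if __name__ == "__main__":
    switch = KillSwitch()
    for queued in (500, 40, 100, 200, 60):
        print(queued, switch.update(queued))
```

With a single threshold the switch would toggle every time the queue crossed that one value; the gap between the two assumed limits is a common way to avoid that, but whether the real kill switch does anything similar is unknown.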

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30546 - Posted: 29 May 2017, 15:56:54 UTC - in response to Message 30543. Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority." An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem. Thanks, Laurence. ID: 30546 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30549 - Posted: 29 May 2017, 18:28:40 UTC I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes). ID: 30549 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,924,350 RAC: 7,468	Message 30551 - Posted: 29 May 2017, 19:24:23 UTC - in response to Message 30549. I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes). No change yet: agent: vocms0159.cern.ch (1.1.2.patch2) agent last updated: 2017/5/29 (Mon) 19:20:19 UTC : 0 h 0 m data last updated: N/A status: Components or Thread down; team: testbed-vocms0159 ID: 30551 · Reply Quote