Message boards : CMS Application : CMS Tasks Failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30319 - Posted: 12 May 2017, 23:56:00 UTC - in response to Message 30318.  

LHCb is back too

Great. As far as I know they're not that connected, so let's put it down to coincidence for now.
ID: 30319
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,401,083
RAC: 102,312
Message 30321 - Posted: 13 May 2017, 4:29:14 UTC - in response to Message 30315.  

The CMS WMAgent is down

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?
ID: 30321
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30323 - Posted: 13 May 2017, 7:14:07 UTC - in response to Message 30321.  

The CMS WMAgent is down

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?

I really don't know; it's not my area at all. I do know (I think) that what I call "WMAgent" is a VM running some sort of master server and a number of "components" that each have a job to do -- job creation, job scheduling, accounting, etc. Usually it's one of these components that fails, and sometimes the failure has no direct effect on job queuing and retrieval (an accounting component failed a while back; jobs still flowed but the graphs didn't update). Often the component can be restarted without affecting the rest. Last night it seemed to be a general failure.
That said, it does seem to happen too often after some part of the system has been upgraded, and that does seem to happen too often on a Friday night, or sometimes at weekends. I'll try to remember to give someone a gentle nudge next week, maybe I can work it into a talk I'm giving to the Collaboration as a whole next Thursday. :-)
ID: 30323
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,150,336
RAC: 105,728
Message 30324 - Posted: 13 May 2017, 7:45:56 UTC - in response to Message 30321.  
Last modified: 13 May 2017, 7:47:46 UTC

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?


In IT you can minimize the errors and interrupts,
but never reach 100%.
This is my experience from more than 30 years.

Cern-IT is doing a great job.
ID: 30324
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,620
RAC: 15,489
Message 30325 - Posted: 13 May 2017, 8:28:59 UTC

It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS.
ID: 30325
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,401,083
RAC: 102,312
Message 30326 - Posted: 13 May 2017, 12:41:51 UTC - in response to Message 30325.  

It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS.

I fully agree - because what is shown on the server status page now is actually misleading information :-(
ID: 30326
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,401,083
RAC: 102,312
Message 30327 - Posted: 13 May 2017, 12:43:45 UTC - in response to Message 30324.  

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?


In IT you can minimize the errors and interrupts,
but never reach 100%.

I fully agree - but yet the WMAgent seems to fail quite often. So I am just curious what's behind this.
ID: 30327
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,401,083
RAC: 102,312
Message 30508 - Posted: 27 May 2017, 4:30:19 UTC

Since last night, all tasks failing after 10-12 minutes.
What's the Problem?
ID: 30508
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30509 - Posted: 27 May 2017, 5:15:53 UTC - in response to Message 30508.  

Since last night, all tasks failing after 10-12 minutes.
What's the Problem?

No CMS jobs available
ID: 30509
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,401,083
RAC: 102,312
Message 30511 - Posted: 27 May 2017, 5:30:24 UTC - in response to Message 30509.  

No CMS jobs available

Any idea when new jobs will be available again? I guess not before next week?
ID: 30511
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30515 - Posted: 27 May 2017, 7:05:46 UTC - in response to Message 30511.  

Sorry, just woke up to find this. Can't see what the problem is yet; it doesn't appear to be the WMAgent this time. It could take a while to fix: Thursday was a holiday at CERN, and many people turn that into a four-day weekend. (It's a long weekend here in the UK, but the important people are at CERN.)
ID: 30515
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,960,043
RAC: 136,926
Message 30520 - Posted: 27 May 2017, 11:01:44 UTC

Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES?
According to the server status page the #WUs is still high.
According to Laurence's post it should be possible to stop it.
ID: 30520
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30521 - Posted: 27 May 2017, 12:11:01 UTC - in response to Message 30520.  

Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES?
According to the server status page the #WUs is still high.
According to Laurence's post it should be possible to stop it.

We're running again; we had a backlog of merge jobs that filled the queue.
Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."
ID: 30521
Michael H.W. Weber

Joined: 18 Sep 04
Posts: 30
Credit: 5,100,929
RAC: 0
Message 30523 - Posted: 27 May 2017, 13:24:59 UTC
Last modified: 27 May 2017, 13:25:51 UTC

Two CMS tasks failed after 10 min, each on a different host (one Windows, one Linux):

https://lhcathome.cern.ch/lhcathome/result.php?resultid=144126716
https://lhcathome.cern.ch/lhcathome/result.php?resultid=144113647

Michael.
ID: 30523
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,401,083
RAC: 102,312
Message 30526 - Posted: 27 May 2017, 16:57:53 UTC - in response to Message 30523.  

also here: "[ERROR] Condor exited after 609s without running a Job"

complete report can be seen:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129
ID: 30526
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30527 - Posted: 27 May 2017, 19:58:03 UTC - in response to Message 30526.  
Last modified: 27 May 2017, 19:59:14 UTC

also here: "[ERROR] Condor exited after 609s without running a Job"

complete report can be seen:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129

Unfortunately that URL was incomplete; it looks like it should have been https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129213 . Things look OK at our end, though after an outage such as this there is a large oscillation in the number of jobs, which we have yet to find an explanation for. It damps down eventually. Other tasks on that machine seem to be getting jobs, so unless it continues to happen I'd suggest regarding it as a one-off glitch.
Looking at your machine details, I'd also suggest that running ten 2-GB VMs on a 32 GB machine might be pushing the limits. Check your memory usage with Task Manager. I'd also strongly suggest a Windows Upgrade if you can; XP is beyond useful life and supported only in exceptional security circumstances (such as the ransomware earlier this month). Unless you've hacked the registry to make Microsoft think it's a PoS terminal...
ID: 30527
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30543 - Posted: 29 May 2017, 12:58:23 UTC - in response to Message 30521.  
Last modified: 29 May 2017, 14:29:20 UTC

Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."


An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem.
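
For anyone wondering what a queue-based kill switch of this kind might look like, here is a minimal sketch. It is not the project's actual implementation: the class name, the threshold values, and the use of separate stop and resume levels (hysteresis, so that an oscillating queue does not flip the switch on every poll) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskKillSwitch:
    """Hypothetical kill switch: stop sending tasks while the job queue is too low."""

    stop_below: int       # assumed threshold: stop sending when fewer jobs are queued
    resume_at: int        # assumed threshold: resume once the queue has recovered
    sending: bool = True  # current state of task distribution

    def update(self, queued_jobs: int) -> bool:
        """Feed in the latest queue length; return True if tasks should be sent."""
        if self.sending and queued_jobs < self.stop_below:
            self.sending = False
        elif not self.sending and queued_jobs >= self.resume_at:
            self.sending = True
        return self.sending


if __name__ == "__main__":
    switch = TaskKillSwitch(stop_below=50, resume_at=200)
    for queue_length in (500, 40, 100, 250):
        print(queue_length, switch.update(queue_length))
```

With these made-up thresholds, a queue that recovers only to 100 jobs keeps the switch off; sending resumes once the queue is back at 200 or more.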
ID: 30543
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30546 - Posted: 29 May 2017, 15:56:54 UTC - in response to Message 30543.  

Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."


An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem.

Thanks, Laurence.
ID: 30546
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30549 - Posted: 29 May 2017, 18:28:40 UTC

I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes).
ID: 30549
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 30551 - Posted: 29 May 2017, 19:24:23 UTC - in response to Message 30549.  

I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes).

No change yet:

agent: vocms0159.cern.ch (1.1.2.patch2)
agent last updated: 2017/5/29 (Mon) 19:20:19 UTC : 0 h 0 m
data last updated: N/A
status: Components or Thread down;
team: testbed-vocms0159
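
A status block like the one above is easy to check by script. The following is a rough sketch only: it assumes the block is available as plain "key: value" text, and the function names and the rule "status contains the word down" are my own assumptions, not part of any WMAgent tooling.

```python
SAMPLE_STATUS = """\
agent: vocms0159.cern.ch (1.1.2.patch2)
agent last updated: 2017/5/29 (Mon) 19:20:19 UTC : 0 h 0 m
data last updated: N/A
status: Components or Thread down;
team: testbed-vocms0159
"""


def parse_status(text: str) -> dict[str, str]:
    """Split each 'key: value' line at the first colon and return a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields


def looks_degraded(fields: dict[str, str]) -> bool:
    """Assumed rule: the agent is degraded if its status mentions 'down'."""
    return "down" in fields.get("status", "").lower()


if __name__ == "__main__":
    parsed = parse_status(SAMPLE_STATUS)
    print(parsed["agent"], "degraded:", looks_degraded(parsed))
```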
ID: 30551