Message boards : CMS Application : CMS Tasks Failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30319 - Posted: 12 May 2017, 23:56:00 UTC - in response to Message 30318.  

LHCb is back too

Great. As far as I know they're not that connected, so let's put it down to coincidence for now.
ID: 30319 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,265,793
RAC: 58,030
Message 30321 - Posted: 13 May 2017, 4:29:14 UTC - in response to Message 30315.  

The CMS WMAgent is down

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?
ID: 30321 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30323 - Posted: 13 May 2017, 7:14:07 UTC - in response to Message 30321.  

The CMS WMAgent is down

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?

I really don't know, it's not my area at all. I do know (I think) that what I call "WMAgent" is a VM running some sort of master server and a number of "components" that each have a job to do -- job creation, job scheduling, accounting, etc. Usually it's one of these components that fails, and sometimes they don't have any direct effect on the job queuing and retrieval (an accounting component failed a while back; jobs still flowed but the graphs didn't update). Often the component can be restarted without affecting the rest. Last night it seemed to be a general failure.
That said, it does seem to happen too often after some part of the system has been upgraded, and that does seem to happen too often on a Friday night, or sometimes at weekends. I'll try to remember to give someone a gentle nudge next week, maybe I can work it into a talk I'm giving to the Collaboration as a whole next Thursday. :-)
ID: 30323 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 1588
Credit: 65,557,270
RAC: 228,557
Message 30324 - Posted: 13 May 2017, 7:45:56 UTC - in response to Message 30321.  
Last modified: 13 May 2017, 7:47:46 UTC

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?


In IT you can minimize the errors and interrupts,
but never reach 100%.
This is my experience from more than 30 years.

Cern-IT is doing a great job.
ID: 30324 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 589
Credit: 33,853,623
RAC: 20,198
Message 30325 - Posted: 13 May 2017, 8:28:59 UTC

It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS.
ID: 30325 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,265,793
RAC: 58,030
Message 30326 - Posted: 13 May 2017, 12:41:51 UTC - in response to Message 30325.  

It would be nice if the status of all these servers for VM jobs and the number of available jobs could be visible on the server status page. This of course should be done for all apps, not just CMS.

I fully agree - because what is shown on the server status page now is actually misleading information :-(
ID: 30326 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,265,793
RAC: 58,030
Message 30327 - Posted: 13 May 2017, 12:43:45 UTC - in response to Message 30324.  

what catches my eye is that this WMAgent has been down quite often in the (recent) past - what is the reason? Misconfiguration?


In IT you can minimize the errors and interrupts,
but never reach 100%.

I fully agree - and yet the WMAgent seems to fail quite often. So I am just curious what's behind this.
ID: 30327 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,265,793
RAC: 58,030
Message 30508 - Posted: 27 May 2017, 4:30:19 UTC

Since last night, all tasks failing after 10-12 minutes.
What's the Problem?
ID: 30508 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1122
Credit: 6,901,022
RAC: 1,117
Message 30509 - Posted: 27 May 2017, 5:15:53 UTC - in response to Message 30508.  

Since last night, all tasks failing after 10-12 minutes.
What's the Problem?

No CMS jobs available
ID: 30509 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,265,793
RAC: 58,030
Message 30511 - Posted: 27 May 2017, 5:30:24 UTC - in response to Message 30509.  

No CMS jobs available

Any idea when new jobs will be available again? I guess not before next week?
ID: 30511 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30515 - Posted: 27 May 2017, 7:05:46 UTC - in response to Message 30511.  

Sorry, just woke up to find this. Can't see what the problem is yet, it doesn't appear to be the WMAgent this time. Could take a while to fix, it was a holiday at CERN on Thursday, so many people turn that into a four-day weekend. (It's a long weekend here in the UK, but the important people are at CERN.)
ID: 30515 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2025
Credit: 147,919,661
RAC: 116,272
Message 30520 - Posted: 27 May 2017, 11:01:44 UTC

Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES?
According to the server status page the #WUs is still high.
According to Laurence's post it should be possible to stop it.
ID: 30520 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30521 - Posted: 27 May 2017, 12:11:01 UTC - in response to Message 30520.  

Shouldn't it be time to stop the WU generation to avoid EXIT_INIT_FAILURES?
According to the server status page the #WUs is still high.
According to Laurence's post it should be possible to stop it.

We're running again; we had a backlog of merge jobs that filled the queue.
Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."
ID: 30521 · Report as offensive     Reply Quote
Michael H.W. Weber

Joined: 18 Sep 04
Posts: 30
Credit: 5,100,929
RAC: 0
Message 30523 - Posted: 27 May 2017, 13:24:59 UTC
Last modified: 27 May 2017, 13:25:51 UTC

Two CMS tasks failing after 10 min, each on a different host, one Windows and one Linux:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=144126716
https://lhcathome.cern.ch/lhcathome/result.php?resultid=144113647

Michael.
ID: 30523 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,265,793
RAC: 58,030
Message 30526 - Posted: 27 May 2017, 16:57:53 UTC - in response to Message 30523.  

also here: "[ERROR] Condor exited after 609s without running a Job"

complete report can be seen:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129
ID: 30526 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30527 - Posted: 27 May 2017, 19:58:03 UTC - in response to Message 30526.  
Last modified: 27 May 2017, 19:59:14 UTC

also here: "[ERROR] Condor exited after 609s without running a Job"

complete report can be seen:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129

Unfortunately that URL was incomplete; looks like it should have been https://lhcathome.cern.ch/lhcathome/result.php?resultid=144129213 . Things look OK our end, though there is a large oscillation in the number of jobs after an outage such as this, that we have yet to find an explanation for. It damps down eventually. Other tasks on that machine seem to be getting jobs, so unless it continues to happen I'd suggest regarding it as a one-off glitch.
Looking at your machine details, I'd also suggest that running ten 2-GB VMs on a 32 GB machine might be pushing the limits. Check your memory usage with Task Manager. I'd also strongly suggest a Windows Upgrade if you can; XP is beyond useful life and supported only in exceptional security circumstances (such as the ransomware earlier this month). Unless you've hacked the registry to make Microsoft think it's a PoS terminal...
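
To put rough numbers on that memory point, here is a minimal sketch in Python. It only uses the figures mentioned above (ten VMs, 2 GB each, 32 GB installed); the function name is made up for illustration, and it ignores hypervisor overhead and whatever the host OS itself needs, so the real headroom is likely smaller than it suggests.

```python
def vm_memory_budget(total_gb: float, vm_count: int, gb_per_vm: float) -> tuple[float, float]:
    """Return (GB committed to VMs, GB left over for the host and everything else)."""
    committed = vm_count * gb_per_vm
    return committed, total_gb - committed


# Figures from the post above: ten 2-GB VMs on a 32 GB machine.
committed, remaining = vm_memory_budget(total_gb=32, vm_count=10, gb_per_vm=2)
print(f"Committed to VMs: {committed} GB, left for the host: {remaining} GB")
```

That works out to 20 GB committed to the VMs and 12 GB for everything else, which is why checking actual usage in Task Manager is worth doing.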
ID: 30527 · Report as offensive     Reply Quote
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 350
Credit: 238,395
RAC: 0
Message 30543 - Posted: 29 May 2017, 12:58:23 UTC - in response to Message 30521.  
Last modified: 29 May 2017, 14:29:20 UTC

Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."


An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem.
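
For readers wondering what a queue-based kill switch might look like in principle, here is a minimal sketch in Python. It is not the project's actual implementation: the function name, the threshold value, and the idea of a single queue-depth number are all assumptions made for illustration, since the thread does not say how the real check works.

```python
# Hypothetical sketch only; not the real LHC@home kill switch.
MIN_QUEUED_JOBS = 50  # assumed threshold; the actual value is not given in the thread


def should_send_tasks(queued_jobs: int, min_queued: int = MIN_QUEUED_JOBS) -> bool:
    """Return True while the job queue is deep enough to keep sending tasks."""
    return queued_jobs >= min_queued


if __name__ == "__main__":
    # Walk through a few example queue depths and show the decision for each.
    for depth in (0, 10, 50, 500):
        state = "send" if should_send_tasks(depth) else "hold"
        print(f"queue depth {depth:>3}: {state}")
```

The point of gating on queue depth rather than on agent status is that tasks only fail when there are no jobs to hand out, whatever the underlying cause.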
ID: 30543 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30546 - Posted: 29 May 2017, 15:56:54 UTC - in response to Message 30543.  

Message from Laurence, "Next week I will implement the automatic shutdown of tasks as a priority."


An automated kill switch has been implemented. The sending of tasks should be stopped when the queue is too low. We will find out if this works the next time there is a problem.

Thanks, Laurence.
ID: 30546 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30549 - Posted: 29 May 2017, 18:28:40 UTC

I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes).
ID: 30549 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 883
Credit: 5,852,630
RAC: 84
Message 30551 - Posted: 29 May 2017, 19:24:23 UTC - in response to Message 30549.  

I've just spotted that our WMAgent has problems again, and the queue is depleting. Best take evasive action if necessary until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes).

No change yet:

agent: vocms0159.cern.ch (1.1.2.patch2)
agent last updated: 2017/5/29 (Mon) 19:20:19 UTC : 0 h 0 m
data last updated: N/A
status: Components or Thread down;
team: testbed-vocms0159
ID: 30551 · Report as offensive     Reply Quote


©2022 CERN