Message boards : CMS Application : no new WUs available
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,160,730
RAC: 2,238
Message 29855 - Posted: 7 Apr 2017, 5:14:51 UTC

For some time now, the Project Status page has shown "0" for CMS WUs (and for other projects as well).

Major problem somewhere?
ID: 29855
Phil
Joined: 26 Jul 05
Posts: 63
Credit: 4,083,755
RAC: 0
Message 29858 - Posted: 7 Apr 2017, 6:59:06 UTC - in response to Message 29855.  

yep, something is asleep:

Transitioner backlog (hours): 8.11

ID: 29858
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,160,730
RAC: 2,238
Message 33003 - Posted: 6 Nov 2017, 13:31:25 UTC

There seem to have been no new WUs available for the past few hours. Is it only a short-term problem, or something else?
ID: 33003
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33004 - Posted: 6 Nov 2017, 13:54:15 UTC - in response to Message 33003.  

There seem to have been no new WUs available for the past few hours. Is it only a short-term problem, or something else?

Sorry, I miscalculated the length of a new batch and it ran out sooner than I expected. Should be new jobs soon.
ID: 33004
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,160,730
RAC: 2,238
Message 33005 - Posted: 6 Nov 2017, 13:58:42 UTC

thanks for the quick information, Ivan :-)
ID: 33005
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33006 - Posted: 6 Nov 2017, 14:09:19 UTC - in response to Message 33005.  

The good news is that the new WMAgent has been successfully tickled into life, so I'll submit a larger batch there.
ID: 33006
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,160,730
RAC: 2,238
Message 33018 - Posted: 8 Nov 2017, 5:59:52 UTC

good morning, Ivan

again, no tasks available (a few hours before, there were tasks but no jobs). Please fill the queue :-)
ID: 33018
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33019 - Posted: 8 Nov 2017, 8:19:18 UTC - in response to Message 33018.  

good morning, Ivan

again, no tasks available (a few hours before, there were tasks but no jobs). Please fill the queue :-)

Hi;
I was a bit surprised, as the batch estimate was another 30 hours or more last night. I quickly dispatched a new batch but when I looked at WMStats there were no data. Seems at least one WMAgent component has failed (AnalyticsDataCollector). The maintainers have been notified.
Sorry for the inconvenience. The new batch has arrived in the system, but we may not see any action until WMAgent is tickled back into life.
ID: 33019
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33021 - Posted: 8 Nov 2017, 8:39:35 UTC - in response to Message 33019.  

OK, WMAgent has been reset, and the job queue is filling again. Some of the monitors lag a bit (esp. Server Status and Job Activities) so it might not be obvious yet. Looks like the monitor that stops task creation when there are no jobs is doing its job.
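
To illustrate the kind of guard mentioned above, here is a minimal sketch, not the project's actual monitor. The function name, its parameters, and the target of 100 unsent tasks are all hypothetical. It shows the idea of creating no new tasks while the job queue is empty.

```python
def tasks_to_create(queued_jobs: int, unsent_tasks: int, target_unsent: int = 100) -> int:
    """Return how many new tasks to create in this cycle.

    All names and the default target are hypothetical, for illustration only.
    """
    # With no jobs in the queue, new tasks would start and find nothing
    # to run, so create none.
    if queued_jobs <= 0:
        return 0
    # Otherwise top up the pool of unsent tasks towards the target,
    # but never create more tasks than there are jobs waiting.
    shortfall = max(0, target_unsent - unsent_tasks)
    return min(shortfall, queued_jobs)
```

With an empty job queue the function returns 0 however few unsent tasks remain, which is the behaviour described in the post.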
ID: 33021
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33040 - Posted: 10 Nov 2017, 8:23:14 UTC - in response to Message 33021.  
Last modified: 10 Nov 2017, 8:25:25 UTC

OK, WMAgent has been reset, and the job queue is filling again. Some of the monitors lag a bit (esp. Server Status and Job Activities) so it might not be obvious yet. Looks like the monitor that stops task creation when there are no jobs is doing its job.

OK, it stopped again just now, and the queue is draining fast. Please set No New Tasks or otherwise prepare for lack of CMS jobs. I've mailed CERN; they may be delving deeper into the failure before restarting. More when I get into work in ~30 mins.
[EDIT] OK, don't panic! WMAgent has been restarted and the queue is starting to fill again. As you were! Now I can be a bit more leisurely about my stroll down Church Road. :-) [/EDIT]
ID: 33040
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,160,730
RAC: 2,238
Message 33041 - Posted: 10 Nov 2017, 9:05:56 UTC

If I remember correctly, they put a new release of the WMAgent in place a few months ago.
It seems it didn't exactly make things better :-)
ID: 33041
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33045 - Posted: 10 Nov 2017, 16:24:19 UTC - in response to Message 33041.  

If I remember correctly, they put a new release of the WMAgent in place a few months ago.
It seems it didn't exactly make things better :-)

Actually, that was in a different VM; this is in a new VM running CERN CentOS 7, but the way it's configured it can't handle the number of jobs we are running (MySQL crashes). I'll be reverting to the old SLC6 VM until a more capable VM can be provisioned.
ID: 33045
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33046 - Posted: 11 Nov 2017, 11:33:11 UTC - in response to Message 33045.  

The WMAgent died again; I've notified CERN. There should only be a few hours left in this batch; I've submitted a new batch to the old WMAgent and it's ready to go when this batch clears.
ID: 33046
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33047 - Posted: 11 Nov 2017, 12:56:30 UTC - in response to Message 33046.  

We are picking up jobs from the new batch already; the new WMAgent is still down.
ID: 33047
ritterm
Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 33166 - Posted: 30 Nov 2017, 2:34:33 UTC
Last modified: 30 Nov 2017, 2:38:26 UTC

Looks like the CMS queue is running dry... Server Status shows 7 unsent tasks and I returned 3 results that ended like this:

2017-11-30 00:34:19 (23855): Guest Log: [ERROR] No jobs were available to run.
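
For anyone wanting to spot this condition in their own task output, here is a minimal sketch. It assumes only that the message text matches the line quoted above; the function and variable names are hypothetical.

```python
import re

# Matches the guest-log error quoted in this post. The exact wording is
# taken from that single line; other builds may word it differently.
NO_JOBS_PATTERN = re.compile(r"Guest Log: \[ERROR\] No jobs were available to run\.")


def count_no_job_errors(log_lines):
    """Count the log lines that report an empty job queue."""
    return sum(1 for line in log_lines if NO_JOBS_PATTERN.search(line))


sample_log = [
    "2017-11-30 00:34:19 (23855): Guest Log: [ERROR] No jobs were available to run.",
]
no_job_errors = count_no_job_errors(sample_log)
```

A non-zero count suggests the task started but found no jobs to run, which is different from the server having no tasks to send.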
ID: 33166
Erich56

Joined: 18 Dec 15
Posts: 1284
Credit: 23,160,730
RAC: 2,238
Message 33170 - Posted: 30 Nov 2017, 5:47:32 UTC - in response to Message 33166.  

other subprojects (like LHCb) are also failing.
Looks as if the WMAgent is down again.
ID: 33170
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33172 - Posted: 30 Nov 2017, 8:07:22 UTC - in response to Message 33166.  

Yes, sorry about that; the switch-over to the new new WMAgent happened after I'd gone to bed, and it seems we weren't primed to take its jobs. I've notified CERN and will submit a small batch to the old agent in the meantime to try to get some jobs available again.
ID: 33172
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33173 - Posted: 30 Nov 2017, 8:52:34 UTC - in response to Message 33172.  

The new batch is now serving jobs; I'll keep an eye on the situation throughout the day.
ID: 33173
maeax

Joined: 2 May 07
Posts: 962
Credit: 34,012,022
RAC: 8,132
Message 33200 - Posted: 4 Dec 2017, 7:08:13 UTC

The Server shows ZERO tasks for the moment!
ID: 33200
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 2,460
Message 33201 - Posted: 4 Dec 2017, 8:56:48 UTC - in response to Message 33200.  

The Server shows ZERO tasks for the moment!

That's strange, because we have 2,000 in the queue. There are a couple of other little incongruities -- I'm starting to suspect that the new WMAgent channel we brought up last week is only serving the CERN VMs, and not Volunteer machines. I'll contact the team...
ID: 33201


©2020 CERN