Message boards : CMS Application : "No jobs were available to run" since this morning.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30797 - Posted: 16 Jun 2017, 4:19:59 UTC

WMAgent down again?

stderr says "2017-06-16 05:43:39 (7528): VM Completion Message: No jobs were available to run"
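
For anyone who wants to check their own task logs for this message, here is a minimal Python sketch. It assumes every stderr line follows the layout of the single line quoted above (timestamp, a number in parentheses that is presumably a process ID, then the message); the names `parse_stderr_line` and `is_no_jobs` are made up for this illustration and are not part of BOINC or vboxwrapper.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the one line quoted in this post:
# "YYYY-MM-DD HH:MM:SS (number): message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")
NO_JOBS = "No jobs were available to run"


class StderrLine(NamedTuple):
    timestamp: datetime
    pid: int  # assumption: the number in parentheses is a process ID
    message: str


def parse_stderr_line(line: str) -> Optional[StderrLine]:
    """Split one stderr line into its parts, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return StderrLine(timestamp, int(match.group(2)), match.group(3))


def is_no_jobs(line: str) -> bool:
    """True if the line carries the 'no jobs' completion message."""
    parsed = parse_stderr_line(line)
    return parsed is not None and NO_JOBS in parsed.message


sample = "2017-06-16 05:43:39 (7528): VM Completion Message: No jobs were available to run"
print(is_no_jobs(sample))
```

Running `is_no_jobs` over each line of a saved stderr file would flag the tasks that ended this way.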
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30799 - Posted: 16 Jun 2017, 8:12:05 UTC - in response to Message 30797.  

WMAgent down again?

stderr says "2017-06-16 05:43:39 (7528): VM Completion Message: No jobs were available to run"

Yes:

agent: vocms0159.cern.ch (1.1.2.patch2)
agent last updated: 2017/6/16 (Fri) 08:06:34 UTC : 0 h 4 m
data last updated: N/A
status: Components or Thread down;
team: testbed-vocms0159

I've messaged Alan and Laurence.
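
The status block above is plain "key: value" text, so it can be checked mechanically. The following is only a sketch: it assumes the fields always appear one per line in the form shown, and the helper names `parse_status` and `looks_down` are hypothetical, not part of any WMAgent tooling.

```python
from typing import Dict

# The status text exactly as pasted in this post.
STATUS_TEXT = """\
agent: vocms0159.cern.ch (1.1.2.patch2)
agent last updated: 2017/6/16 (Fri) 08:06:34 UTC : 0 h 4 m
data last updated: N/A
status: Components or Thread down;
team: testbed-vocms0159
"""


def parse_status(text: str) -> Dict[str, str]:
    """Split each line at the first ': ' into a key and a value."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that contain no separator
            fields[key.strip()] = value.strip()
    return fields


def looks_down(fields: Dict[str, str]) -> bool:
    """Heuristic (an assumption): treat any status mentioning 'down' as a failure."""
    return "down" in fields.get("status", "").lower()


fields = parse_status(STATUS_TEXT)
print(fields["agent"], "down" if looks_down(fields) else "ok")
```

Splitting at the first separator only keeps the time-of-day colons in the "agent last updated" value intact.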
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30800 - Posted: 16 Jun 2017, 8:41:24 UTC - in response to Message 30799.  
Last modified: 16 Jun 2017, 8:41:37 UTC

The automatic brake worked, but I will reduce the buffer.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30801 - Posted: 16 Jun 2017, 8:56:11 UTC - in response to Message 30800.  

The automatic brake worked ...

Okay, but this would only eliminate the symptoms, not the cause.

The cause, as in most cases, is the WMAgent. What's wrong with the WMAgent that it fails every other week?
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30803 - Posted: 16 Jun 2017, 9:31:44 UTC - in response to Message 30801.  

The automatic brake worked ...

Okay, but this would only eliminate the symptoms, not the cause.

The cause, as in most cases, is the WMAgent. What's wrong with the WMAgent that it fails every other week?

It's a complex system. In fact, this same problem has apparently been affecting the production systems as well. There's a new release being prepared; it will probably be deployed in a week or so.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30805 - Posted: 16 Jun 2017, 13:16:17 UTC - in response to Message 30803.  

For some understanding of the complexity of WMAgent, take a look at this wiki.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30807 - Posted: 16 Jun 2017, 14:59:49 UTC - in response to Message 30805.  

For some understanding of the complexity of WMAgent, take a look at this wiki.

Thanks, Ivan, for providing the link.
So, let's keep our fingers crossed that the new release will be more stable :-)
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30833 - Posted: 19 Jun 2017, 0:18:35 UTC

I see another WMAgent problem. I don't expect a fix at this time on a Sunday night. :-/
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30834 - Posted: 19 Jun 2017, 3:06:06 UTC - in response to Message 30833.  

I see another WMAgent problem.

It really seems to be time to implement the new release of the WMAgent :-)
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30841 - Posted: 19 Jun 2017, 7:32:59 UTC - in response to Message 30834.  
Last modified: 19 Jun 2017, 7:40:50 UTC

I see another WMAgent problem.

It really seems to be time to implement the new release of the WMAgent :-)

I guess we'll find out soon enough. :-/

[Edit] Ah, it wasn't a WMAgent problem per se, but a side-effect of a network problem. See https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&n=OTG0038195 if it's a public URL. [/Edit]
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30846 - Posted: 19 Jun 2017, 9:02:32 UTC - in response to Message 30841.  
Last modified: 19 Jun 2017, 9:06:13 UTC

Right now, more than 11,700 "unsent" tasks are shown on the Project Status page; however, when trying to fetch work, BOINC says "no tasks available" on all my hosts.

Why so?
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30847 - Posted: 19 Jun 2017, 9:10:22 UTC - in response to Message 30846.  

Right now, more than 11,700 "unsent" tasks are shown on the Project Status page; however, when trying to fetch work, BOINC says "no tasks available" on all my hosts.

Why so?

Yes, I see what you mean. Looks like something needs to be tickled server-side.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30855 - Posted: 19 Jun 2017, 10:51:20 UTC - in response to Message 30847.  

From the looks of it, there may be a major server problem: ATLAS tasks cannot be downloaded now either (although 16,500 are shown as "unsent").
:-)
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30864 - Posted: 19 Jun 2017, 13:47:25 UTC

I just got new CMS tasks on both my machines.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30877 - Posted: 19 Jun 2017, 15:26:09 UTC - in response to Message 30864.  

I received CMS jobs as well as ATLAS jobs.

However, one CMS task errored out with "computation error" after 12 minutes.
STDERR says:
2017-06-19 17:13:02 (6060): VM Heartbeat file specified, but missing.
2017-06-19 17:13:02 (6060): VM Heartbeat file specified, but missing file system status. (errno = '2')

Any idea what the reason for this could be?
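
One small clue is the "errno = '2'" in the second line. Assuming the wrapper reports a standard C library error number (an assumption; the log itself does not say which numbering it uses), the code can be looked up with Python's standard library:

```python
import errno
import os

code = 2  # the value quoted in the stderr line above

# Symbolic name and human-readable text for the error number,
# as defined by the platform's C library.
name = errno.errorcode.get(code, "unknown")
text = os.strerror(code)

print(name, "-", text)
```

Under that assumption, the code corresponds to ENOENT, i.e. the heartbeat file did not exist at the path the wrapper checked. That matches the wording of the first line, but it does not by itself explain why the file was never created.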
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30938 - Posted: 22 Jun 2017, 9:57:56 UTC - in response to Message 30803.  

Ivan wrote last week:
There's a new release (of the WMAgent) being prepared; it will probably be deployed in a week or so.

Ivan, is this taking place right now, and is it the reason why no tasks (CMS and others as well) can be downloaded?
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30942 - Posted: 22 Jun 2017, 11:07:43 UTC

Two of my hosts just downloaded new CMS tasks.
So all seems to run well again, whatever the reason for the disruption was.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 30949 - Posted: 22 Jun 2017, 14:28:46 UTC - in response to Message 30938.  

Ivan wrote last week:
There's a new release (of the WMAgent) being prepared; it will probably be deployed in a week or so.

Ivan, is this taking place right now, and is it the reason why no tasks (CMS and others as well) can be downloaded?

No, not as far as I'm aware. I know that Laurence is tinkering with his cluster, but I doubt that would affect all projects. I'll be sure to give you as much warning as I can when the WMAgent update is scheduled.
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,815
RAC: 102,435
Message 30950 - Posted: 22 Jun 2017, 14:39:22 UTC - in response to Message 30949.  

I'll be sure to give you as much warning as I can when the WMAgent update is scheduled.

Thanks a lot, Ivan
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,083,677
RAC: 105,711
Message 31037 - Posted: 25 Jun 2017, 11:15:54 UTC

The counter for CMS tasks also went to zero (it is at 500 at the moment).