Message boards : CMS Application : CMS Tasks Failing

Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,463,745
RAC: 29,779
Message 31829 - Posted: 5 Aug 2017, 10:21:08 UTC - in response to Message 31824.  

+ 1 !!!
ID: 31829
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 31830 - Posted: 5 Aug 2017, 11:25:25 UTC - in response to Message 31829.  

Thanks, guys. That does give a boost.
ID: 31830
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 31832 - Posted: 5 Aug 2017, 11:40:41 UTC - in response to Message 31822.  
Last modified: 5 Aug 2017, 11:42:13 UTC

Fortuitously not only was I up early enough to notice the failure as it happened, but also Alan was awake and aware to fix the problem within minutes! I'd give us 10/10 for responsiveness, YMMV.

From this graph you can see that when I woke up about 0520 local there were still jobs running, and that Alan had it fixed before 0800. (We are user cmst1.)
ID: 31832
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 31943 - Posted: 13 Aug 2017, 18:23:55 UTC
Last modified: 13 Aug 2017, 18:24:25 UTC

I'm seeing some strange behaviour in my CMS job monitoring graphs at the moment. I'm not sure if this is related to a WMAgent component failure early Saturday morning -- that was a database error that doesn't appear to have any tangible effect on our operation. However...
I'm seeing a gradual decline in the number of queued jobs, over about 1.5 hours, from our normal 700 jobs to 150-200 jobs, but then the queue jumps back up to 700, again for another 1.5 hours or so before starting another decline.
At the moment the only advice I can give is to watch your error rates from time to time, and be prepared to set No New Tasks or take other evasive actions, until I manage to contact someone at CERN who is not on holidays.
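
As a rough sketch of the advice above (watch the error rate, then set No New Tasks), something like the following could be scripted on a volunteer machine. It assumes the BOINC command-line tool `boinccmd` is installed and uses its `--project URL nomorework` operation, which is the command-line equivalent of No New Tasks. The project URL, the 50% threshold, and the list of task outcomes are illustrative assumptions, not anything the project prescribes.

```python
import subprocess

# Assumption: the project URL as registered in your BOINC client.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def failure_rate(outcomes):
    """Fraction of failed tasks in a list of booleans (True = task failed)."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def no_new_tasks_command(project_url=PROJECT_URL):
    """Build the boinccmd call that sets No New Tasks for one project."""
    return ["boinccmd", "--project", project_url, "nomorework"]


def maybe_stop_fetching(outcomes, threshold=0.5, dry_run=True):
    """Return the command if the failure rate reaches the (hypothetical)
    threshold, else None. With dry_run=False the command is also executed."""
    if failure_rate(outcomes) < threshold:
        return None
    cmd = no_new_tasks_command()
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

The matching `allowmorework` operation re-enables work fetch once things have recovered.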
ID: 31943
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 31946 - Posted: 14 Aug 2017, 13:26:57 UTC - in response to Message 31943.  

ivan wrote:
I'm seeing some strange behaviour in my CMS job monitoring graphs at the moment. I'm not sure if this is related to a WMAgent component failure early Saturday morning -- that was a database error that doesn't appear to have any tangible effect on our operation. However...
I'm seeing a gradual decline in the number of queued jobs, over about 1.5 hours, from our normal 700 jobs to 150-200 jobs, but then the queue jumps back up to 700, again for another 1.5 hours or so before starting another decline.
At the moment the only advice I can give is to watch your error rates from time to time, and be prepared to set No New Tasks or take other evasive actions, until I manage to contact someone at CERN who is not on holidays.

All WMAgent components are now running again, and the strange queue behaviour seems to have gone away.
ID: 31946
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,463,745
RAC: 29,779
Message 32234 - Posted: 4 Sep 2017, 16:55:56 UTC

Did anyone else have the same problem as I had on all my PCs this afternoon? All tasks failed after 10-14 minutes.

Now it seems to be back to normal.
ID: 32234
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,825,089
RAC: 37,015
Message 32235 - Posted: 4 Sep 2017, 17:04:56 UTC - in response to Message 32234.  

Erich56 wrote:
Did anyone else have the same problem as I had on all my PCs this afternoon? All tasks failed after 10-14 minutes.

Now it seems to be back to normal.

See here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4413&postid=32216
ID: 32235
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,463,745
RAC: 29,779
Message 32266 - Posted: 5 Sep 2017, 14:50:15 UTC

Does anyone have any idea what's going on? All CMS tasks on my computers have been failing for several hours.
ID: 32266
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,825,089
RAC: 37,015
Message 32270 - Posted: 5 Sep 2017, 15:18:43 UTC - in response to Message 32266.  

Erich56 wrote:
Does anyone have any idea what's going on? All CMS tasks on my computers have been failing for several hours.

At least a couple of them failed with error 207 (0x000000CF) EXIT_NO_SUB_TASKS.
Meanwhile the emergency brake was activated and the server queue is down to 0.

Ivan, are you aware of this?
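
For anyone matching the error quoted above against their own task logs: the decimal and hexadecimal forms are the same number. A minimal sketch, where the constant simply mirrors the name and value given in this post and `format_exit_code` is a hypothetical helper, not part of any BOINC tool:

```python
# Value and name as quoted in the post above.
EXIT_NO_SUB_TASKS = 207


def format_exit_code(code):
    """Render an exit code as decimal plus 8-digit uppercase hex."""
    return f"{code} (0x{code:08X})"


print(format_exit_code(EXIT_NO_SUB_TASKS))  # prints: 207 (0x000000CF)
```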
ID: 32270
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32275 - Posted: 5 Sep 2017, 15:56:04 UTC - in response to Message 32270.  

Erich56 wrote:
Does anyone have any idea what's going on? All CMS tasks on my computers have been failing for several hours.

computezrmle wrote:
At least a couple of them failed with error 207 (0x000000CF) EXIT_NO_SUB_TASKS.
Meanwhile the emergency brake was activated and the server queue is down to 0.

Ivan, are you aware of this?

Trying desperately to find out what is going on. Problem is I was on Eurostar all afternoon and now I'm stuck in a hotel room in Montparnasse with just an Android tablet and a netbook newly converted to Mint Linux for company.
The queue is running down. The last batch is used up. I submitted a new batch yesterday but jobs are not being put into the queue. I also submitted new credentials for the VMs that Laurence runs at CERN (and one of mine...); that should not affect Volunteers' machines.
The server status shows no tasks in the CMS queue. I do not know yet whether that is because the queue ran out and Laurence's brake kicked in or if there are still some issues from yesterday's CERN problems. WMAgent was in a funny state yesterday; Alan just restarted it and it's looking fine at the moment.
I'll try to contact Laurence, who is probably in an AirBnB not too far away but e-mail will probably be quicker than knocking door-to-door.
ID: 32275
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32283 - Posted: 5 Sep 2017, 18:18:29 UTC - in response to Message 32275.  

I might have found the problem. I found an error log stating that there were problems with the agent, so it was probably already in a confused state when I submitted the new batch. I aborted it and submitted another. It will take a little while to start queueing jobs (if it does...) and then I'm not sure if Laurence has to manually remove the block on task submissions or if it happens automagically.
ID: 32283
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32285 - Posted: 5 Sep 2017, 18:33:43 UTC - in response to Message 32283.  

No, same error... :-(
ID: 32285
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32292 - Posted: 5 Sep 2017, 20:19:41 UTC - in response to Message 32285.  

ivan wrote:
No, same error... :-(

Ah, a certain developer forgot to install a patch yesterday. I think we are rolling out tasks and jobs again. Sorry for the disruptions.
ID: 32292
Rasputin42

Joined: 26 Dec 09
Posts: 10
Credit: 1,192,862
RAC: 0
Message 32332 - Posted: 7 Sep 2017, 14:15:49 UTC
Last modified: 7 Sep 2017, 14:16:18 UTC

Could someone please set up some more CMS-Tasks on the dev-project?

There have not been any for the past 3 days or so.
ID: 32332
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32339 - Posted: 7 Sep 2017, 22:47:14 UTC - in response to Message 32332.  
Last modified: 7 Sep 2017, 22:47:54 UTC

Rasputin42 wrote:
Could someone please set up some more CMS-Tasks on the dev-project?

There have not been any for the past 3 days or so.

Hmm, you're right. Sorry, I was in Paris for BOINC Workshop 2017 (just got home) and didn't notice. Probably related to all the problems CERN had the other day; I'll tickle the relevants.
ID: 32339
Rasputin42

Joined: 26 Dec 09
Posts: 10
Credit: 1,192,862
RAC: 0
Message 32340 - Posted: 8 Sep 2017, 5:12:50 UTC - in response to Message 32339.  

Thanks, Ivan.
ID: 32340
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32345 - Posted: 8 Sep 2017, 10:24:05 UTC - in response to Message 32340.  

Rasputin42 wrote:
Thanks, Ivan.

No change yet, but I believe Laurence is still in Paris for a BOINC "hackathon" today.
ID: 32345
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,463,745
RAC: 29,779
Message 32715 - Posted: 9 Oct 2017, 10:58:53 UTC
Last modified: 9 Oct 2017, 11:03:39 UTC

For the past few hours, all newly started CMS tasks have been failing after about 10-14 minutes.

What's the problem?
From the pattern shown, I would guess it's the WMAgent?
ID: 32715
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32716 - Posted: 9 Oct 2017, 12:13:11 UTC - in response to Message 32715.  

Erich56 wrote:
For the past few hours, all newly started CMS tasks have been failing after about 10-14 minutes.

What's the problem?
From the pattern shown, I would guess it's the WMAgent?

Just noticed that myself. It doesn't appear to be WMAgent; that's not showing any errors. I'll alert Laurence.
ID: 32716
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 32720 - Posted: 9 Oct 2017, 15:28:34 UTC - in response to Message 32716.  

Laurence thinks he's found the problem but we have to wait a while to see if things recover.
ID: 32720