CMS Tasks Failing
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1821 Credit: 118,940,891 RAC: 22,334 |
+ 1 !!! |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
Fortuitously not only was I up early enough to notice the failure as it happened, but also Alan was awake and aware to fix the problem within minutes! I'd give us 10/10 for responsiveness, YMMV. From this graph you can see that when I woke up about 0520 local there were still jobs running, and that Alan had it fixed before 0800. (We are user cmst1.) |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
I'm seeing some strange behaviour in my CMS job monitoring graphs at the moment. I'm not sure if this is related to a WMAgent component failure early Saturday morning -- that was a database error that doesn't appear to have any tangible effect on our operation. However... I'm seeing a gradual decline in the number of queued jobs, over about 1.5 hours, from our normal 700 jobs to 150-200 jobs, but then the queue jumps back up to 700, again for another 1.5 hours or so before starting another decline. At the moment the only advice I can give is to watch your error rates from time to time, and be prepared to set No New Tasks or take other evasive actions, until I manage to contact someone at CERN who is not on holidays. |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
"I'm seeing some strange behaviour in my CMS job monitoring graphs at the moment. I'm not sure if this is related to a WMAgent component failure early Saturday morning -- that was a database error that doesn't appear to have any tangible effect on our operation. However..." All WMAgent components are now running again, and the strange queue behaviour seems to have gone away. |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,940,891 RAC: 22,334 |
Did anyone else have the same problem I had on all my PCs this afternoon? All tasks failed after 10-14 minutes. Now it seems to be back to normal. |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
"did anyone else have the same problem like I had on all my PCs this afternoon: all tasks failed after 10-14 minutes." See here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4413&postid=32216 |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,940,891 RAC: 22,334 |
Does anyone have any idea what's going on? All CMS tasks on my computers have been failing for several hours. |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
Erich56 wrote: "anyone any idea what's going on? All CMS tasks on my computers failing since several hours ago." At least a couple of them failed with error 207 (0x000000CF) EXIT_NO_SUB_TASKS. Meanwhile the emergency brake was activated and the server queue is down to 0. Ivan, are you aware of this? |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
Erich56 wrote: "anyone any idea what's going on? All CMS tasks on my computers failing since several hours ago." Trying desperately to find out what is going on. Problem is I was on Eurostar all afternoon and now I'm stuck in a hotel room in Montparnasse with just an Android tablet and a netbook newly converted to Mint Linux for company. The queue is running down. The last batch is used up. I submitted a new batch yesterday but jobs are not being put into the queue. I also submitted new credentials for the VMs that Laurence runs at CERN (and one of mine...); that should not affect Volunteers' machines. The server status shows no tasks in the CMS queue. I do not know yet whether that is because the queue ran out and Laurence's brake kicked in or if there are still some issues from yesterday's CERN problems. WMAgent was in a funny state yesterday; Alan just restarted it and it's looking fine at the moment. I'll try to contact Laurence, who is probably in an AirBnB not too far away but e-mail will probably be quicker than knocking door-to-door. |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
I might have found the problem. I found an error log stating that there were problems with the agent, so it was probably already in a confused state when I submitted the new batch. I aborted it and submitted another. It will take a little while to start queueing jobs (if it does...) and then I'm not sure if Laurence has to manually remove the block on task submissions or if it happens automagically. |
Joined: 26 Dec 09 Posts: 10 Credit: 1,192,862 RAC: 0 |
Could someone please set up some more CMS-Tasks on the dev-project? There have not been any for the past 3 days or so. |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298 |
"Could someone please set up some more CMS-Tasks on the dev-project?" Hmm, you're right. Sorry, I was in Paris for BOINC Workshop 2017 (just got home) and didn't notice. Probably related to all the problems CERN had the other day; I'll tickle the relevants. |
Joined: 26 Dec 09 Posts: 10 Credit: 1,192,862 RAC: 0 |
Thanks, Ivan. |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,940,891 RAC: 22,334 |
For the past few hours, all newly started CMS tasks have been failing after about 10-14 minutes. What's the problem? From the pattern shown, I would guess it's the WMAgent? |
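Earlier in this thread the advice was to watch your error rates and be prepared to set No New Tasks, and one report mentioned tasks failing with error 207 (0x000000CF) EXIT_NO_SUB_TASKS. As a rough illustration only, here is a minimal Python sketch of that kind of check. Everything in it is an assumption made for the example: the `TaskResult` record, the task names and runtimes, the 50% threshold and the minimum sample size are invented, and the sketch does not read anything from a real BOINC client.

```python
from collections import Counter
from dataclasses import dataclass

# Exit code quoted in this thread; decimal 207 and hex 0xCF are the same value.
EXIT_NO_SUB_TASKS = 207


@dataclass
class TaskResult:
    """Hypothetical record of one finished task (not a BOINC data structure)."""
    name: str
    exit_code: int      # assumption: 0 means the task finished successfully
    runtime_min: float


def format_exit_code(code: int) -> str:
    """Show an exit code in both decimal and 8-digit hex form."""
    return f"{code} (0x{code:08X})"


def failure_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks with a non-zero exit code; 0.0 for an empty list."""
    if not results:
        return 0.0
    failed = sum(1 for r in results if r.exit_code != 0)
    return failed / len(results)


def should_pause(results: list[TaskResult],
                 threshold: float = 0.5,
                 min_samples: int = 4) -> bool:
    """Suggest pausing new work once enough recent tasks have failed.

    The threshold and sample size are arbitrary illustrative values.
    """
    return len(results) >= min_samples and failure_rate(results) >= threshold


# Invented sample data: one normal task and three short failures.
recent = [
    TaskResult("task_a", 0, 720.0),
    TaskResult("task_b", EXIT_NO_SUB_TASKS, 12.0),
    TaskResult("task_c", EXIT_NO_SUB_TASKS, 11.0),
    TaskResult("task_d", EXIT_NO_SUB_TASKS, 14.0),
]

print("Exit code:", format_exit_code(EXIT_NO_SUB_TASKS))
print("Counts by exit code:", dict(Counter(r.exit_code for r in recent)))
print("Failure rate:", failure_rate(recent))
print("Consider No New Tasks:", should_pause(recent))
```

If a check like this trips, the action is the one suggested above: set No New Tasks for the project in BOINC Manager. As far as I know the command-line equivalent is `boinccmd --project <URL> nomorework`, but please verify that against your own client version before relying on it.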
©2024 CERN