Message boards : CMS Application : CMS Tasks Failing
Brummig
Joined: 9 Feb 16
Posts: 35
Credit: 441,057
RAC: 299
Message 29330 - Posted: 16 Mar 2017, 15:43:43 UTC
Last modified: 16 Mar 2017, 15:43:58 UTC

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 589
Credit: 371,025,353
RAC: 14,672
Message 29335 - Posted: 16 Mar 2017, 18:43:26 UTC - in response to Message 29330.  

They are both networking errors: your computer was unable to establish a connection to CERN.

Generally it's a random error.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29378 - Posted: 18 Mar 2017, 10:51:02 UTC

Sorry, folks, one of our CERN servers failed during the night so no new jobs are being sent. To preserve your task quotas, you should set your CMS machines to "no new tasks" or suspend BOINC. I've notified the person responsible but it's the weekend... If there's no response by later today I'll raise a trouble ticket directly with CERN IT.
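
For anyone who prefers the command line to the BOINC Manager button, here is a minimal sketch of setting "no new tasks" through `boinccmd`. The `nomorework` and `allowmorework` project operations are part of `boinccmd`; the project URL constant and the helper names below are assumptions for illustration, so substitute the URL your own client lists for LHC@home.

```python
import shutil
import subprocess

# Assumption: replace with the project URL shown by your BOINC client.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def no_new_tasks_cmd(project_url, allow=False):
    """Build the boinccmd call that stops (or resumes) work fetch for one project."""
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


def set_no_new_tasks(project_url=PROJECT_URL, allow=False):
    """Run the command if boinccmd is on PATH; return True on success."""
    if shutil.which("boinccmd") is None:
        return False  # boinccmd is not installed or not on PATH
    result = subprocess.run(no_new_tasks_cmd(project_url, allow), check=False)
    return result.returncode == 0
```

Calling `set_no_new_tasks()` stops work fetch for the project, and `set_no_new_tasks(allow=True)` turns it back on once jobs are flowing again.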
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29382 - Posted: 18 Mar 2017, 11:52:52 UTC - in response to Message 29378.  

Sorry, folks, one of our CERN servers failed during the night so no new jobs are being sent. To preserve your task quotas, you should set your CMS machines to "no new tasks" or suspend BOINC. I've notified the person responsible but it's the weekend... If there's no response by later today I'll raise a trouble ticket directly with CERN IT.

Fortunately my contact was reading his mail. The server raised an apparently-spurious database alarm and stopped. It's been restarted and I see evidence of jobs getting into the queue again so I think it's safe to continue now.
Erich56

Joined: 18 Dec 15
Posts: 1255
Credit: 22,992,130
RAC: 3,439
Message 29443 - Posted: 20 Mar 2017, 16:50:35 UTC

Unfortunately, I had several CMS tasks fail after some 10-12 minutes this afternoon.

As an example, please see here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127237080

Any of the "experts" able to tell me what was going wrong?
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 589
Credit: 371,025,353
RAC: 14,672
Message 29465 - Posted: 20 Mar 2017, 21:43:36 UTC

To me this looks like there was no CMS work: the task pinged OK but got no work.

CMS tasks fail after 10 minutes if there is an initialization problem (for a number of reasons, often network issues).

This is so they fail fast without wasting CPU time on your computer.
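
The fail-fast idea can be sketched roughly as follows. This is not the actual CMS wrapper code, only an illustration of a start-up deadline: the 600-second default mirrors the 10 minutes mentioned above, and `wait_for_job`, `job_started` and the polling interval are hypothetical names and values.

```python
import time


def wait_for_job(job_started, timeout_s=600.0, poll_s=5.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Return True if job_started() becomes true before timeout_s elapses.

    Returning False lets the caller end the task early instead of
    idling and holding resources when no job ever arrives.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if job_started():
            return True
        sleep(poll_s)  # wait before checking again
    return False
```

The clock and sleep functions are parameters only so that the logic can be exercised without really waiting ten minutes.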

In my analysis the task failure rate for the project was about 15% before. I'm seeing 8% at the moment, and I know some of that is of my own making, so things are more stable than before.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29467 - Posted: 20 Mar 2017, 23:27:49 UTC - in response to Message 29443.  

Unfortunately, I had several CMS tasks fail after some 10-12 minutes this afternoon.

As an example, please see here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127237080

Any of the "experts" able to tell me what was going wrong?

Sorry, I literally fell asleep at the wheel... I did a monthly update on my Windows machine at work, and rested my eyes while it was chugging away. I woke up an hour later, to find that I was just that much too late to submit a new batch of jobs -- I'd been letting it run down to the wire because an Italian colleague is also testing WMAgent job submission, and I wanted her batch to get in the queue first. Only, she didn't read my mail until later this afternoon and didn't submit a batch until then.
So, we ran out of queued jobs for nearly an hour while my new batch wended its way through the submission queue. We got down to about 170 jobs still running before the new jobs entered the queue and started running as tasks came online and requested them. There's been a little pogoing of the running job count while we recover.
Moral of the story: I should watch the pending queue rather than the batch estimated-time-to-completion, and perhaps I should go to bed a little earlier on Sunday nights. :-)
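
As a rough illustration of that moral, a check driven by the pending queue might look like the sketch below. None of this is WMAgent code: the function names, the safety factor and the example figures are assumptions, and the roughly one-hour submission lag is only taken from the story above.

```python
def hours_of_work_left(pending_jobs, jobs_started_per_hour):
    """Estimate how long the pending queue lasts at the current start rate."""
    if pending_jobs <= 0:
        return 0.0  # nothing queued, so the queue is already empty
    if jobs_started_per_hour <= 0:
        return float("inf")  # nothing is being consumed right now
    return pending_jobs / jobs_started_per_hour


def should_submit_batch(pending_jobs, jobs_started_per_hour,
                        submission_lag_hours, safety_factor=2.0):
    """Submit once the queue would drain before a new batch could arrive."""
    remaining = hours_of_work_left(pending_jobs, jobs_started_per_hour)
    return remaining <= submission_lag_hours * safety_factor
```

With a one-hour lag and hypothetical figures of 100 pending jobs started at 200 per hour, `should_submit_batch(100, 200, 1.0)` says to submit now, whereas with 1000 pending jobs it does not.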
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29540 - Posted: 22 Mar 2017, 21:14:08 UTC

Here we go again -- something else in WMAgent has died. Set No New Tasks to protect your daily quota. :-(
Erich56

Joined: 18 Dec 15
Posts: 1255
Credit: 22,992,130
RAC: 3,439
Message 29545 - Posted: 23 Mar 2017, 4:11:12 UTC

After getting up this morning, I noticed a lot of failed tasks; all of them ran for only about 10 minutes and then failed.
One line of the log:

[ERROR] Condor exited after 626s without running a job.

See example here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127799651

What's going wrong?
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 932
Credit: 6,284,204
RAC: 718
Message 29547 - Posted: 23 Mar 2017, 8:31:16 UTC - in response to Message 29545.  

What's going wrong?

Read Ivan's previous post.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29548 - Posted: 23 Mar 2017, 8:31:25 UTC - in response to Message 29545.  

After getting up this morning, I noticed a lot of failed tasks; all of them ran for only about 10 minutes and then failed.
One line of the log:

[ERROR] Condor exited after 626s without running a job.

See example here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127799651

What's going wrong?

There's been a failure at CERN and there are no jobs to be had. In that case the BOINC task times out after ten minutes. It's best to set No New Tasks or switch to another project until the problem is solved.
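
To spot this case in a task's output, a small sketch such as the one below can pull the run time out of the log line quoted above. The pattern is based only on that single line, so the exact wording in other versions of the wrapper is an assumption, and `idle_exit_seconds` is a hypothetical helper name.

```python
import re

# Pattern taken from the one log line quoted in this thread.
NO_JOB_RE = re.compile(
    r"\[ERROR\] Condor exited after (\d+)s without running a job"
)


def idle_exit_seconds(log_text):
    """Return the seconds reported by the first matching line, or None."""
    match = NO_JOB_RE.search(log_text)
    return int(match.group(1)) if match else None
```

A value a little over 600 seconds, with no job run, points to an empty job queue rather than a problem on the volunteer's machine.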
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29549 - Posted: 23 Mar 2017, 8:40:42 UTC

The problem has been traced to an authentication certificate becoming invalid, for reasons as yet unknown. CERN IT are working on it.
Erich56

Joined: 18 Dec 15
Posts: 1255
Credit: 22,992,130
RAC: 3,439
Message 29551 - Posted: 23 Mar 2017, 10:44:00 UTC - in response to Message 29549.  

The problem has been traced to an authentication certificate becoming invalid, for reasons as yet unknown. CERN IT are working on it.

Thanks, Ivan, for the information.
Would you please let us know here when it works again? :-)
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29558 - Posted: 23 Mar 2017, 12:12:26 UTC - in response to Message 29551.  

CMS@Home jobs are available again. I'll continue to monitor the situation in case we have a Total Inability To Support Usual Procedures situation again.
Erich56

Joined: 18 Dec 15
Posts: 1255
Credit: 22,992,130
RAC: 3,439
Message 29565 - Posted: 23 Mar 2017, 15:15:43 UTC - in response to Message 29558.  

CMS@Home jobs are available again.

I have a task running well for 28 minutes now.
All seems to be okay again :-)
Harri Liljeroos
Joined: 28 Sep 04
Posts: 422
Credit: 22,573,301
RAC: 7,065
Message 29770 - Posted: 2 Apr 2017, 11:29:54 UTC
Last modified: 2 Apr 2017, 11:34:30 UTC

I just had three tasks fail at the Condor ping. I did finish one successfully just before them (about 9 hours long).
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29771 - Posted: 2 Apr 2017, 12:02:47 UTC - in response to Message 29770.  

I just had three tasks fail at the Condor ping. I did finish one successfully just before them (about 9 hours long).

The WMAgent server has fallen over at CERN. Please set no new tasks until I can raise someone to fix it.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29772 - Posted: 2 Apr 2017, 13:08:20 UTC - in response to Message 29771.  

I just had three tasks fail at the Condor ping. I did finish one successfully just before them (about 9 hours long).

The WMAgent server has fallen over at CERN. Please set no new tasks until I can raise someone to fix it.

We seem to have jobs again!
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29898 - Posted: 10 Apr 2017, 22:52:15 UTC

WMAgent appears to have died again. Please set No New Tasks if you can. (I can't at the moment, my work laptop has died and I'm stuck at home with just an Android tablet and a Win10 tablet. 😢)
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 654
Credit: 4,937,401
RAC: 747
Message 29900 - Posted: 11 Apr 2017, 8:45:53 UTC

Back again.


©2020 CERN