CMS Tasks Failing
Author | Message |
---|---|
Joined: 9 Feb 16 Posts: 48 Credit: 537,111 RAC: 0 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=124760186 https://lhcathome.cern.ch/lhcathome/result.php?resultid=126075866 Other LHC@Home tasks are running just fine. |
Joined: 27 Sep 08 Posts: 820 Credit: 684,357,036 RAC: 142,963 |
Both are networking errors: your computer could not establish a connection to CERN. Generally it's a random error. |
Joined: 29 Aug 05 Posts: 1048 Credit: 7,510,992 RAC: 7,518 |
Sorry, folks, one of our CERN servers failed during the night so no new jobs are being sent. To preserve your task quotas, you should set your CMS machines to "no new tasks" or suspend BOINC. I've notified the person responsible but it's the weekend... If there's no response by later today I'll raise a trouble ticket directly with CERN IT. |
Joined: 29 Aug 05 Posts: 1048 Credit: 7,510,992 RAC: 7,518 |
Fortunately my contact was reading his mail. The server raised an apparently-spurious database alarm and stopped. It's been restarted and I see evidence of jobs getting into the queue again so I think it's safe to continue now. |
Joined: 18 Dec 15 Posts: 1749 Credit: 115,651,430 RAC: 87,430 |
Unfortunately, I had several CMS tasks failing after some 10-12 minutes, this afternoon. As an example, please see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=127237080 Any of the "experts" able to tell me what was going wrong? |
Joined: 27 Sep 08 Posts: 820 Credit: 684,357,036 RAC: 142,963 |
This looks to me like there was no CMS work: it pinged OK but got no work. CMS tasks fail after 10 minutes if there is an initialization problem (for a number of reasons, often network issues). This is so they fail fast without wasting CPU time on your computer. In my analysis the failure rate of tasks in the project was about 15% before; I'm seeing 8% at the moment, and I know some of that is of my own making, so things are more stable than before. |
Joined: 29 Aug 05 Posts: 1048 Credit: 7,510,992 RAC: 7,518 |
"Unfortunately, I had several CMS tasks failing after some 10-12 minutes, this afternoon." Sorry, I literally fell asleep at the wheel... I did a monthly update on my Windows machine at work, and rested my eyes while it was chugging away. I woke up an hour later, to find that I was just that much too late to submit a new batch of jobs -- I'd been letting it run down to the wire because an Italian colleague is also testing WMAgent job submission, and I wanted her batch to get in the queue first. Only, she didn't read my mail until later this afternoon and didn't submit a batch until then. So, we ran out of queued jobs for nearly an hour while my new batch wended its way through the submission queue. We got down to about 170 jobs still running before the new jobs entered the queue and started running as tasks came online and requested them. There's been a little pogoing of the running job count while we recover. Moral of the story: I should watch the pending queue rather than the batch estimated-time-to-completion, and perhaps I should go to bed a little earlier on Sunday nights. :-) |
Send message Joined: 29 Aug 05 Posts: 1048 Credit: 7,510,992 RAC: 7,518 |
|
Joined: 18 Dec 15 Posts: 1749 Credit: 115,651,430 RAC: 87,430 |
After getting up this morning, I noticed a lot of failed tasks; all of them ran for only about 10 minutes and then failed. One line of the log: [ERROR] Condor exited after 626s without running a job. See an example here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=127799651 What's going wrong? |
Joined: 14 Jan 10 Posts: 1376 Credit: 9,162,540 RAC: 5,071 |
"what's going wrong?" Read Ivan's previous post. |
Joined: 29 Aug 05 Posts: 1048 Credit: 7,510,992 RAC: 7,518 |
"after getting up this morning, I noticed a lot of failed tasks, all of them ran only about 10 minutes and then failed." There's been a failure at CERN and there are no jobs to be had. In that case the BOINC task times out after ten minutes. It's best to set No New Tasks or switch to another project until the problem is solved. |
Send message Joined: 29 Aug 05 Posts: 1048 Credit: 7,510,992 RAC: 7,518 |
|
Joined: 18 Dec 15 Posts: 1749 Credit: 115,651,430 RAC: 87,430 |
"The problem has been traced to an authentication certificate becoming invalid, for reasons as yet unknown. CERN IT are working on it." Thanks, Ivan, for the information. Would you please let us know here when it works again? :-) |
Send message Joined: 29 Aug 05 Posts: 1048 Credit: 7,510,992 RAC: 7,518 |
|
Joined: 18 Dec 15 Posts: 1749 Credit: 115,651,430 RAC: 87,430 |
"CMS@Home jobs are available again." I have had a task running well for 28 minutes now. All seems to be okay again :-) |
Joined: 28 Sep 04 Posts: 711 Credit: 47,536,096 RAC: 31,760 |
I just had three tasks fail on the Condor ping. I did finish one successfully just before them (about 9 hours long). |
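
For anyone who wants to spot the "no jobs in the queue" case without reading each result page by hand, here is a minimal sketch in Python. It only looks for the log line quoted in the thread ("[ERROR] Condor exited after 626s without running a job."). The function names, the threshold of three failures, and the idea of treating several such failures as "queue is probably empty" are all assumptions made for illustration, not anything the project provides. The project URL is taken from the result links above, and the `boinccmd --project URL nomorework` call is built but never executed here.

```python
import re

# Pattern for the log line quoted in the thread:
# "[ERROR] Condor exited after 626s without running a job."
NO_JOB_RE = re.compile(
    r"\[ERROR\] Condor exited after (\d+)s without running a job"
)

# Assumption: project URL inferred from the result links in this thread.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def no_job_runtime(log_text):
    """Return the runtime in seconds if the log shows the no-job exit, else None."""
    match = NO_JOB_RE.search(log_text)
    return int(match.group(1)) if match else None


def queue_looks_empty(task_logs, threshold=3):
    """Hypothetical heuristic: True if at least `threshold` logs show the no-job exit."""
    hits = sum(1 for log in task_logs if no_job_runtime(log) is not None)
    return hits >= threshold


def no_new_tasks_command(project_url=PROJECT_URL):
    """Build (but do not run) the boinccmd call that sets No New Tasks."""
    return ["boinccmd", "--project", project_url, "nomorework"]


if __name__ == "__main__":
    sample = "[ERROR] Condor exited after 626s without running a job."
    print(no_job_runtime(sample))
    if queue_looks_empty([sample] * 3):
        print("Suggested command:", " ".join(no_new_tasks_command()))
```

Note that `nomorework` applies to the whole project rather than to CMS alone, so other LHC@home applications would stop fetching work as well; `allowmorework` reverses it once jobs are available again.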
©2024 CERN