Message boards :
CMS Application :
CMS Tasks Failing
Author | Message |
---|---|
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure." Could that be caused by a reboot? I don't recall having that problem with CMS before, but I lost two of them this afternoon, apparently around the time of a reboot. https://lhcathome.cern.ch/lhcathome/result.php?resultid=277291625 https://lhcathome.cern.ch/lhcathome/result.php?resultid=277292618 |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,535,315 RAC: 42,794 |
Since last night, the tasks are failing again with 206 (0x000000CE) EXIT_INIT_FAILURE. Excerpt from the log: 2020-06-12 22:43:29 (118344): Guest Log: [DEBUG] DC_NOP failed! 2020-06-12 22:43:29 (118344): Guest Log: SECMAN:2006:Failed to establish a crypto key. 2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:40:58 recognized DC_NOP as command name, using command 60011. 2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:42:52 WARNING: globus returned with euid 0 2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:42:58 SECMAN: enable_mac has no key to use, failing... 2020-06-12 22:43:48 (118344): Guest Log: [ERROR] Could not ping HTCondor. 2020-06-12 22:43:56 (118344): Guest Log: [INFO] Shutting Down. The whole stderr is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277447404 Too bad, I was so hopeful that CMS was finally working well again :-( |
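For anyone who wants to pull the tagged guest messages out of a long stderr, here is a minimal Python sketch. It assumes the line layout seen in the excerpt above (timestamp, process id in parentheses, `Guest Log:`, then a bracketed level); the function name `guest_messages` and the regular expression are illustrative only and are not part of BOINC or vboxwrapper.

```python
import re

# Assumed layout, taken from the excerpt in the post above:
# "<date> <time> (<pid>): Guest Log: [<LEVEL>] <text>"
GUEST_LINE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<text>.*)$"
)


def guest_messages(stderr_text, level="ERROR"):
    """Return the text of all guest log lines tagged with the given level."""
    found = []
    for line in stderr_text.splitlines():
        match = GUEST_LINE.match(line.strip())
        if match and match.group("level") == level:
            found.append(match.group("text"))
    return found


# A few lines copied from the log excerpt in this thread.
SAMPLE = """\
2020-06-12 22:43:29 (118344): Guest Log: [DEBUG] DC_NOP failed!
2020-06-12 22:43:29 (118344): Guest Log: SECMAN:2006:Failed to establish a crypto key.
2020-06-12 22:43:48 (118344): Guest Log: [ERROR] Could not ping HTCondor.
2020-06-12 22:43:56 (118344): Guest Log: [INFO] Shutting Down.
"""

errors = guest_messages(SAMPLE)
print(errors)
```

Lines without a bracketed level, such as the SECMAN line, are skipped by this sketch, so it is only a first filter and not a replacement for reading the full stderr.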
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 7,884,757 RAC: 11,261 |
"OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure." Possibly, if you haven't made sure that BOINC is fully closed and VirtualBox finished before reboot. To successfully restart they need to have safely saved their configurations to disk. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 7,884,757 RAC: 11,261 |
Erich, that looks like a network failure at first glance. Usually the last [DEBUG] is 0. before the BOINC task starts. A quick gwgl shows that this is something Laurence has experience with... https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-July/msg00003.shtmll I trust it is transient, let us know again if not. |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,535,315 RAC: 42,794 |
"Erich, that looks like a network failure at first glance." ...hm, that's interesting. As a matter of fact, ATLAS tasks now also fail on all 3 machines on which I've tried them. They fail after about 10-12 minutes. I've posted all details here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5439&postid=42858#42858 Strangely enough, Theory tasks run well on all of my machines. Plus, I haven't noticed any network problems otherwise. |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,492,726 RAC: 820 |
Request X509 credentials fails: |
Send message Joined: 28 Sep 04 Posts: 735 Credit: 49,840,876 RAC: 35,261 |
I had a lot of those errors yesterday when the servers were down, but since they came up again yesterday afternoon (about 13:30 UTC) everything has been working OK. |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,535,315 RAC: 42,794 |
"I had a lot of those errors yesterday when the servers were down, but since they came up again yesterday afternoon (about 13:30 UTC) everything has been working OK." Same situation here, on all machines. |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,492,726 RAC: 820 |
Because of your successes, I gave it another try. No success for me: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279017463 |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494 |
Crystal, your task shows this line at the beginning of the log: <message> De bestandsnaam of -extensie is te lang. (0xce) - exit code 206 (0xce)</message> <stderr_txt> [Dutch: "The filename or extension is too long."] Do you have your own proxy? |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,492,726 RAC: 820 |
I'm not using my own proxy. It's all BOINC default. At the same time (1st CMS attempt), I was running Theory and ATLAS successfully. Exit code 206 does not always provide the right error message. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494 |
Crystal, was the Microsoft Windows 10 Core x64 Edition (10.00.19041.00) upgrade today? Did CMS tasks finish successfully on this computer this morning? |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,492,726 RAC: 820 |
"the Microsoft Windows 10 Core x64 Edition, (10.00.19041.00) upgrade," The Windows update was on the 18th of June and a successful CMS task was on the 22nd. |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,492,726 RAC: 820 |
A new day, a new try. Now with success. The X509 credential request happened only once, but this time only for LHC@home and not for the vLHC@home-dev system. Within 2 seconds the fast benchmark started and the startup continued successfully. Maybe it was a DNS issue after the outage from the server downtime on Saturday, or the problem was with vLHC@home-dev. |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,535,315 RAC: 42,794 |
Last night, several tasks failed. There were two different error patterns: 1) failure after about 20 minutes with: 207 (0x000000CF) EXIT_NO_SUB_TASKS see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279113738 2) failure after about 5 hours with: 1 (0x00000001) Unknown error code see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279084319 what's going wrong? |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,492,726 RAC: 820 |
"what's going wrong?" I really don't know. My last task started OK, ran for over 12 hours elapsed, and was invalid -> Exit status 1 (0x00000001) Unknown error code 2020-06-29 07:30:10 (8048): vboxwrapper (7.7.26197): starting . . 2020-06-29 19:04:49 (8048): Status Report: Elapsed Time: '30029.368874' 2020-06-29 19:04:49 (8048): Status Report: CPU Time: '29827.984375' 2020-06-29 19:59:52 (8048): Guest Log: [ERROR] Condor ended after 44524 seconds. 2020-06-29 19:59:52 (8048): Guest Log: [INFO] Shutting Down. 2020-06-29 19:59:52 (8048): VM Completion File Detected. 2020-06-29 19:59:52 (8048): VM Completion Message: Condor ended after 44524 seconds. . 2020-06-29 19:59:52 (8048): Powering off VM. 2020-06-29 19:59:53 (8048): Successfully stopped VM. Result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279039834 IMO the job result was uploaded OK by gfal_copy from the VM. I won't touch CMS until the problems are ironed out. |
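As a quick check of the "over 12 hours" figure, the Condor runtime can be read straight from the guest log line. This is a rough Python sketch that assumes the wording "Condor ended after N seconds" shown in the log above; the helper name `condor_runtime_hours` is made up for illustration.

```python
import re

# Assumed message wording, copied from the guest log quoted above.
CONDOR_END = re.compile(r"Condor ended after (\d+) seconds")


def condor_runtime_hours(log_text):
    """Return the Condor runtime in hours, or None if the message is absent."""
    match = CONDOR_END.search(log_text)
    if match is None:
        return None
    return int(match.group(1)) / 3600.0


line = ("2020-06-29 19:59:52 (8048): Guest Log: "
        "[ERROR] Condor ended after 44524 seconds.")
hours = condor_runtime_hours(line)
print(f"{hours:.2f} hours")
```

For the line above this gives roughly 12.37 hours, which matches the elapsed time reported in the post.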
Send message Joined: 28 Sep 04 Posts: 735 Credit: 49,840,876 RAC: 35,261 |
Failures here also for all CMS tasks that have started this morning. Less than 10 to go and then they are all out of my system. |
Send message Joined: 15 Jun 08 Posts: 2549 Credit: 255,268,827 RAC: 57,132 |
It's caused by 2 errors. The obvious one: If a task starts but there are no subtasks available it fails with "EXIT_NO_SUB_TASKS". This is the expected behavior. Not so obvious: A task starts and runs the 1st subtask. Then it requests a 2nd subtask but there is none in the queue. In this case the task fails with "Unknown error code". As far as I see in my statistics all goes fine if at least a 2nd subtask can be run. This mess is older than the big bang and hidden deep in the interaction between htcondor, wmagent and CMS scripts and needs to be fixed by the developers. |
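The two patterns described above, plus the 206 failures reported earlier in the thread, can be summarised in a small lookup. This is only a sketch in Python: the exit codes are the ones quoted in this thread, but the function `likely_cause` and its wording are assumptions for illustration, not an official diagnosis.

```python
def likely_cause(exit_code):
    """Map an exit code to the cause suggested in this thread (a guess only)."""
    if exit_code == 207:
        # Reported as EXIT_NO_SUB_TASKS: the queue was empty at startup.
        return "no subtask available when the task started"
    if exit_code == 1:
        # Reported as "Unknown error code": first subtask ran, second request
        # found an empty queue.
        return "queue ran empty when a further subtask was requested"
    if exit_code == 206:
        # Reported as EXIT_INIT_FAILURE, e.g. "Could not ping HTCondor".
        return "startup failed before any subtask ran"
    return "not discussed in this thread"


for code in (207, 1, 206):
    print(code, "->", likely_cause(code))
```

An exit code alone cannot prove the cause, so the stderr of the individual result still needs to be checked.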
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 7,884,757 RAC: 11,261 |
"Failures here also for all CMS tasks that have started this morning. Less than 10 to go and then they are all out of my system." Sorry, guys, something went wrong with the WMAgent overnight, and it stopped creating jobs -- so we ran out. It looks like some DB or cache space filled up and several components, including the JobCreator, failed. I've messaged people who can fix it. The good news is that we have finally tracked down the cause of the problem where jobs were sent to the condor server with the explicit requirement not to run on a volunteer machine. It was basically a misunderstanding of how the python bindings into HTCondor work, but the code causing the problem is no longer needed and has been removed. This only just happened so I'm not sure when the patch will be deployed -- hopefully as they fix the current problem. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 7,884,757 RAC: 11,261 |
|