Thread 'CMS Tasks Failing'

Author	Message
Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 42855 - Posted: 12 Jun 2020, 22:39:11 UTC - in response to Message 42849.
> OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure.
Could that be caused by a reboot? I don't recall having that problem with CMS before, but I lost two of them this afternoon, apparently around the time of a reboot.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277291625
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277292618
ID: 42855

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,640,283 RAC: 74,641	Message 42857 - Posted: 13 Jun 2020, 4:46:27 UTC
Since last night, the tasks have been failing again with
206 (0x000000CE) EXIT_INIT_FAILURE
Excerpt from the log:
2020-06-12 22:43:29 (118344): Guest Log: [DEBUG] DC_NOP failed!
2020-06-12 22:43:29 (118344): Guest Log: SECMAN:2006:Failed to establish a crypto key.
2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:40:58 recognized DC_NOP as command name, using command 60011.
2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:42:52 WARNING: globus returned with euid 0
2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:42:58 SECMAN: enable_mac has no key to use, failing...
2020-06-12 22:43:48 (118344): Guest Log: [ERROR] Could not ping HTCondor.
2020-06-12 22:43:56 (118344): Guest Log: [INFO] Shutting Down.
The whole stderr is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277447404
Too bad, I was so hopeful that CMS was finally working well again :-(
ID: 42857

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1158 Credit: 11,822,967 RAC: 6,128	Message 42859 - Posted: 13 Jun 2020, 17:13:51 UTC - in response to Message 42855.
>> OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure.
> Could that be caused by a reboot? I don't recall having that problem with CMS before, but I lost two of them this afternoon, apparently around the time of a reboot.
> https://lhcathome.cern.ch/lhcathome/result.php?resultid=277291625
> https://lhcathome.cern.ch/lhcathome/result.php?resultid=277292618
Possibly, if you haven't made sure that BOINC is fully closed and VirtualBox has finished before the reboot. To restart successfully, they need to have safely saved their configurations to disk.
ID: 42859

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1158 Credit: 11,822,967 RAC: 6,128	Message 42860 - Posted: 13 Jun 2020, 17:24:23 UTC - in response to Message 42857.
Erich, that looks like a network failure at first glance. Usually the last [DEBUG] is 0. before the BOINC task starts. A quick gwgl shows that this is something Laurence has experience with...
https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-July/msg00003.shtmll
I trust it is transient, let us know again if not.
ID: 42860

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,640,283 RAC: 74,641	Message 42863 - Posted: 13 Jun 2020, 19:56:46 UTC - in response to Message 42860.
> Erich, that looks like a network failure at first glance. ...
Hm, that's interesting. As a matter of fact, ATLAS tasks are now also failing on all 3 machines on which I've tried them. They fail after about 10-12 minutes. I've posted all details here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5439&postid=42858#42858
Strangely enough, Theory tasks run well on all of my machines. Plus, I haven't noticed any network problems otherwise.
ID: 42863

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,097,840 RAC: 1,811	Message 42919 - Posted: 28 Jun 2020, 6:24:51 UTC
Request X509 credentials fails:
ID: 42919

Harri Liljeroos Joined: 28 Sep 04 Posts: 804 Credit: 65,911,742 RAC: 28,242	Message 42920 - Posted: 28 Jun 2020, 11:36:56 UTC - in response to Message 42919.
I had a lot of those errors yesterday when the servers were down, but since they came up again yesterday afternoon (about 13:30 UTC) everything has been working OK.
ID: 42920

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,640,283 RAC: 74,641	Message 42921 - Posted: 28 Jun 2020, 12:35:25 UTC - in response to Message 42920.
> I had a lot of those errors yesterday when the servers were down, but since they came up again yesterday afternoon (about 13:30 UTC) everything has been working OK.
Same situation here, on all machines.
ID: 42921

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,097,840 RAC: 1,811	Message 42922 - Posted: 28 Jun 2020, 14:00:27 UTC
Because of your successes, I gave it another try. No success for me: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279017463
ID: 42922

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,722,395 RAC: 27,537	Message 42923 - Posted: 28 Jun 2020, 14:26:57 UTC - in response to Message 42922.
Crystal, your task shows this line at the beginning of the log:
<message> De bestandsnaam of -extensie is te lang. (0xce) - exit code 206 (0xce)</message> <stderr_txt>
[The Dutch text reads: "The filename or extension is too long."]
Do you have your own proxy?
ID: 42923

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,097,840 RAC: 1,811	Message 42924 - Posted: 28 Jun 2020, 16:17:41 UTC - in response to Message 42923.
I'm not using my own proxy. It's all BOINC default. At the same time (1st CMS attempt), I was running Theory and ATLAS successfully. Exit code 206 does not always provide the right error message.
ID: 42924

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,722,395 RAC: 27,537	Message 42926 - Posted: 28 Jun 2020, 17:51:53 UTC
Crystal, the Microsoft Windows 10 Core x64 Edition, (10.00.19041.00) upgrade, was this today? Did CMS tasks finish successfully on this computer this morning?
ID: 42926

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,097,840 RAC: 1,811	Message 42927 - Posted: 28 Jun 2020, 19:43:08 UTC - in response to Message 42926.
> the Microsoft Windows 10 Core x64 Edition, (10.00.19041.00) upgrade, was this today? Did CMS tasks finish successfully on this computer this morning?
The Windows update was on the 18th of June, and a successful CMS task was on the 22nd.
ID: 42927

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,097,840 RAC: 1,811	Message 42928 - Posted: 29 Jun 2020, 5:49:46 UTC
A new day, a new try. Now with success. The X509 credential request happened only once, but this time only for LHC@home and not for the vLHC@home-dev system. Within 2 seconds the fast benchmark started and the startup continued successfully. Maybe it was a DNS issue after the outage from the server downtime on Saturday, or the problem was with vLHC@home-dev.
ID: 42928

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,640,283 RAC: 74,641	Message 42933 - Posted: 1 Jul 2020, 5:11:44 UTC
Last night, several tasks failed. There were two different error patterns:
1) failure after about 20 minutes with: 207 (0x000000CF) EXIT_NO_SUB_TASKS
see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279113738
2) failure after about 5 hours with: 1 (0x00000001) Unknown error code
see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279084319
What's going wrong?
ID: 42933

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,097,840 RAC: 1,811	Message 42934 - Posted: 1 Jul 2020, 7:04:12 UTC - in response to Message 42933.
> what's going wrong?
I really don't know. My last task started OK, ran for over 12 hours elapsed, and was invalid -> Exit status 1 (0x00000001) Unknown error code
2020-06-29 07:30:10 (8048): vboxwrapper (7.7.26197): starting
.
.
2020-06-29 19:04:49 (8048): Status Report: Elapsed Time: '30029.368874'
2020-06-29 19:04:49 (8048): Status Report: CPU Time: '29827.984375'
2020-06-29 19:59:52 (8048): Guest Log: [ERROR] Condor ended after 44524 seconds.
2020-06-29 19:59:52 (8048): Guest Log: [INFO] Shutting Down.
2020-06-29 19:59:52 (8048): VM Completion File Detected.
2020-06-29 19:59:52 (8048): VM Completion Message: Condor ended after 44524 seconds.
.
2020-06-29 19:59:52 (8048): Powering off VM.
2020-06-29 19:59:53 (8048): Successfully stopped VM.
Result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279039834
IMO the job result was uploaded OK by gfal_copy from the VM.
I won't touch CMS until the problems are ironed out.
ID: 42934
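Editorial note: the "over 12 hours" figure can be cross-checked against the "Condor ended after 44524 seconds" line quoted in the log above. The following is only a minimal sketch; the names `CONDOR_END` and `condor_runtime` are hypothetical helpers introduced for illustration and are not part of vboxwrapper or any CMS script.

```python
import re
from datetime import timedelta

# Hypothetical pattern for the guest-log line quoted in the post above.
CONDOR_END = re.compile(r"Condor ended after (\d+) seconds")


def condor_runtime(log_text):
    """Return the reported Condor runtime as a timedelta, or None if absent."""
    match = CONDOR_END.search(log_text)
    if match is None:
        return None
    return timedelta(seconds=int(match.group(1)))


line = ("2020-06-29 19:59:52 (8048): Guest Log: "
        "[ERROR] Condor ended after 44524 seconds.")
print(condor_runtime(line))  # 12:22:04
```

The 44524 seconds reported by the guest correspond to 12 h 22 min 4 s, which agrees with the elapsed time described in the post.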

Harri Liljeroos Joined: 28 Sep 04 Posts: 804 Credit: 65,911,742 RAC: 28,242	Message 42935 - Posted: 1 Jul 2020, 7:39:51 UTC
Failures here also for all CMS tasks that have started this morning. Less than 10 to go and then they are all out of my system.
ID: 42935

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,616,542 RAC: 105,984	Message 42936 - Posted: 1 Jul 2020, 7:51:12 UTC
It's caused by 2 errors.
The obvious one: If a task starts but there are no subtasks available, it fails with "EXIT_NO_SUB_TASKS". This is the expected behavior.
Not so obvious: A task starts and runs the 1st subtask. Then it requests a 2nd subtask but there is none in the queue. In this case the task fails with "Unknown error code".
As far as I see in my statistics, all goes fine if at least a 2nd subtask can be run.
This mess is older than the big bang and hidden deep in the interaction between htcondor, wmagent and CMS scripts, and needs to be fixed by the developers.
ID: 42936
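Editorial note: the two failure modes described above can be restated as a small decision function. This is only a sketch of the behaviour as reported in this thread, not the actual vboxwrapper, HTCondor, or WMAgent logic; the function name `observed_outcome` and its parameter are hypothetical, and the exit-code strings are the ones quoted in Message 42933.

```python
def observed_outcome(subtasks_obtained):
    """Map the number of subtasks a task managed to run to the reported result.

    Assumption: this mirrors only what the posts in this thread describe.
    """
    if subtasks_obtained < 0:
        raise ValueError("subtasks_obtained must be non-negative")
    if subtasks_obtained == 0:
        # No subtask available at start: the expected, documented failure.
        return "207 (0x000000CF) EXIT_NO_SUB_TASKS"
    if subtasks_obtained == 1:
        # First subtask ran, but the queue was empty when a second was requested.
        return "1 (0x00000001) Unknown error code"
    # At least a second subtask could be run: reported as working fine.
    return "success"


for count in (0, 1, 2):
    print(count, "->", observed_outcome(count))
```

Seen this way, both error patterns are symptoms of the same condition, an empty job queue, which differ only in when the task notices it.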

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1158 Credit: 11,822,967 RAC: 6,128	Message 42937 - Posted: 1 Jul 2020, 9:52:48 UTC - in response to Message 42935.
> Failures here also for all CMS tasks that have started this morning. Less than 10 to go and then they are all out of my system.
Sorry, guys, something went wrong with the WMAgent overnight, and it stopped creating jobs -- so we ran out. It looks like some DB or cache space filled up and several components, including the JobCreator, failed. I've messaged people who can fix it.
The good news is that we have finally tracked down the cause of the problem where jobs were sent to the condor server with the explicit requirement not to run on a volunteer machine. It was basically a misunderstanding of how the python bindings into HTCondor work, but the code causing the problem is no longer needed and has been removed. This only just happened so I'm not sure when the patch will be deployed -- hopefully as they fix the current problem.
ID: 42937

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1158 Credit: 11,822,967 RAC: 6,128	Message 42938 - Posted: 1 Jul 2020, 13:13:27 UTC - in response to Message 42937. Last modified: 1 Jul 2020, 13:26:04 UTC
OK, we got our Oracle DB quota increased from 140 MB to 3 GB... The agent has been restarted and jobs should be available soon. I'm trying to check if there are still problems but it's a bit slow on my pathetic link.
[Later] OK, looking good, already 70 jobs running.[/Later]
ID: 42938