Message boards : CMS Application : CMS Tasks Failing
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 42855 - Posted: 12 Jun 2020, 22:39:11 UTC - in response to Message 42849.  

OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure.

Could that be caused by a reboot? I don't recall having that problem with CMS before, but I lost two of them this afternoon, apparently around the time of a reboot.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277291625
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277292618
ID: 42855
Erich56
Joined: 18 Dec 15
Posts: 1712
Credit: 106,232,895
RAC: 73,745
Message 42857 - Posted: 13 Jun 2020, 4:46:27 UTC

Since last night, the tasks have been failing again with 206 (0x000000CE) EXIT_INIT_FAILURE.

excerpt from the log:

2020-06-12 22:43:29 (118344): Guest Log: [DEBUG] DC_NOP failed!
2020-06-12 22:43:29 (118344): Guest Log: SECMAN:2006:Failed to establish a crypto key.
2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:40:58 recognized DC_NOP as command name, using command 60011.
2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:42:52 WARNING: globus returned with euid 0
2020-06-12 22:43:29 (118344): Guest Log: 06/12/20 20:42:58 SECMAN: enable_mac has no key to use, failing...
2020-06-12 22:43:48 (118344): Guest Log: [ERROR] Could not ping HTCondor.
2020-06-12 22:43:56 (118344): Guest Log: [INFO] Shutting Down.

The whole stderr is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=277447404

Too bad, I was so hopeful that CMS was finally working well again :-(
ID: 42857
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42859 - Posted: 13 Jun 2020, 17:13:51 UTC - in response to Message 42855.  

OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure.

Could that be caused by a reboot? I don't recall having that problem with CMS before, but I lost two of them this afternoon, apparently around the time of a reboot.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277291625
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277292618

Possibly, if you didn't make sure that BOINC was fully closed and VirtualBox had finished before the reboot. To restart successfully, they need to have safely saved their configurations to disk.
ID: 42859
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42860 - Posted: 13 Jun 2020, 17:24:23 UTC - in response to Message 42857.  

Erich, that looks like a network failure at first glance. Usually the last [DEBUG] is 0. before the BOINC task starts.
A quick gwgl shows that this is something Laurence has experience with...
https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-July/msg00003.shtml
I trust it is transient, let us know again if not.
ID: 42860
Erich56
Joined: 18 Dec 15
Posts: 1712
Credit: 106,232,895
RAC: 73,745
Message 42863 - Posted: 13 Jun 2020, 19:56:46 UTC - in response to Message 42860.  

Erich, that looks like a network failure at first glance. ...
hm, that's interesting.
As a matter of fact, ATLAS tasks are now also failing on all 3 machines on which I've tried them. They fail after about 10-12 minutes.
I've posted all details here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5439&postid=42858#42858

Strangely enough, Theory tasks run well on all of my machines. Plus, I haven't noticed any network problems otherwise.
ID: 42863
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42919 - Posted: 28 Jun 2020, 6:24:51 UTC

Request X509 credentials fails:

ID: 42919
Harri Liljeroos
Joined: 28 Sep 04
Posts: 684
Credit: 44,176,939
RAC: 13,809
Message 42920 - Posted: 28 Jun 2020, 11:36:56 UTC - in response to Message 42919.  

I had a lot of those errors yesterday when the servers were down, but since they came up again yesterday afternoon (about 13:30 UTC) everything has been working OK.
ID: 42920
Erich56
Joined: 18 Dec 15
Posts: 1712
Credit: 106,232,895
RAC: 73,745
Message 42921 - Posted: 28 Jun 2020, 12:35:25 UTC - in response to Message 42920.  

I had a lot of those errors yesterday when the servers were down, but since they came up again yesterday afternoon (about 13:30 UTC) everything has been working OK.
same situation here, on all machines.
ID: 42921
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42922 - Posted: 28 Jun 2020, 14:00:27 UTC

Because of your successes, I gave it another try. No success for me: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279017463
ID: 42922
maeax
Joined: 2 May 07
Posts: 2150
Credit: 160,846,760
RAC: 50,441
Message 42923 - Posted: 28 Jun 2020, 14:26:57 UTC - in response to Message 42922.  

Crystal,
your task shows this line at the beginning of the log (the Dutch text means "The filename or extension is too long."):
<message>
De bestandsnaam of -extensie is te lang.
(0xce) - exit code 206 (0xce)</message>
<stderr_txt>
Do you have your own proxy?
ID: 42923
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42924 - Posted: 28 Jun 2020, 16:17:41 UTC - in response to Message 42923.  

I'm not using my own proxy. It's all BOINC default. At the same time (1st CMS attempt), I was running Theory and ATLAS successfully.

Exit code 206 does not always provide the right error message.
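
For readers puzzled by the mismatch: the number 206 gets interpreted twice. This thread reports it as EXIT_INIT_FAILURE on the BOINC side, while Windows attaches its own system message for error 206 ("The filename or extension is too long."). A minimal Python sketch of that double lookup follows. The dictionaries and the function `describe_exit_code` are hypothetical names used for illustration only, not part of BOINC or vboxwrapper, and the tables hold just the codes quoted in this thread.

```python
# Hypothetical lookup tables; only codes quoted in this thread are listed.
BOINC_EXIT_CODES = {
    206: "EXIT_INIT_FAILURE",
    207: "EXIT_NO_SUB_TASKS",
}

# Windows system error text for the same number (ERROR_FILENAME_EXCED_RANGE).
WINDOWS_ERROR_TEXT = {
    206: "The filename or extension is too long.",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code the way the result pages in this thread show it."""
    boinc_name = BOINC_EXIT_CODES.get(code, "Unknown error code")
    line = f"{code} (0x{code:08X}) {boinc_name}"
    windows_text = WINDOWS_ERROR_TEXT.get(code)
    if windows_text is not None:
        # The OS message belongs to an unrelated error that shares the number.
        line += f" [Windows text for the same number: {windows_text}]"
    return line


print(describe_exit_code(206))
print(describe_exit_code(207))
```

So the Windows text in the result header is probably best ignored here; the BOINC name is the one that reflects what the wrapper reported.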
ID: 42924
maeax
Joined: 2 May 07
Posts: 2150
Credit: 160,846,760
RAC: 50,441
Message 42926 - Posted: 28 Jun 2020, 17:51:53 UTC

Crystal,
was the Microsoft Windows 10 Core x64 Edition (10.00.19041.00) upgrade today?
Did CMS tasks finish successfully on this computer this morning?
ID: 42926
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42927 - Posted: 28 Jun 2020, 19:43:08 UTC - in response to Message 42926.  

was the Microsoft Windows 10 Core x64 Edition (10.00.19041.00) upgrade today?
Did CMS tasks finish successfully on this computer this morning?
The Win-update was on the 18th of June and a successful CMS was on the 22nd.
ID: 42927
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42928 - Posted: 29 Jun 2020, 5:49:46 UTC

A new day, a new try.
Now with success. The X509 credential request happened only once, but this time only for LHC@home and not for the vLHC@home-dev system.

Within 2 seconds the fast benchmark started and the startup continued successfully.
Maybe it was a DNS issue after the server downtime on Saturday, or the problem was with vLHC@home-dev.
ID: 42928
Erich56
Joined: 18 Dec 15
Posts: 1712
Credit: 106,232,895
RAC: 73,745
Message 42933 - Posted: 1 Jul 2020, 5:11:44 UTC

Last night, several tasks failed.
There were two different error patterns:

1) failure after about 20 minutes with: 207 (0x000000CF) EXIT_NO_SUB_TASKS
see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279113738

2) failure after about 5 hours with: 1 (0x00000001) Unknown error code
see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279084319

what's going wrong?
ID: 42933
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42934 - Posted: 1 Jul 2020, 7:04:12 UTC - in response to Message 42933.  

what's going wrong?
I really don't know. My last task started OK, ran for over 12 hours and was marked invalid -> Exit status 1 (0x00000001) Unknown error code
2020-06-29 07:30:10 (8048): vboxwrapper (7.7.26197): starting
.
.
2020-06-29 19:04:49 (8048): Status Report: Elapsed Time: '30029.368874'
2020-06-29 19:04:49 (8048): Status Report: CPU Time: '29827.984375'
2020-06-29 19:59:52 (8048): Guest Log: [ERROR] Condor ended after 44524 seconds.
2020-06-29 19:59:52 (8048): Guest Log: [INFO] Shutting Down.
2020-06-29 19:59:52 (8048): VM Completion File Detected.
2020-06-29 19:59:52 (8048): VM Completion Message: Condor ended after 44524 seconds.
.
2020-06-29 19:59:52 (8048): Powering off VM.
2020-06-29 19:59:53 (8048): Successfully stopped VM.

Result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279039834

IMO the job result was uploaded OK by gfal_copy from the VM.
I won't touch CMS until the problems are ironed out.
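
The runtime in the "Condor ended after ..." line can be pulled out of the log and converted to hours for comparison with the elapsed time. A rough Python sketch, assuming the log line has exactly the wording quoted above; `condor_runtime_seconds` and `format_hms` are made-up helper names, not vboxwrapper functions.

```python
import re
from typing import Optional

# Matches the guest-log wording quoted in this post (an assumption about the format).
CONDOR_ENDED = re.compile(r"Condor ended after (\d+) seconds")


def condor_runtime_seconds(log_text: str) -> Optional[int]:
    """Return the runtime reported by the first 'Condor ended' line, if any."""
    match = CONDOR_ENDED.search(log_text)
    return int(match.group(1)) if match else None


def format_hms(seconds: int) -> str:
    """Convert a number of seconds to an 'Hh MMm SSs' string."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


sample = (
    "2020-06-29 19:59:52 (8048): Guest Log: "
    "[ERROR] Condor ended after 44524 seconds."
)
runtime = condor_runtime_seconds(sample)
if runtime is not None:
    print(format_hms(runtime))  # prints 12h 22m 04s
```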
ID: 42934
Harri Liljeroos
Joined: 28 Sep 04
Posts: 684
Credit: 44,176,939
RAC: 13,809
Message 42935 - Posted: 1 Jul 2020, 7:39:51 UTC

Failures here also for all CMS tasks that have started this morning. Less than 10 to go and then they are all out of my system.
ID: 42935
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2443
Credit: 230,715,357
RAC: 122,429
Message 42936 - Posted: 1 Jul 2020, 7:51:12 UTC

It's caused by 2 errors.

The obvious one:
If a task starts but there are no subtasks available it fails with "EXIT_NO_SUB_TASKS".
This is the expected behavior.


Not so obvious:
A task starts and runs the 1st subtask.
Then it requests a 2nd subtask but there is none in the queue.
In this case the task fails with "Unknown error code".



As far as I see in my statistics all goes fine if at least a 2nd subtask can be run.
This mess is older than the big bang and hidden deep in the interaction between htcondor, wmagent and CMS scripts, and it needs to be fixed by the developers.
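
The pattern described above can be written down as a small Python sketch. It only restates the observed behaviour as a function of how many subtasks a task managed to run; `observed_outcome` is a hypothetical name, and this is not the actual vboxwrapper, HTCondor or WMAgent logic.

```python
def observed_outcome(subtasks_run: int) -> str:
    """Map the number of subtasks a task ran to the result seen in this thread.

    This mirrors the statistics reported above; it is an illustration,
    not the real decision logic inside the CMS VM.
    """
    if subtasks_run < 0:
        raise ValueError("subtasks_run must not be negative")
    if subtasks_run == 0:
        # Queue was empty when the task started: the expected failure.
        return "207 (0x000000CF) EXIT_NO_SUB_TASKS"
    if subtasks_run == 1:
        # Queue ran dry before a 2nd subtask could be fetched.
        return "1 (0x00000001) Unknown error code"
    # At least two subtasks ran: reported as fine so far.
    return "success"


for count in (0, 1, 2):
    print(count, "->", observed_outcome(count))
```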
ID: 42936
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42937 - Posted: 1 Jul 2020, 9:52:48 UTC - in response to Message 42935.  

Failures here also for all CMS tasks that have started this morning. Less than 10 to go and then they are all out of my system.

Sorry, guys, something went wrong with the WMAgent overnight, and it stopped creating jobs -- so we ran out. It looks like some DB or cache space filled up and several components, including the JobCreator, failed. I've messaged people who can fix it.
The good news is that we have finally tracked down the cause of the problem where jobs were sent to the condor server with the explicit requirement not to run on a volunteer machine. It was basically a misunderstanding of how the python bindings into HTCondor work, but the code causing the problem is no longer needed and has been removed. This only just happened so I'm not sure when the patch will be deployed -- hopefully as they fix the current problem.
ID: 42937
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42938 - Posted: 1 Jul 2020, 13:13:27 UTC - in response to Message 42937.  
Last modified: 1 Jul 2020, 13:26:04 UTC

OK, we got our Oracle DB quota increased from 140 MB to 3 GB... The agent has been restarted and jobs should be available soon. I'm trying to check if there are still problems but it's a bit slow on my pathetic link.
[Later] OK, looking good, already 70 jobs running.[/Later]
ID: 42938


©2024 CERN