Message boards : LHCb Application : EXIT_INIT_FAILURE 206, check here if there is work

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 644,769,949
RAC: 232,021
Message 30104 - Posted: 29 Apr 2017, 8:18:53 UTC

To check whether there is work, look at this status page:

http://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php
ID: 30104
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,954,549
RAC: 136,930
Message 30170 - Posted: 3 May 2017, 6:09:34 UTC

What puzzles me is how the graph should be interpreted.
If you look at the timestamp 2017-05-03:00:00 (last midnight), the green line shows 100 jobs.
Check that point again in a few hours and you will see the number of jobs rising, although the timestamp is in the past.

Can anybody explain that?
ID: 30170
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 644,769,949
RAC: 232,021
Message 30173 - Posted: 3 May 2017, 6:16:48 UTC

I look at the current value being 1.02, which is almost zero. I'm not sure what else to take from it.
ID: 30173
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,954,549
RAC: 136,930
Message 30174 - Posted: 3 May 2017, 6:40:10 UTC

I also interpret this as "no jobs available".
Yesterday morning the most recent entry was also 0.
But look at the graph now. It shows more than 300 jobs at that timestamp.
ID: 30174
Luca Tomassetti
Joined: 26 Apr 17
Posts: 7
Credit: 22,463
RAC: 0
Message 30175 - Posted: 3 May 2017, 6:42:59 UTC - in response to Message 30173.  

The plot is generated from accounting data. This introduces some delay between the moment a job finishes in your VM and the moment its outputs are processed further and its status is set.
In addition (sigh), the last point on the right is always 0. That doesn't mean that jobs are not available and/or are not running/finishing.
For instance, at the moment there are ~150 jobs running.
L
ID: 30175
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,954,549
RAC: 136,930
Message 30176 - Posted: 3 May 2017, 7:16:24 UTC - in response to Message 30175.  

Thank you Luca.

Where can a normal user see whether there are jobs ready to send (and how many)?
I don't mean the server status page, as there you can only see the number of available WUs.
If you start a WU and there are no jobs available, you will get an EXIT_INIT_FAILURE.
This is what has to be avoided.
ID: 30176
Luca Tomassetti
Joined: 26 Apr 17
Posts: 7
Credit: 22,463
RAC: 0
Message 30183 - Posted: 3 May 2017, 17:29:14 UTC - in response to Message 30176.  

Hi,
the plots should now be more reliable (still with extrapolation to 0 on the right).
I just fixed an issue in the post-processing that slowed down the status update a lot (and consequently the plots).

I am still investigating the issue with EXIT_INIT_FAILURE:
in principle there should always be jobs for the VMs to pick up, apart from temporary issues, which is not the case these days. LHCb does not pre-select jobs to be sent to the community; you pick up jobs from the same 'queue' as all other sites.

I'll report on this asap.
ID: 30183
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,954,549
RAC: 136,930
Message 30187 - Posted: 4 May 2017, 5:48:42 UTC

Got 2 jobs on each of my hosts this morning.
Looks ok so far.
ID: 30187
Luca Tomassetti
Joined: 26 Apr 17
Posts: 7
Credit: 22,463
RAC: 0
Message 30202 - Posted: 4 May 2017, 12:22:40 UTC - in response to Message 30187.  

Hi,

the issue with the 206 error should also be mitigated now (since last night).
It was a glitch on the BOINC server side which prevented workloads from being sent to the VMs even though LHCb had jobs available.

Please, try to run LHCb jobs!

Cheers,
Luca
ID: 30202
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 32671 - Posted: 7 Oct 2017, 11:53:11 UTC
Last modified: 7 Oct 2017, 11:55:43 UTC

My dedicated LHCb machine has produced nothing but EXIT_INIT_FAILURE 206 since today.
Is anything wrong with BOINC again, or is the project out of work right now?

Found these lines in the logfile:

2017-10-07 13:49:29 (3059): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0
2017-10-07 13:49:29 (3059): Guest Log: globus_credential: Error reading proxy credential
2017-10-07 13:49:29 (3059): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio
2017-10-07 13:49:29 (3059): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
2017-10-07 13:49:29 (3059): Guest Log: Use -debug for further information.
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] Could not get an x509 credential
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] The x509 proxy creation failed.
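For anyone who wants to pull such lines out of a long stderr log, here is a minimal Python sketch. It assumes every line has the layout quoted above (`date time (pid): Guest Log: message`); the function name `guest_errors` is made up for this example and is not part of BOINC or vboxwrapper.

```python
import re

# Layout assumed from the lines quoted above:
# "<date> <time> (<pid>): Guest Log: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<pid>\d+)\): Guest Log: (?P<msg>.*)$"
)


def guest_errors(log_text):
    """Return (timestamp, message) pairs for guest log lines containing 'ERROR'."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        # Case-sensitive on purpose: matches "ERROR:" and "[ERROR]" only.
        if match and "ERROR" in match.group("msg"):
            hits.append((match.group("ts"), match.group("msg")))
    return hits


sample = """\
2017-10-07 13:49:29 (3059): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0
2017-10-07 13:49:29 (3059): Guest Log: Use -debug for further information.
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] The x509 proxy creation failed.
"""

for ts, msg in guest_errors(sample):
    print(ts, msg)
```

The match is deliberately case-sensitive, so a line such as "OpenSSL Error: ..." is skipped; loosen the test if you want those as well.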

Greetz, djoser.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 32671
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 32673 - Posted: 7 Oct 2017, 12:16:59 UTC - in response to Message 32671.  

2017-10-07 13:49:29 (3059): Guest Log: [ERROR] Could not get an x509 credential
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] The x509 proxy creation failed.

Greetz, djoser.

Your VM cannot contact the CERN server because the authentication failed.

The problem is on the project side, and since it's the weekend we will probably have to wait until Monday.

I have the same problem with the Theory tasks.
ID: 32673
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 32674 - Posted: 7 Oct 2017, 12:20:19 UTC - in response to Message 32673.  
Last modified: 7 Oct 2017, 12:20:43 UTC

Thanks for your answer.
I wonder which projects are affected.
So far I know about LHCb and Theory tasks.
I've set my machine to no new work for the moment.

Regards, djoser.
ID: 32674
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 32679 - Posted: 7 Oct 2017, 21:11:25 UTC - in response to Message 32673.  

The problem should now be fixed.
ID: 32679
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 32966 - Posted: 2 Nov 2017, 14:54:05 UTC

A lot of task failures with the error in the title. Like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163450255

And also with error 207 (0x000000CF) EXIT_NO_SUB_TASKS. Like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163449631
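As a quick cross-check of the decimal and hex forms of these codes, here is a small Python sketch. The mapping contains only the two names mentioned in this thread; `describe_exit` is a made-up helper, not a BOINC function.

```python
# Only the two exit codes named in this thread; anything else is reported as UNKNOWN.
EXIT_CODES = {
    206: "EXIT_INIT_FAILURE",
    207: "EXIT_NO_SUB_TASKS",
}


def describe_exit(code):
    """Format a code in the form quoted above, e.g. '207 (0x000000CF) EXIT_NO_SUB_TASKS'."""
    name = EXIT_CODES.get(code, "UNKNOWN")
    return f"{code} (0x{code:08X}) {name}"


print(describe_exit(206))
print(describe_exit(207))
```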
ID: 32966
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,395,668
RAC: 102,181
Message 32994 - Posted: 5 Nov 2017, 14:27:33 UTC - in response to Message 32966.  

A lot of task failures with the error in the title. Like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163450255

And also with error 207 (0x000000CF) EXIT_NO_SUB_TASKS. Like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163449631


I've been experiencing exactly the same problems with CMS tasks during the past 4-5 days.

In most cases, the stderr text of the failed task said (in different variations) that the connection to the Condor server was not possible.

Since LHCb also needs to connect to the Condor server, these problems won't disappear as long as the Condor server problem is not fixed.
ID: 32994
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,395,668
RAC: 102,181
Message 32997 - Posted: 5 Nov 2017, 16:37:06 UTC

Just a few minutes ago, I had three LHCb tasks in a row which errored out after 10-14 minutes, with stderr:

2017-11-05 17:13:29 (5728): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-05 17:13:59 (5728): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-05 17:13:59 (5728): Guest Log: [DEBUG] 1
2017-11-05 17:13:59 (5728): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-05 17:13:59 (5728): Guest Log: [INFO] Shutting Down.
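
A rough way to repeat this kind of reachability test from the host is a plain TCP connect, sketched below in Python. This is an approximation only: the check in the log runs inside the VM with `nc`, and `can_connect` is a made-up helper name. The 30-second default is an assumption read off the gap between the two timestamps above.

```python
import socket


def can_connect(host, port, timeout=30.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False
```

Calling `can_connect("vccondor01.cern.ch", 9618)` with the host and port from the log gives a yes/no answer, but a success from the host does not guarantee that the VM's own network path works.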
ID: 32997
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,395,668
RAC: 102,181
Message 32998 - Posted: 5 Nov 2017, 17:37:58 UTC

A minute ago, a task errored out after 19 minutes with

207 (0x000000CF) EXIT_NO_SUB_TASKS

More details: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163688697

In fact, these are all the same errors I have with CMS tasks.

Is there, all of a sudden, something wrong with my system(s)? I don't think so, though, since other crunchers are reporting the same errors.

When will someone at LHC look into these problems?
ID: 32998
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,395,668
RAC: 102,181
Message 32999 - Posted: 5 Nov 2017, 17:43:02 UTC

And now the next one, with

207 (0x000000CF) EXIT_NO_SUB_TASKS

erroring out after 8 minutes.

More details: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163692055

Excerpt:

2017-11-05 18:30:40 (1708): Guest Log: [DEBUG] DC_NOP failed!
2017-11-05 18:30:40 (1708): Guest Log: SECMAN:2007:Failed to end classad message.
2017-11-05 18:30:40 (1708): Guest Log: 11/05/17 18:30:31 recognized DC_NOP as command name, using command 60011.
2017-11-05 18:30:40 (1708): Guest Log: 11/05/17 18:30:52 SECMAN: no classad from server, failing
2017-11-05 18:30:43 (1708): Guest Log: [ERROR] Could not ping HTCondor.
2017-11-05 18:30:43 (1708): Guest Log: [INFO] Shutting Down.


What's wrong with the Condor Server?
ID: 32999
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 33001 - Posted: 5 Nov 2017, 18:57:37 UTC

If you check the graphs of CMS jobs here: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php and LHCb jobs here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you can see that quite a lot of jobs are being crunched by someone. So the connection to Condor is working for some people. Sadly, many of us seem to get failures most of the time. Not easy to find the problem, I think.

Imagine how much more work could be done if the connection was reliable.
ID: 33001