Thread 'EXIT_INIT_FAILURE 206, check here if there is work'

Author	Message
Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 946 Credit: 782,854,190 RAC: 127,972	Message 30104 - Posted: 29 Apr 2017, 8:18:53 UTC To check whether there is work, look at this status page: http://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,251,205 RAC: 95,995	Message 30170 - Posted: 3 May 2017, 6:09:34 UTC What puzzles me is how the graph should be interpreted. If you look at the timestamp 2017-05-03:00:00 (last midnight), the green line shows 100 jobs. Check that point again in a few hours and you will see the number of jobs rising, although the timestamp is in the past. Can anybody explain that?

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 946 Credit: 782,854,190 RAC: 127,972	Message 30173 - Posted: 3 May 2017, 6:16:48 UTC I look at the current value, which is 1.02; that is almost zero. I'm not sure what else to take from it.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,251,205 RAC: 95,995	Message 30174 - Posted: 3 May 2017, 6:40:10 UTC I also interpret this as "no jobs available". Yesterday morning the most recent entry was also 0. But look at the graph now. It shows more than 300 jobs at that timestamp.

Luca Tomassetti Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0	Message 30175 - Posted: 3 May 2017, 6:42:59 UTC - in response to Message 30173. The plot is generated from accounting data. This introduces some delay between the moment a job finishes in your VM and the moment its outputs are processed further and its status is set. In addition (sigh), the last point to the right is always 0. That doesn't mean that jobs are not available and/or are not running/finishing. For instance, at the moment there are ~150 jobs running. L

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,251,205 RAC: 95,995	Message 30176 - Posted: 3 May 2017, 7:16:24 UTC - in response to Message 30175. Thank you, Luca. Where can a normal user see whether there are jobs ready to send (and how many)? I don't mean the server status page, since there you can only see the number of available WUs. If you start a WU and there are no jobs available, you will get an EXIT_INIT_FAILURE. This is what has to be avoided.

Luca Tomassetti Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0	Message 30183 - Posted: 3 May 2017, 17:29:14 UTC - in response to Message 30176. Hi, the plots should be more reliable now (still with extrapolation to 0 to the right). I just fixed an issue in the post-processing that slowed down the status update a lot (and consequently the plots). I am still investigating the issue with EXIT_INIT_FAILURE: in principle there should always be jobs to be picked up by the VMs, apart from temporary issues, which is not the case these days. LHCb does not pre-select jobs to be sent to the community; you pick up jobs from the same 'queue' as all other sites. I'll report on this asap.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,251,205 RAC: 95,995	Message 30187 - Posted: 4 May 2017, 5:48:42 UTC Got 2 jobs on each of my hosts this morning. Looks ok so far.

Luca Tomassetti Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0	Message 30202 - Posted: 4 May 2017, 12:22:40 UTC - in response to Message 30187. Hi, the issue with the 206 error should also be mitigated now (since last night). It was a glitch on the BOINC server side that prevented workloads from being sent to the VMs even though LHCb had jobs available. Please try to run LHCb jobs! Cheers, Luca

djoser Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0	Message 32671 - Posted: 7 Oct 2017, 11:53:11 UTC Last modified: 7 Oct 2017, 11:55:43 UTC My dedicated LHCb machine has produced nothing but EXIT_INIT_FAILURE 206 since today. Is anything wrong with BOINC again, or is the project out of work right now? I found these lines in the logfile:
2017-10-07 13:49:29 (3059): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0
2017-10-07 13:49:29 (3059): Guest Log: globus_credential: Error reading proxy credential
2017-10-07 13:49:29 (3059): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio
2017-10-07 13:49:29 (3059): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
2017-10-07 13:49:29 (3059): Guest Log: Use -debug for further information.
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] Could not get an x509 credential
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] The x509 proxy creation failed.
Greetz, djoser. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1553 Credit: 10,094,171 RAC: 1,912	Message 32673 - Posted: 7 Oct 2017, 12:16:59 UTC - in response to Message 32671. Quoting djoser:
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] Could not get an x509 credential
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] The x509 proxy creation failed.
Greetz, djoser.
Your VM cannot contact the CERN server because the authentication failed. The problem is on the project's side, and since it's the weekend we will probably have to wait until Monday. I have the same problem with the Theory tasks.

djoser Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0	Message 32674 - Posted: 7 Oct 2017, 12:20:19 UTC - in response to Message 32673. Last modified: 7 Oct 2017, 12:20:43 UTC Thanks for your answer. I wonder which projects are affected. So far I know about LHCb and Theory tasks. I have set my machine to no new work for the moment. Regards, djoser.

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 431 Credit: 255,793 RAC: 48	Message 32679 - Posted: 7 Oct 2017, 21:11:25 UTC - in response to Message 32673. The problem should now be fixed.

Harri Liljeroos Joined: 28 Sep 04 Posts: 804 Credit: 65,825,918 RAC: 26,855	Message 32966 - Posted: 2 Nov 2017, 14:54:05 UTC I am seeing a lot of task failures with the error in the title, like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163450255 And also with error 207 (0x000000CF) EXIT_NO_SUB_TASKS, like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163449631
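For readers matching the decimal codes in this thread against the hexadecimal form quoted from the result page, here is a minimal Python sketch. The helper name `exit_code_label` is made up for illustration, and the name table holds only the two codes mentioned in this thread.

```python
# Only the two exit codes named in this thread; not a complete BOINC table.
KNOWN_EXIT_CODES = {
    206: "EXIT_INIT_FAILURE",
    207: "EXIT_NO_SUB_TASKS",
}

def exit_code_label(code: int) -> str:
    """Format a code as 'decimal (0xHEX) NAME', e.g. for 207."""
    name = KNOWN_EXIT_CODES.get(code, "unknown")
    # 08X pads the hexadecimal value to eight upper-case digits.
    return f"{code} (0x{code:08X}) {name}"

for code in (206, 207):
    print(exit_code_label(code))
```

So the error in the thread title, 206, corresponds to 0x000000CE in the same notation.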

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,370,607 RAC: 66,291	Message 32994 - Posted: 5 Nov 2017, 14:27:33 UTC - in response to Message 32966. I've been experiencing exactly the same problems with CMS tasks during the past 4-5 days. In most cases, the stderr text of the failed task says somewhere (in different variations) that the connection to the Condor server was not possible. Since LHCb also needs to connect to the Condor server, these problems won't disappear until the Condor server problem is fixed.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,370,607 RAC: 66,291	Message 32997 - Posted: 5 Nov 2017, 16:37:06 UTC Just a few minutes ago, I had three LHCb tasks in a row which errored out after 10-14 minutes, with stderr:
2017-11-05 17:13:29 (5728): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-05 17:13:59 (5728): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-05 17:13:59 (5728): Guest Log: [DEBUG] 1
2017-11-05 17:13:59 (5728): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-05 17:13:59 (5728): Guest Log: [INFO] Shutting Down.
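The check that fails in this log is a plain TCP connection test to vccondor01.cern.ch on port 9618, and the timestamps show it giving up after about 30 seconds. A rough way to repeat that test from the host machine is sketched below in Python. The function name `can_connect` is hypothetical, and this is not the script the VM itself runs (the log shows it using nc).

```python
import socket

def can_connect(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return False
```

Calling `can_connect("vccondor01.cern.ch", 9618)` from the host uses the host name and port taken from the log lines above. A False result from the host would suggest a network or firewall problem on the volunteer side, while a True result only shows that the port is reachable at that moment and says nothing about the Condor service behind it.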

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,370,607 RAC: 66,291	Message 32998 - Posted: 5 Nov 2017, 17:37:58 UTC A minute ago, a task errored out after 19 minutes with 207 (0x000000CF) EXIT_NO_SUB_TASKS. More details: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163688697 In fact, all these errors are the same ones I get with CMS tasks. Is there, all of a sudden, something wrong with my system(s)? I don't think so, though, since other crunchers are reporting the same errors. When will someone at LHC look into these problems?

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,370,607 RAC: 66,291	Message 32999 - Posted: 5 Nov 2017, 17:43:02 UTC And now the next one with 207 (0x000000CF) EXIT_NO_SUB_TASKS, erroring out after 8 minutes. More details: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163692055 Excerpt:
2017-11-05 18:30:40 (1708): Guest Log: [DEBUG] DC_NOP failed!
2017-11-05 18:30:40 (1708): Guest Log: SECMAN:2007:Failed to end classad message.
2017-11-05 18:30:40 (1708): Guest Log: 11/05/17 18:30:31 recognized DC_NOP as command name, using command 60011.
2017-11-05 18:30:40 (1708): Guest Log: 11/05/17 18:30:52 SECMAN: no classad from server, failing
2017-11-05 18:30:43 (1708): Guest Log: [ERROR] Could not ping HTCondor.
2017-11-05 18:30:43 (1708): Guest Log: [INFO] Shutting Down.
What's wrong with the Condor Server?
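The stderr excerpts posted in this thread contain three different kinds of [ERROR] lines: the x509 proxy failure, the failed connection to the Condor server port, and the failed HTCondor ping. As a rough aid for sorting failed tasks, the Python sketch below scans a stderr text for them. The category labels and the function name `classify_stderr` are this sketch's own choices, and the patterns cover only the messages quoted here, not everything the VM may log.

```python
import re

# Patterns taken from the [ERROR] lines quoted in this thread.
# The labels on the left are made up for this sketch.
PATTERNS = [
    ("x509 proxy failure", re.compile(r"x509 (credential|proxy)")),
    ("Condor port unreachable", re.compile(r"Could not connect to Condor server")),
    ("HTCondor ping failed", re.compile(r"Could not ping HTCondor")),
]

def classify_stderr(stderr_text: str) -> list[str]:
    """Return the labels of all known error kinds found, in order of first appearance."""
    found = []
    for line in stderr_text.splitlines():
        if "[ERROR]" not in line:
            continue  # only the [ERROR] lines are classified
        for label, pattern in PATTERNS:
            if pattern.search(line) and label not in found:
                found.append(label)
    return found
```

Pasting the stderr of a failed task into `classify_stderr` returns a list of labels. An empty list means none of the three messages from this thread appeared, which may point to a different cause.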

Harri Liljeroos Joined: 28 Sep 04 Posts: 804 Credit: 65,825,918 RAC: 26,855	Message 33001 - Posted: 5 Nov 2017, 18:57:37 UTC If you check the graphs of CMS jobs here: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php and of LHCb jobs here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you can see that quite a lot of jobs are being crunched by someone. So the connection to Condor is working for some people. Sadly, many of us seem to get failures most of the time. The problem is not easy to find, I think. Imagine how much more work could be done if the connection were reliable.