Message boards :
LHCb Application :
EXIT_INIT_FAILURE 206, check here if there is work
Joined: 27 Sep 08 Posts: 798 Credit: 644,769,949 RAC: 232,021
To check if there is work, look at this status page: http://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php
Joined: 15 Jun 08 Posts: 2386 Credit: 222,954,549 RAC: 136,930
What makes me wonder is how the graph has to be interpreted. If you look at the timestamp 2017-05-03 00:00 (last midnight), the green line shows 100 jobs. Check that point again in a few hours and you will see the number of jobs rising, although the timestamp is in the past. Can anybody explain that?
Joined: 27 Sep 08 Posts: 798 Credit: 644,769,949 RAC: 232,021
I look at the current value: 1.02, which is almost zero. Not sure what else to take from it.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,954,549 RAC: 136,930
I also interpret this as "no jobs available". Yesterday morning the most recent entry was also 0. But look at the graph now. It shows more than 300 jobs at that timestamp.
Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0
The plot is generated from accounting data. This introduces some delay between the moment a job finishes in your VM and the moment the outputs are processed and the status is set. In addition (sigh), the last point on the right is always 0. That doesn't mean that jobs are not available and/or are not running/finishing. For instance, at the moment there are ~150 jobs running. L
Joined: 15 Jun 08 Posts: 2386 Credit: 222,954,549 RAC: 136,930
Thank you Luca. Where can a normal user see if there are jobs ready to send (and how many)? I'm not asking about the server status page, as there you can only see the number of available WUs. If you start a WU and there are no jobs available, you will get an EXIT_INIT_FAILURE. This is what has to be avoided.
Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0
Hi, the plots should now be more reliable (still with the extrapolation to 0 on the right). I just fixed an issue in the post-processing that slowed down the status update a lot (and consequently the plots). Still investigating the issue with EXIT_INIT_FAILURE: in principle there should always be jobs to be picked up by the VMs, apart from temporary issues, which is not the case these days. LHCb does not pre-select jobs to be sent to the community; you pick up jobs from the same 'queue' as all other sites. I'll report asap on this.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,954,549 RAC: 136,930
Got 2 jobs on each of my hosts this morning. Looks OK so far.
Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0
Hi, the issue with the 206 error should also be mitigated now (since yesterday night). It was a glitch on the BOINC server side which prevented workloads from being sent to the VMs even though LHCb had jobs available. Please try to run LHCb jobs! Cheers, Luca
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0
My dedicated LHCb machine has produced nothing but EXIT_INIT_FAILURE 206 since today. Is anything wrong with BOINC again, or is the project out of work right now? Found these lines in the logfile:

2017-10-07 13:49:29 (3059): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0
2017-10-07 13:49:29 (3059): Guest Log: globus_credential: Error reading proxy credential
2017-10-07 13:49:29 (3059): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio
2017-10-07 13:49:29 (3059): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
2017-10-07 13:49:29 (3059): Guest Log: Use -debug for further information.
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] Could not get an x509 credential
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] The x509 proxy creation failed.

Greetz, djoser.

Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
Joined: 14 Jan 10 Posts: 1268 Credit: 8,421,616 RAC: 2,139
2017-10-07 13:49:29 (3059): Guest Log: [ERROR] Could not get an x509 credential

Your VM cannot contact the CERN server because the authentication failed. The problem is on the project side, and since it's the weekend we probably have to wait until Monday. I have the same problem with the Theory tasks.
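For readers who want to verify this diagnosis locally: the "no start line" message in the log above means OpenSSL found no PEM header in the proxy file. A minimal sketch of that check, assuming `openssl` is installed (the path `/tmp/x509up_u0` is the one from the log; adjust for your own VM):

```shell
# Minimal sketch: test whether a proxy file is readable as a PEM
# certificate, mimicking the check that fails in the guest log above.
check_proxy() {
    local proxy_file="${1:-/tmp/x509up_u0}"
    if [ ! -s "$proxy_file" ]; then
        echo "FAIL: $proxy_file is missing or empty"
        return 1
    fi
    # "no start line" in the log means this parse step fails on the VM
    if openssl x509 -in "$proxy_file" -noout 2>/dev/null; then
        echo "OK: $proxy_file is a readable PEM certificate"
    else
        echo "FAIL: $proxy_file cannot be parsed as PEM"
        return 1
    fi
}
```

A FAIL here matches the server-side problem described in this thread: the proxy was never written correctly, so nothing on the client can fix it.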
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0
Thanks for your answer. I wonder which projects are affected. So far I know about LHCb and Theory tasks. Set my machine to no new work for the moment. Regards, djoser.
Joined: 20 Jun 14 Posts: 372 Credit: 238,712 RAC: 0
The problem should now be fixed.
Joined: 28 Sep 04 Posts: 674 Credit: 43,152,472 RAC: 15,698
A lot of task failures with the error in the title. Like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163450255 And also with error 207 (0x000000CF) EXIT_NO_SUB_TASKS. Like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163449631
Joined: 18 Dec 15 Posts: 1686 Credit: 100,395,668 RAC: 102,181
A lot of task failures with the error in the title. Like here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163450255

I've been experiencing exactly the same problems with CMS tasks during the past 4-5 days. In most cases, one could read somewhere in the stderr text of the failed task (in different variations) that the connection to the Condor server was not possible. Since LHCb also needs to connect to the Condor server, these problems won't disappear as long as the Condor server problem is not fixed.
Joined: 18 Dec 15 Posts: 1686 Credit: 100,395,668 RAC: 102,181
Just a few minutes ago, I had three LHCb tasks in a row which errored out after 10-14 minutes, with this stderr:

2017-11-05 17:13:29 (5728): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-11-05 17:13:59 (5728): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2017-11-05 17:13:59 (5728): Guest Log: [DEBUG] 1
2017-11-05 17:13:59 (5728): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2017-11-05 17:13:59 (5728): Guest Log: [INFO] Shutting Down.
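The failing step in the log above is a plain TCP reachability test with `nc`. Anyone wanting to reproduce it outside the VM can use a sketch like the following, assuming `nc` (netcat) is installed; the host and port are the ones visible in the log:

```shell
# Minimal sketch of the connectivity test the VM runs at startup,
# based on the nc call shown in the guest log above.
check_condor() {
    local host="${1:-vccondor01.cern.ch}"
    local port="${2:-9618}"
    # -z: scan without sending data; -w 10: give up after 10 seconds,
    # matching the ~30 s timeout behaviour seen in the log
    if nc -z -w 10 "$host" "$port" 2>/dev/null; then
        echo "OK: reached $host:$port"
    else
        echo "FAIL: could not connect to $host:$port"
        return 1
    fi
}
```

If this fails from your network while the BOINC status pages show other hosts crunching jobs, the problem is more likely a routing/firewall issue between you and CERN than a global Condor outage.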
Joined: 18 Dec 15 Posts: 1686 Credit: 100,395,668 RAC: 102,181
A minute ago, a task errored out after 19 minutes with 207 (0x000000CF) EXIT_NO_SUB_TASKS. More details: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163688697 In fact, all these errors are the same ones I get with CMS tasks. Is there suddenly something wrong with my system(s)? I don't think so, since other crunchers are reporting the same errors. When will someone at LHC look into these problems?
Joined: 18 Dec 15 Posts: 1686 Credit: 100,395,668 RAC: 102,181
And now the next one with 207 (0x000000CF) EXIT_NO_SUB_TASKS, erroring out after 8 minutes. More details: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163692055 Excerpt:

2017-11-05 18:30:40 (1708): Guest Log: [DEBUG] DC_NOP failed!
2017-11-05 18:30:40 (1708): Guest Log: SECMAN:2007:Failed to end classad message.
2017-11-05 18:30:40 (1708): Guest Log: 11/05/17 18:30:31 recognized DC_NOP as command name, using command 60011.
2017-11-05 18:30:40 (1708): Guest Log: 11/05/17 18:30:52 SECMAN: no classad from server, failing
2017-11-05 18:30:43 (1708): Guest Log: [ERROR] Could not ping HTCondor.
2017-11-05 18:30:43 (1708): Guest Log: [INFO] Shutting Down.

What's wrong with the Condor server?
Joined: 28 Sep 04 Posts: 674 Credit: 43,152,472 RAC: 15,698
If you check the graphs of CMS jobs here: https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php and LHCb jobs here: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php, you can see that quite a lot of jobs are being crunched by someone. So the connection to Condor is working for some people. Sadly, many of us seem to get failures most of the time. Not an easy problem to track down, I think. Imagine how much more work could be done if the connection were reliable.
©2024 CERN