Condor exited after 608s without running a job
Author | Message |
---|---|
Joined: 17 Sep 04 Posts: 105 Credit: 32,824,853 RAC: 389 |
What is the story with this?
2016-11-21 21:15:40 (19320): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
2016-11-21 21:15:40 (19320): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2016-11-21 21:15:50 (19320): Guest Log: [INFO] LHCb application starting. Check log files.
2016-11-21 21:15:50 (19320): Guest Log: [DEBUG] HTCondor ping
2016-11-21 21:15:50 (19320): Guest Log: [DEBUG] 0
2016-11-21 21:25:52 (19320): Guest Log: [ERROR] Condor exited after 608s without running a job.
2016-11-21 21:25:52 (19320): VM Completion File Detected.
2016-11-21 21:25:52 (19320): VM Completion Message: Condor exited after 608s without running a job. .
2016-11-21 21:25:52 (19320): Powering off VM.
2016-11-21 21:25:54 (19320): Successfully stopped VM.
2016-11-21 21:25:59 (19320): Deregistering VM. (boinc_3cc9e875450a7dc9, slot#1)
2016-11-21 21:25:59 (19320): Removing virtual disk drive(s) from VM.
2016-11-21 21:25:59 (19320): Removing network bandwidth throttle group from VM.
2016-11-21 21:25:59 (19320): Removing storage controller(s) from VM.
2016-11-21 21:25:59 (19320): Removing VM from VirtualBox.
21:26:04 (19320): called boinc_finish(206)
Regards, Bob P. |
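For anyone who wants to scan their own stderr output for this failure, here is a minimal Python sketch. It assumes the log lines have the same shape as the ones pasted in this post; the function name `find_idle_exits` is made up for illustration and is not part of BOINC or vboxwrapper.

```python
import re

# Matches the guest error line as it appears in the log pasted above.
# Assumption: timestamp, then the process ID in parentheses, then the message.
IDLE_EXIT = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): "
    r"Guest Log: \[ERROR\] Condor exited after (\d+)s without running a job"
)


def find_idle_exits(log_text):
    """Return (timestamp, pid, seconds) for each idle-exit error in log_text."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in IDLE_EXIT.finditer(log_text)
    ]


# Two lines copied from the log above; only the first is a Guest Log error.
sample = (
    "2016-11-21 21:25:52 (19320): Guest Log: [ERROR] Condor exited after 608s "
    "without running a job.\n"
    "2016-11-21 21:25:52 (19320): VM Completion Message: Condor exited after "
    "608s without running a job. .\n"
)
idle_exits = find_idle_exits(sample)
```

The pattern is not anchored to the start of a line, so it should also work on output where the line breaks were lost when pasting.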
Joined: 14 Jan 10 Posts: 1418 Credit: 9,470,586 RAC: 3,147 |
The VM did not get a LHCb-job within 10 minutes after the VM started. To avoid running an idle VM for up to 36 hours (max for LHCb) or 18 hours (max for Theory), the VM is killed to free the CPU-core for other BOINC-tasks. |
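The rule described in this reply can be sketched roughly as follows. This is only an illustration of the idea, not the code that runs inside the VM image: the 600-second limit is taken from the "10 minutes" mentioned above, and `VmState`, `should_shut_down`, and `completion_message` are hypothetical names.

```python
from dataclasses import dataclass

# Assumption: "10 minutes" from the reply above, expressed in seconds.
IDLE_LIMIT_S = 600


@dataclass
class VmState:
    uptime_s: float    # seconds since the VM started
    jobs_started: int  # number of jobs received from HTCondor so far


def should_shut_down(state, idle_limit_s=IDLE_LIMIT_S):
    """True when the VM has been idle past the limit without receiving a job."""
    return state.jobs_started == 0 and state.uptime_s >= idle_limit_s


def completion_message(state):
    """Build a message in the same wording as the one seen in the logs."""
    return f"Condor exited after {int(state.uptime_s)}s without running a job."
```

With these assumptions, a VM that has been up for 608 seconds with no job would be shut down, while one that already received a job would keep running.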
Joined: 17 Sep 04 Posts: 105 Credit: 32,824,853 RAC: 389 |
"The VM did not get a LHCb-job within 10 minutes after the VM started."
Seems kind of inefficient, although I realize it is way more efficient than it could have been! Still seems like there is a ways to go, so that even the 608 seconds does not have to be wasted.
Regards, Bob P. |
Joined: 21 Jun 10 Posts: 40 Credit: 11,331,644 RAC: 7,456 |
Still no response from HTCondor. That must be why the average run time for all Beauty tasks is 0.16 hours. Time to turn this one off for a while. |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
All LHCb tasks seem to fail on the Windows 10 host while CMS tasks work OK. Tullio |
Joined: 1 Sep 04 Posts: 157 Credit: 82,604 RAC: 0 |
Task 108692763
BOINC version 7.6.22, VM 5.1.10r112026, Windows 10 version 1607
2016-11-24 16:59:43 (14244): vboxwrapper (7.7.26196): starting
2016-11-24 16:59:43 (14244): Feature: Checkpoint interval offset (406 seconds)
2016-11-24 16:59:43 (14244): Detected: VirtualBox COM Interface (Version: 5.1.10)
2016-11-24 16:59:43 (14244): Detected: Minimum checkpoint interval (600.000000 seconds)
2016-11-24 16:59:43 (14244): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2016-11-24 16:59:43 (14244): Starting VM. (boinc_e239fc572f3ad269, slot#9)
2016-11-24 17:00:03 (14244): Successfully started VM. (PID = '11892')
2016-11-24 17:00:03 (14244): Reporting VM Process ID to BOINC.
2016-11-24 17:00:13 (14244): Guest Log: [ERROR] Condor exited after 73940s without running a job.
2016-11-24 17:00:13 (14244): Guest Log: [INFO] Shutting Down.
2016-11-24 17:00:13 (14244): VM state change detected. (old = 'poweroff', new = 'running')
2016-11-24 17:00:23 (14244): Detected: Web Application Enabled (http://localhost:51829)
2016-11-24 17:00:34 (14244): VM Completion File Detected.
2016-11-24 17:00:34 (14244): VM Completion Message: Condor exited after 73940s without running a job. .
2016-11-24 17:00:34 (14244): Powering off VM.
2016-11-24 17:00:35 (14244): Successfully stopped VM.
2016-11-24 17:00:40 (14244): Deregistering VM. (boinc_e239fc572f3ad269, slot#9)
2016-11-24 17:00:40 (14244): Removing virtual disk drive(s) from VM.
2016-11-24 17:00:40 (14244): Removing network bandwidth throttle group from VM.
2016-11-24 17:00:40 (14244): Removing storage controller(s) from VM.
2016-11-24 17:00:40 (14244): Removing VM from VirtualBox.
17:00:45 (14244): called boinc_finish(206)
The initial estimated time to crunch the WU was about 5 minutes; after running for a while, the estimate went to over 1 day. |
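One detail in this log is easy to check by hand or with a few lines of Python: the time between "Successfully started VM" and the Guest Log error, compared with the 73940 seconds that the error message reports. The sketch below only does the arithmetic on the timestamps copied from the log; it makes no claim about why the two numbers differ, and `elapsed_seconds` is a helper name made up for this example.

```python
from datetime import datetime

# Timestamp layout used by the vboxwrapper lines in the log above.
FMT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start, end):
    """Whole seconds between two timestamps written in the log's format."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Values copied from the log above.
vm_started = "2016-11-24 17:00:03"    # "Successfully started VM."
error_logged = "2016-11-24 17:00:13"  # "[ERROR] Condor exited after 73940s ..."
reported_s = 73940

wall_clock_s = elapsed_seconds(vm_started, error_logged)
difference_s = reported_s - wall_clock_s
```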
Joined: 1 Sep 04 Posts: 157 Credit: 82,604 RAC: 0 |
4 other tasks were running with a similar problem. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
Exact same problem here (with a Linux host). |
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
There was an issue with the job submission which resulted in there being no jobs. Hopefully this has been fixed. Thanks for reporting this. |
Joined: 21 Jun 10 Posts: 40 Credit: 11,331,644 RAC: 7,456 |
Just tried another one and it looks like the problem still exists. Task number 108722165.
2016-11-24 19:16:39 (3560): Guest Log: [DEBUG] HTCondor ping
Let me know if you need more information. |
Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
This also happens if your network connection is unreliable. My family thinks that Ethernet is so yesterday and would rather place the network router near the center of the house and cover the house with Wi-Fi. There are several problems with Wi-Fi. 2.4 GHz Wi-Fi does penetrate walls, but it is wrecked by poorly-shielded microwave ovens. 5.8 GHz Wi-Fi does not penetrate walls well, and my computer has too many of those walls between it and the router for 5.8 GHz Wi-Fi to run at an acceptable speed. BOINC handles that unreliability well, but LHC@home does not. This is why I put an item in the wish list that work units would be self-contained instead of having to fetch something from a server in the middle of the work unit. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
Seems like the problem still exists:
2016-11-28 08:46:48 (22044): Guest Log: [DEBUG] HTCondor ping
"This also happens if your network connection is unreliable."
My internet connection is very reliable, but the problem still occurs. I think this is a problem "on the server side". |
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
This is a little confusing as both LHCb and Theory are using the same infrastructure. Will investigate ... |
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
It looks like an old image is used here. It needs to be upgraded. |
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
The image has been updated. |
Joined: 21 Jun 10 Posts: 40 Credit: 11,331,644 RAC: 7,456 |
Looks like it is working for me now. The task has made it past the 608 second mark and is using a full CPU thread. Thanks for getting the image updated. Will post again if anything changes. |
Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0 |
I've had several LHCb tasks fail recently due to lack of work (e.g., Task 111867891). I've also had some fail due to Condor connection errors (e.g., Task 111867740). I don't see any issues like this posted on the boards recently and my other hosts running CMS and Theory tasks don't seem to have any problems. Not necessarily a problem for me, but wanted to point it out in case it's indicative of a more significant problem. |
Joined: 28 Sep 04 Posts: 728 Credit: 49,049,262 RAC: 27,051 |
Why did Condor suddenly exit after 48000 seconds on this task https://lhcathome.cern.ch/lhcathome/result.php?resultid=115400084 ? It ran for almost 14 hours and finished several jobs, but then failed suddenly. This resulted in EXIT_INIT_FAILURE and no credit. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I have gotten a couple more of these, and I am concerned that more may follow. https://lhcathome.cern.ch/lhcathome/result.php?resultid=118802423 https://lhcathome.cern.ch/lhcathome/result.php?resultid=118802441 |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
|