Thread 'Condor exited after 608s without running a job'

Author	Message
rbpeake Send message Joined: 17 Sep 04 Posts: 106 Credit: 36,549,147 RAC: 3	Message 27898 - Posted: 22 Nov 2016, 2:32:50 UTC Last modified: 22 Nov 2016, 2:35:43 UTC
What is the story with this?
2016-11-21 21:15:40 (19320): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
2016-11-21 21:15:40 (19320): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2016-11-21 21:15:50 (19320): Guest Log: [INFO] LHCb application starting. Check log files.
2016-11-21 21:15:50 (19320): Guest Log: [DEBUG] HTCondor ping
2016-11-21 21:15:50 (19320): Guest Log: [DEBUG] 0
2016-11-21 21:25:52 (19320): Guest Log: [ERROR] Condor exited after 608s without running a job.
2016-11-21 21:25:52 (19320): VM Completion File Detected.
2016-11-21 21:25:52 (19320): VM Completion Message: Condor exited after 608s without running a job. .
2016-11-21 21:25:52 (19320): Powering off VM.
2016-11-21 21:25:54 (19320): Successfully stopped VM.
2016-11-21 21:25:59 (19320): Deregistering VM. (boinc_3cc9e875450a7dc9, slot#1)
2016-11-21 21:25:59 (19320): Removing virtual disk drive(s) from VM.
2016-11-21 21:25:59 (19320): Removing network bandwidth throttle group from VM.
2016-11-21 21:25:59 (19320): Removing storage controller(s) from VM.
2016-11-21 21:25:59 (19320): Removing VM from VirtualBox.
21:26:04 (19320): called boinc_finish(206)
</stderr_txt>
]]>
Thanks!
Regards, Bob P.
ID: 27898 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1529 Credit: 10,020,096 RAC: 1,156	Message 27899 - Posted: 22 Nov 2016, 9:04:59 UTC - in response to Message 27898. The VM did not get a LHCb-job within 10 minutes after the VM started. To avoid running an idle VM for up to 36 hours (max for LHCb) or 18 hours (max for Theory), the VM is killed to free the CPU-core for other BOINC-tasks. ID: 27899 · Reply Quote
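The idle check described in the post above can be sketched roughly as follows. This is only an illustration of the idea, not the code that runs inside the LHCb VM: the names `VmState`, `should_shut_down`, and `completion_message` are hypothetical, and the 600-second limit is an assumption taken from the "within 10 minutes" wording.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed limit, based on "within 10 minutes" in the post above.
IDLE_LIMIT_S = 600.0


@dataclass
class VmState:
    """Hypothetical record of when the VM started and when it first got a job."""
    started_at: float                     # seconds, any monotonic clock
    first_job_at: Optional[float] = None  # None means no job has run yet


def should_shut_down(state: VmState, now: float,
                     idle_limit_s: float = IDLE_LIMIT_S) -> bool:
    """Return True if the VM never ran a job and the idle limit has passed."""
    if state.first_job_at is not None:
        return False  # at least one job ran, so this idle check does not apply
    return (now - state.started_at) >= idle_limit_s


def completion_message(state: VmState, now: float) -> str:
    """Build a message in the same shape as the one seen in the logs."""
    elapsed = int(now - state.started_at)
    return f"Condor exited after {elapsed}s without running a job."


if __name__ == "__main__":
    vm = VmState(started_at=0.0)
    if should_shut_down(vm, now=608.0):
        print(completion_message(vm, now=608.0))
```

A check like this would presumably be polled periodically, which could explain why the reported times (608s, 627s, 628s) land slightly past the 10-minute mark rather than exactly on it.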

rbpeake Send message Joined: 17 Sep 04 Posts: 106 Credit: 36,549,147 RAC: 3	Message 27902 - Posted: 22 Nov 2016, 12:41:13 UTC - in response to Message 27899. Last modified: 22 Nov 2016, 12:43:03 UTC
"The VM did not get a LHCb-job within 10 minutes after the VM started. To avoid running an idle VM for up to 36 hours (max for LHCb) or 18 hours (max for Theory), the VM is killed to free the CPU-core for other BOINC-tasks."
Seems kind of inefficient, although I realize it is way more efficient than it could have been! Still seems like there is a ways to go, so that even the 608 seconds does not have to be wasted.
Regards, Bob P.
ID: 27902 · Reply Quote

captainjack Send message Joined: 21 Jun 10 Posts: 44 Credit: 14,759,789 RAC: 3,003	Message 27920 - Posted: 23 Nov 2016, 15:04:00 UTC Still no response from HTCondor. That must be why the average run time for all Beauty tasks is 0.16 hours. Time to turn this one off for a while. ID: 27920 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 27921 - Posted: 23 Nov 2016, 15:10:46 UTC All LHCb tasks seem to fail on the Windows 10 host while CMS tasks work OK. Tullio ID: 27921 · Reply Quote

Thierry Van Driessche Send message Joined: 1 Sep 04 Posts: 157 Credit: 82,604 RAC: 0	Message 27944 - Posted: 24 Nov 2016, 16:11:25 UTC Last modified: 24 Nov 2016, 16:24:17 UTC
Task 108692763
BOINC version 7.6.22, VM 5.1.10r112026, Windows 10 version 1607
2016-11-24 16:59:43 (14244): vboxwrapper (7.7.26196): starting
2016-11-24 16:59:43 (14244): Feature: Checkpoint interval offset (406 seconds)
2016-11-24 16:59:43 (14244): Detected: VirtualBox COM Interface (Version: 5.1.10)
2016-11-24 16:59:43 (14244): Detected: Minimum checkpoint interval (600.000000 seconds)
2016-11-24 16:59:43 (14244): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2016-11-24 16:59:43 (14244): Starting VM. (boinc_e239fc572f3ad269, slot#9)
2016-11-24 17:00:03 (14244): Successfully started VM. (PID = '11892')
2016-11-24 17:00:03 (14244): Reporting VM Process ID to BOINC.
2016-11-24 17:00:13 (14244): Guest Log: [ERROR] Condor exited after 73940s without running a job.
2016-11-24 17:00:13 (14244): Guest Log: [INFO] Shutting Down.
2016-11-24 17:00:13 (14244): VM state change detected. (old = 'poweroff', new = 'running')
2016-11-24 17:00:23 (14244): Detected: Web Application Enabled (http://localhost:51829)
2016-11-24 17:00:34 (14244): VM Completion File Detected.
2016-11-24 17:00:34 (14244): VM Completion Message: Condor exited after 73940s without running a job. .
2016-11-24 17:00:34 (14244): Powering off VM.
2016-11-24 17:00:35 (14244): Successfully stopped VM.
2016-11-24 17:00:40 (14244): Deregistering VM. (boinc_e239fc572f3ad269, slot#9)
2016-11-24 17:00:40 (14244): Removing virtual disk drive(s) from VM.
2016-11-24 17:00:40 (14244): Removing network bandwidth throttle group from VM.
2016-11-24 17:00:40 (14244): Removing storage controller(s) from VM.
2016-11-24 17:00:40 (14244): Removing VM from VirtualBox.
17:00:45 (14244): called boinc_finish(206)
</stderr_txt>
]]>
The initial estimated time to crunch the WU was a good 5 minutes; after running for a while, the estimate went over 1 day.
ID: 27944 · Reply Quote

Thierry Van Driessche Send message Joined: 1 Sep 04 Posts: 157 Credit: 82,604 RAC: 0	Message 27945 - Posted: 24 Nov 2016, 17:16:24 UTC 4 other tasks were running with a similar problem. ID: 27945 · Reply Quote

gyllic Send message Joined: 9 Dec 14 Posts: 202 Credit: 2,659,192 RAC: 13	Message 27946 - Posted: 24 Nov 2016, 20:16:08 UTC Exact same problem here (with a Linux host). ID: 27946 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 252,627 RAC: 320	Message 27949 - Posted: 24 Nov 2016, 21:28:43 UTC - in response to Message 27946. Last modified: 24 Nov 2016, 21:38:05 UTC There was an issue with the job submission which resulted in there being no jobs. Hopefully this has been fixed. Thanks for reporting this. ID: 27949 · Reply Quote

captainjack Send message Joined: 21 Jun 10 Posts: 44 Credit: 14,759,789 RAC: 3,003	Message 27951 - Posted: 25 Nov 2016, 1:32:33 UTC
Just tried another one and it looks like the problem still exists. Task number 108722165.
2016-11-24 19:16:39 (3560): Guest Log: [DEBUG] HTCondor ping
2016-11-24 19:16:49 (3560): Guest Log: [DEBUG] 0
2016-11-24 19:27:10 (3560): Guest Log: [ERROR] Condor exited after 627s without running a job.
2016-11-24 19:27:10 (3560): Guest Log: [INFO] Shutting Down.
2016-11-24 19:27:10 (3560): VM Completion File Detected.
2016-11-24 19:27:10 (3560): VM Completion Message: Condor exited after 627s without running a job.
Let me know if you need more information.
ID: 27951 · Reply Quote
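For anyone comparing many task logs, a small script can pull the idle-exit times out of the stderr text. The sketch below is an assumption-laden illustration: the regular expression is derived only from the log lines quoted in this thread, the function name `find_idle_exits` is made up for the example, and the number in parentheses is assumed to be the wrapper's process ID.

```python
import re
from typing import List, Tuple

# Pattern derived from the "[ERROR] Condor exited after ..." lines quoted in
# this thread; other log variants may not match.
IDLE_EXIT = re.compile(
    r"\((?P<wrapper_id>\d+)\): Guest Log: \[ERROR\] "
    r"Condor exited after (?P<secs>\d+)s without running a job"
)


def find_idle_exits(log_text: str) -> List[Tuple[int, int]]:
    """Return (wrapper_id, seconds) for every idle-exit error in the text."""
    return [
        (int(m.group("wrapper_id")), int(m.group("secs")))
        for m in IDLE_EXIT.finditer(log_text)
    ]


# Sample copied from the post above (task 108722165).
SAMPLE = """\
2016-11-24 19:16:39 (3560): Guest Log: [DEBUG] HTCondor ping
2016-11-24 19:16:49 (3560): Guest Log: [DEBUG] 0
2016-11-24 19:27:10 (3560): Guest Log: [ERROR] Condor exited after 627s without running a job.
2016-11-24 19:27:10 (3560): Guest Log: [INFO] Shutting Down.
2016-11-24 19:27:10 (3560): VM Completion File Detected.
2016-11-24 19:27:10 (3560): VM Completion Message: Condor exited after 627s without running a job.
"""

if __name__ == "__main__":
    for wrapper_id, secs in find_idle_exits(SAMPLE):
        print(f"wrapper {wrapper_id}: idle exit after {secs}s")
```

Only the `Guest Log: [ERROR]` line is matched, so the repeated "VM Completion Message" line is not counted twice. Because the pattern does not anchor on line boundaries, it should also work on logs that were flattened into one line when pasted.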

Jesse Viviano Send message Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 27974 - Posted: 27 Nov 2016, 17:37:18 UTC This also happens if your network connection is unreliable. My family thinks that Ethernet is so yesterday and would rather place the network router near the center of the house and cover the house with Wi-Fi. There are several problems with Wi-Fi. 2.4 GHz Wi-Fi does penetrate walls, but it is wrecked by poorly-shielded microwave ovens. 5.8 GHz Wi-Fi does not penetrate walls well, and my computer has too many of those walls between it and the router for 5.8 GHz Wi-Fi to run at an acceptable speed. BOINC handles that unreliability well, but LHC@home does not. This is why I put an item in the wish list that work units would be self-contained instead of having to fetch something from a server in the middle of the work unit. ID: 27974 · Reply Quote

gyllic Send message Joined: 9 Dec 14 Posts: 202 Credit: 2,659,192 RAC: 13	Message 27982 - Posted: 28 Nov 2016, 8:11:56 UTC - in response to Message 27949.
Seems like the problem still exists:
2016-11-28 08:46:48 (22044): Guest Log: [DEBUG] HTCondor ping
2016-11-28 08:46:49 (22044): Guest Log: [DEBUG] 0
2016-11-28 08:57:21 (22044): Guest Log: [ERROR] Condor exited after 628s without running a job.
2016-11-28 08:57:21 (22044): Guest Log: [INFO] Shutting Down.
2016-11-28 08:57:21 (22044): VM Completion File Detected.
2016-11-28 08:57:21 (22044): VM Completion Message: Condor exited after 628s without running a job. .
2016-11-28 08:57:21 (22044): Powering off VM.
2016-11-28 08:57:21 (22044): Successfully stopped VM.
"This also happens if your network connection is unreliable."
My internet connection is very reliable, but the problem still occurs. I think this is a problem "on the server side".
ID: 27982 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 252,627 RAC: 320	Message 27983 - Posted: 28 Nov 2016, 10:19:20 UTC - in response to Message 27982. This is a little confusing as both LHCb and Theory are using the same infrastructure. Will investigate ... ID: 27983 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 252,627 RAC: 320	Message 27986 - Posted: 28 Nov 2016, 12:51:46 UTC - in response to Message 27983. It looks like an old image is used here. It needs to be upgraded. ID: 27986 · Reply Quote

Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 252,627 RAC: 320	Message 27987 - Posted: 28 Nov 2016, 15:12:17 UTC - in response to Message 27986. The image has been updated. ID: 27987 · Reply Quote

captainjack Send message Joined: 21 Jun 10 Posts: 44 Credit: 14,759,789 RAC: 3,003	Message 27989 - Posted: 28 Nov 2016, 17:06:14 UTC Looks like it is working for me now. The task has made it past the 608 second mark and is using a full CPU thread. Thanks for getting the image updated. Will post again if anything changes. ID: 27989 · Reply Quote

ritterm Send message Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0	Message 28470 - Posted: 13 Jan 2017, 14:43:06 UTC I've had several LHCb tasks fail recently due to lack of work (e.g., Task 111867891). I've also had some fail due to Condor connection errors (e.g., Task 111867740). I don't see any issues like this posted on the boards recently and my other hosts running CMS and Theory tasks don't seem to have any problems. Not necessarily a problem for me, but wanted to point it out in case it's indicative of a more significant problem. ID: 28470 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 795 Credit: 64,422,582 RAC: 33,111	Message 28670 - Posted: 25 Jan 2017, 18:59:36 UTC Why did Condor suddenly exit after 48000 seconds on this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=115400084 ? It ran for almost 14 hours and finished several jobs, but then failed suddenly. This resulted in EXIT_INIT_FAILURE and no credit. ID: 28670 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 28801 - Posted: 6 Feb 2017, 16:19:20 UTC I have gotten a couple more of these, and I am concerned that more may follow. https://lhcathome.cern.ch/lhcathome/result.php?resultid=118802423 https://lhcathome.cern.ch/lhcathome/result.php?resultid=118802441 ID: 28801 · Reply Quote

gyllic Send message Joined: 9 Dec 14 Posts: 202 Credit: 2,659,192 RAC: 13	Message 28802 - Posted: 6 Feb 2017, 18:58:06 UTC same here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=118822828 https://lhcathome.cern.ch/lhcathome/result.php?resultid=118713230 https://lhcathome.cern.ch/lhcathome/result.php?resultid=118801990 ID: 28802 · Reply Quote