Message boards : LHCb Application : Condor exited after 608s without running a job

rbpeake
Joined: 17 Sep 04
Posts: 99
Credit: 30,621,149
RAC: 3,802
Message 27898 - Posted: 22 Nov 2016, 2:32:50 UTC
Last modified: 22 Nov 2016, 2:35:43 UTC

What is the story with this?

    2016-11-21 21:15:40 (19320): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
    2016-11-21 21:15:40 (19320): Guest Log: [INFO] Requesting an X509 credential from LHC@home
    2016-11-21 21:15:50 (19320): Guest Log: [INFO] LHCb application starting. Check log files.
    2016-11-21 21:15:50 (19320): Guest Log: [DEBUG] HTCondor ping
    2016-11-21 21:15:50 (19320): Guest Log: [DEBUG] 0
    2016-11-21 21:25:52 (19320): Guest Log: [ERROR] Condor exited after 608s without running a job.
    2016-11-21 21:25:52 (19320): VM Completion File Detected.
    2016-11-21 21:25:52 (19320): VM Completion Message: Condor exited after 608s without running a job.
    .
    2016-11-21 21:25:52 (19320): Powering off VM.
    2016-11-21 21:25:54 (19320): Successfully stopped VM.
    2016-11-21 21:25:59 (19320): Deregistering VM. (boinc_3cc9e875450a7dc9, slot#1)
    2016-11-21 21:25:59 (19320): Removing virtual disk drive(s) from VM.
    2016-11-21 21:25:59 (19320): Removing network bandwidth throttle group from VM.
    2016-11-21 21:25:59 (19320): Removing storage controller(s) from VM.
    2016-11-21 21:25:59 (19320): Removing VM from VirtualBox.
    21:26:04 (19320): called boinc_finish(206)

    </stderr_txt>
    ]]>


Thanks!


Regards,
Bob P.
ID: 27898
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,800
RAC: 1,930
Message 27899 - Posted: 22 Nov 2016, 9:04:59 UTC - in response to Message 27898.  

The VM did not get a LHCb-job within 10 minutes after the VM started.
To avoid running an idle VM for up to 36 hours (max for LHCb) or 18 hours (max for Theory), the VM is killed to free the CPU-core for other BOINC-tasks.
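
A rough sketch of that idle-timeout rule, for illustration only. This is not the actual code running inside the LHCb VM; the function name `watch`, the polling interval, and the `job_arrived` callback are all assumptions. Only the 10-minute limit and the wording of the message come from this thread.

```python
import time

IDLE_LIMIT_S = 600  # the 10-minute limit described above


def watch(job_arrived, clock=time.monotonic, sleep=time.sleep,
          poll_s=10, idle_limit=IDLE_LIMIT_S):
    """Poll until a job shows up or the idle limit is reached.

    job_arrived is a hypothetical callable returning True once a job
    has been received; clock and sleep are injectable for testing.
    """
    started = clock()
    while True:
        if job_arrived():
            return "job started"
        elapsed = clock() - started
        if elapsed >= idle_limit:
            # Same wording as the message seen in the task's stderr output.
            return f"Condor exited after {int(elapsed)}s without running a job."
        sleep(poll_s)
```

Because a check like this only runs once per polling interval, the reported time can overshoot the limit a little, which would be consistent with seeing 608s, 627s or 628s rather than exactly 600s. That explanation is a guess, not something confirmed by the project.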
ID: 27899
rbpeake
Joined: 17 Sep 04
Posts: 99
Credit: 30,621,149
RAC: 3,802
Message 27902 - Posted: 22 Nov 2016, 12:41:13 UTC - in response to Message 27899.  
Last modified: 22 Nov 2016, 12:43:03 UTC

The VM did not get a LHCb-job within 10 minutes after the VM started.
To avoid running an idle VM for up to 36 hours (max for LHCb) or 18 hours (max for Theory), the VM is killed to free the CPU-core for other BOINC-tasks.

Seems kind of inefficient, although I realize it is way more efficient than it could have been! Still seems like there is a ways to go, so that even the 608 seconds does not have to be wasted.
Regards,
Bob P.
ID: 27902
captainjack
Joined: 21 Jun 10
Posts: 40
Credit: 10,608,629
RAC: 10,627
Message 27920 - Posted: 23 Nov 2016, 15:04:00 UTC

Still no response from HTCondor. That must be why the average run time for all Beauty tasks is 0.16 hours.
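
A quick check of that arithmetic, as a sketch only. The 0.16 hours and the 608 seconds are the figures quoted in this thread; the helper name is made up.

```python
def hours_to_seconds(hours):
    """Convert a run time in hours to seconds."""
    return hours * 3600


avg_runtime_s = round(hours_to_seconds(0.16))  # average run time quoted above
idle_exit_s = 608                              # from the Condor error message

print(f"average run time: {avg_runtime_s} s, idle exit: {idle_exit_s} s")
print(f"difference: {abs(idle_exit_s - avg_runtime_s)} s")
```

The 0.16 figure is rounded to two decimals, so the two numbers are only roughly comparable, but they are in the same ballpark.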

Time to turn this one off for a while.
ID: 27920
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 27921 - Posted: 23 Nov 2016, 15:10:46 UTC

All LHCb tasks seem to fail on the Windows 10 host while CMS tasks work OK.
Tullio
ID: 27921
Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 27944 - Posted: 24 Nov 2016, 16:11:25 UTC
Last modified: 24 Nov 2016, 16:24:17 UTC

Task 108692763

BOINC version 7.6.22, VirtualBox 5.1.10r112026, Windows 10 version 1607

2016-11-24 16:59:43 (14244): vboxwrapper (7.7.26196): starting
2016-11-24 16:59:43 (14244): Feature: Checkpoint interval offset (406 seconds)
2016-11-24 16:59:43 (14244): Detected: VirtualBox COM Interface (Version: 5.1.10)
2016-11-24 16:59:43 (14244): Detected: Minimum checkpoint interval (600.000000 seconds)
2016-11-24 16:59:43 (14244): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2016-11-24 16:59:43 (14244): Starting VM. (boinc_e239fc572f3ad269, slot#9)
2016-11-24 17:00:03 (14244): Successfully started VM. (PID = '11892')
2016-11-24 17:00:03 (14244): Reporting VM Process ID to BOINC.
2016-11-24 17:00:13 (14244): Guest Log: [ERROR] Condor exited after 73940s without running a job.
2016-11-24 17:00:13 (14244): Guest Log: [INFO] Shutting Down.
2016-11-24 17:00:13 (14244): VM state change detected. (old = 'poweroff', new = 'running')
2016-11-24 17:00:23 (14244): Detected: Web Application Enabled (http://localhost:51829)
2016-11-24 17:00:34 (14244): VM Completion File Detected.
2016-11-24 17:00:34 (14244): VM Completion Message: Condor exited after 73940s without running a job.
.
2016-11-24 17:00:34 (14244): Powering off VM.
2016-11-24 17:00:35 (14244): Successfully stopped VM.
2016-11-24 17:00:40 (14244): Deregistering VM. (boinc_e239fc572f3ad269, slot#9)
2016-11-24 17:00:40 (14244): Removing virtual disk drive(s) from VM.
2016-11-24 17:00:40 (14244): Removing network bandwidth throttle group from VM.
2016-11-24 17:00:40 (14244): Removing storage controller(s) from VM.
2016-11-24 17:00:40 (14244): Removing VM from VirtualBox.
17:00:45 (14244): called boinc_finish(206)

</stderr_txt>
]]>


The initial estimated time to crunch the WU was a good 5 minutes; after running for a while, the estimate went over 1 day.
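
For anyone who wants to check their own stderr output for the same mismatch, here is a small sketch. The regular expressions and the function name are my own, based only on the log lines quoted in this thread; it is not a BOINC or vboxwrapper tool.

```python
import re
from datetime import datetime

# Matches lines such as "2016-11-24 17:00:03 (14244): Successfully started VM."
LINE_RE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")
EXIT_RE = re.compile(r"\[ERROR\] Condor exited after (\d+)s without running a job")

SAMPLE = """\
2016-11-24 16:59:43 (14244): Starting VM. (boinc_e239fc572f3ad269, slot#9)
2016-11-24 17:00:03 (14244): Successfully started VM. (PID = '11892')
2016-11-24 17:00:13 (14244): Guest Log: [ERROR] Condor exited after 73940s without running a job.
17:00:45 (14244): called boinc_finish(206)
"""


def compare_reported_to_wall_clock(log_text):
    """Return (seconds reported by Condor, seconds since VM start), or None.

    The second value is None if no "Successfully started VM" line was seen
    before the error line.
    """
    started = None
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw)
        if not m:
            continue  # skip lines without a full date stamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        msg = m.group(3)
        if "Successfully started VM" in msg:
            started = ts
        hit = EXIT_RE.search(msg)
        if hit:
            wall = (ts - started).total_seconds() if started else None
            return int(hit.group(1)), wall
    return None


print(compare_reported_to_wall_clock(SAMPLE))
```

On the excerpt above, the message reports 73940 seconds although only about 10 seconds passed between the VM start and the error line. The sketch only exposes that mismatch; it does not explain it.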
ID: 27944
Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 27945 - Posted: 24 Nov 2016, 17:16:24 UTC

4 other tasks were running with a similar problem.
ID: 27945
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 27946 - Posted: 24 Nov 2016, 20:16:08 UTC

Exact same problem here (with a Linux host).
ID: 27946
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 27949 - Posted: 24 Nov 2016, 21:28:43 UTC - in response to Message 27946.  
Last modified: 24 Nov 2016, 21:38:05 UTC

There was an issue with the job submission, which resulted in there being no jobs. Hopefully this has been fixed. Thanks for reporting this.
ID: 27949
captainjack
Joined: 21 Jun 10
Posts: 40
Credit: 10,608,629
RAC: 10,627
Message 27951 - Posted: 25 Nov 2016, 1:32:33 UTC

Just tried another one and it looks like the problem still exists. Task number 108722165.

2016-11-24 19:16:39 (3560): Guest Log: [DEBUG] HTCondor ping
2016-11-24 19:16:49 (3560): Guest Log: [DEBUG] 0
2016-11-24 19:27:10 (3560): Guest Log: [ERROR] Condor exited after 627s without running a job.
2016-11-24 19:27:10 (3560): Guest Log: [INFO] Shutting Down.
2016-11-24 19:27:10 (3560): VM Completion File Detected.
2016-11-24 19:27:10 (3560): VM Completion Message: Condor exited after 627s without running a job.


Let me know if you need more information.
ID: 27951
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 27974 - Posted: 27 Nov 2016, 17:37:18 UTC

This also happens if your network connection is unreliable. My family thinks that Ethernet is so yesterday and would rather place the network router near the center of the house and cover the house with Wi-Fi. There are several problems with Wi-Fi. 2.4 GHz Wi-Fi does penetrate walls, but it is wrecked by poorly-shielded microwave ovens. 5.8 GHz Wi-Fi does not penetrate walls well, and my computer has too many of those walls between it and the router for 5.8 GHz Wi-Fi to run at an acceptable speed. BOINC handles that unreliability well, but LHC@home does not. This is why I put an item in the wish list that work units would be self-contained instead of having to fetch something from a server in the middle of the work unit.
ID: 27974
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 27982 - Posted: 28 Nov 2016, 8:11:56 UTC - in response to Message 27949.  

Seems like the problem still exists:

2016-11-28 08:46:48 (22044): Guest Log: [DEBUG] HTCondor ping
2016-11-28 08:46:49 (22044): Guest Log: [DEBUG] 0
2016-11-28 08:57:21 (22044): Guest Log: [ERROR] Condor exited after 628s without running a job.
2016-11-28 08:57:21 (22044): Guest Log: [INFO] Shutting Down.
2016-11-28 08:57:21 (22044): VM Completion File Detected.
2016-11-28 08:57:21 (22044): VM Completion Message: Condor exited after 628s without running a job.
.
2016-11-28 08:57:21 (22044): Powering off VM.
2016-11-28 08:57:21 (22044): Successfully stopped VM.


This also happens if your network connection is unreliable.

My internet connection is very reliable, but the problem still occurs. I think this is a problem "on the server side".
ID: 27982
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 27983 - Posted: 28 Nov 2016, 10:19:20 UTC - in response to Message 27982.  

This is a little confusing as both LHCb and Theory are using the same infrastructure. Will investigate ...
ID: 27983
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 27986 - Posted: 28 Nov 2016, 12:51:46 UTC - in response to Message 27983.  

It looks like an old image is being used here. It needs to be upgraded.
ID: 27986
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 27987 - Posted: 28 Nov 2016, 15:12:17 UTC - in response to Message 27986.  

The image has been updated.
ID: 27987
captainjack
Joined: 21 Jun 10
Posts: 40
Credit: 10,608,629
RAC: 10,627
Message 27989 - Posted: 28 Nov 2016, 17:06:14 UTC

Looks like it is working for me now. The task has made it past the 608 second mark and is using a full CPU thread.

Thanks for getting the image updated.

Will post again if anything changes.
ID: 27989
ritterm
Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 28470 - Posted: 13 Jan 2017, 14:43:06 UTC

I've had several LHCb tasks fail recently due to lack of work (e.g., Task 111867891). I've also had some fail due to Condor connection errors (e.g., Task 111867740).

I don't see any issues like this posted on the boards recently and my other hosts running CMS and Theory tasks don't seem to have any problems. Not necessarily a problem for me, but wanted to point it out in case it's indicative of a more significant problem.
ID: 28470
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,167,342
RAC: 16,168
Message 28670 - Posted: 25 Jan 2017, 18:59:36 UTC

Why did Condor suddenly exit after 48000 seconds on this task https://lhcathome.cern.ch/lhcathome/result.php?resultid=115400084 ?

It ran for almost 14 hours and finished several jobs, but then failed suddenly. This resulted in EXIT_INIT_FAILURE and no credit.
ID: 28670
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 28801 - Posted: 6 Feb 2017, 16:19:20 UTC

ID: 28801
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 28802 - Posted: 6 Feb 2017, 18:58:06 UTC

ID: 28802