Lost connection to shadow

Author	Message
Juha Send message Joined: 22 Mar 17 Posts: 30 Credit: 360,676 RAC: 0	Message 29892 - Posted: 10 Apr 2017, 19:37:14 UTC I have one task that was ostensibly running but not using any CPU time. Looking at logs I see this in StarterLog: 04/10/17 07:54:52 (pid:4070) About to exec Post script: /var/lib/condor/execute/dir_4070/tarOutput.sh 2016-563074-880 04/10/17 07:54:52 (pid:4070) Create_Process succeeded, pid=4974 04/10/17 07:54:52 (pid:4070) Process exited, pid=4974, status=0 04/10/17 07:54:53 (pid:4070) Connection to shadow may be lost, will test by sending whoami request. 04/10/17 07:54:53 (pid:4070) condor_write(): Socket closed when trying to write 37 bytes to <188.184.94.254:9618>, fd is 8 04/10/17 07:54:53 (pid:4070) Buf::write(): condor_write() failed 04/10/17 07:54:53 (pid:4070) i/o error result is 0, errno is 0 04/10/17 07:54:53 (pid:4070) Lost connection to shadow, waiting 86300 secs for reconnect 04/10/17 07:55:41 (pid:4070) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#6718020 And in StartLog: 04/10/17 06:51:10 Changing activity: Idle -> Busy 04/10/17 07:54:28 CCBListener: no activity from CCB server in 3798s; assuming connection is dead. 04/10/17 07:54:28 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds. 04/10/17 07:54:40 PERMISSION DENIED to condor@486149-10452516-20901 from host 10.0.2.15 for command 448 (GIVE_STATE), access level READ: reason: READ authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15 04/10/17 07:54:40 DC_AUTHENTICATE: Command not authorized, done! 04/10/17 07:55:29 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#6718006 BOINC had pre-empted the task for an hour during that time: 2017-04-10 06:51:12 (11184): Guest Log: [INFO] New Job Starting in slot1 2017-04-10 06:51:12 (11184): Guest Log: [INFO] Condor JobID: 2567929.0 in slot1 2017-04-10 06:51:22 (11184): Guest Log: [INFO] MCPlots JobID: 36306186 in slot1 2017-04-10 06:53:42 (11184): VM state change detected. (old = 'running', new = 'paused') 2017-04-10 07:54:16 (11184): VM state change detected. (old = 'paused', new = 'running') 2017-04-10 07:54:56 (11184): Guest Log: [INFO] Job finished in slot1 with 0. I don't know if the task being pre-empted is significant or not. I've had a similar problem at least once before where it seemed that the problem with connecting to shadow appeared shortly after the task was resumed but then again, I have had lots of tasks being pre-empted and resumed without any problems. Anyway, the real killer seems to be the 86300 reconnect interval. That's a bit much in BOINC environment. I'd have it try to reconnect once per minute a few times. The host may have been powered down or suspended and might not have network available just yet. If it can't reconnect after, say, ten minutes just give up and power down the VM. ID: 29892 · Reply Quote

Author

Message

Juha

Send message
Joined: 22 Mar 17
Posts: 30
Credit: 360,676
RAC: 0
Joined ATLAS for over 1800 days

I have one task that was ostensibly running but not using any CPU time. Looking at logs I see this in StarterLog:

04/10/17 07:54:52 (pid:4070) About to exec Post script: /var/lib/condor/execute/dir_4070/tarOutput.sh 2016-563074-880
04/10/17 07:54:52 (pid:4070) Create_Process succeeded, pid=4974
04/10/17 07:54:52 (pid:4070) Process exited, pid=4974, status=0
04/10/17 07:54:53 (pid:4070) Connection to shadow may be lost, will test by sending whoami request.
04/10/17 07:54:53 (pid:4070) condor_write(): Socket closed when trying to write 37 bytes to <188.184.94.254:9618>, fd is 8
04/10/17 07:54:53 (pid:4070) Buf::write(): condor_write() failed
04/10/17 07:54:53 (pid:4070) i/o error result is 0, errno is 0
04/10/17 07:54:53 (pid:4070) Lost connection to shadow, waiting 86300 secs for reconnect
04/10/17 07:55:41 (pid:4070) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#6718020

And in StartLog:

04/10/17 06:51:10 Changing activity: Idle -> Busy
04/10/17 07:54:28 CCBListener: no activity from CCB server in 3798s; assuming connection is dead.
04/10/17 07:54:28 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
04/10/17 07:54:40 PERMISSION DENIED to condor@486149-10452516-20901 from host 10.0.2.15 for command 448 (GIVE_STATE), access level READ: reason: READ authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
04/10/17 07:54:40 DC_AUTHENTICATE: Command not authorized, done!
04/10/17 07:55:29 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#6718006

BOINC had pre-empted the task for an hour during that time:

2017-04-10 06:51:12 (11184): Guest Log: [INFO] New Job Starting in slot1
2017-04-10 06:51:12 (11184): Guest Log: [INFO] Condor JobID:  2567929.0 in slot1
2017-04-10 06:51:22 (11184): Guest Log: [INFO] MCPlots JobID: 36306186 in slot1
2017-04-10 06:53:42 (11184): VM state change detected. (old = 'running', new = 'paused')
2017-04-10 07:54:16 (11184): VM state change detected. (old = 'paused', new = 'running')
2017-04-10 07:54:56 (11184): Guest Log: [INFO] Job finished in slot1 with 0.

I don't know if the task being pre-empted is significant or not. I've had a similar problem at least once before where it seemed that the problem with connecting to shadow appeared shortly after the task was resumed but then again, I have had lots of tasks being pre-empted and resumed without any problems.

Anyway, the real killer seems to be the 86300 reconnect interval. That's a bit much in BOINC environment. I'd have it try to reconnect once per minute a few times. The host may have been powered down or suspended and might not have network available just yet. If it can't reconnect after, say, ten minutes just give up and power down the VM.

ID: 29892 ·

Reply Quote

LHC@home