Lost connection to shadow
I have one task that was ostensibly running but not using any CPU time. Looking at logs I see this in StarterLog:
04/10/17 07:54:52 (pid:4070) About to exec Post script: /var/lib/condor/execute/dir_4070/tarOutput.sh 2016-563074-880 04/10/17 07:54:52 (pid:4070) Create_Process succeeded, pid=4974 04/10/17 07:54:52 (pid:4070) Process exited, pid=4974, status=0 04/10/17 07:54:53 (pid:4070) Connection to shadow may be lost, will test by sending whoami request. 04/10/17 07:54:53 (pid:4070) condor_write(): Socket closed when trying to write 37 bytes to <18.104.22.168:9618>, fd is 8 04/10/17 07:54:53 (pid:4070) Buf::write(): condor_write() failed 04/10/17 07:54:53 (pid:4070) i/o error result is 0, errno is 0 04/10/17 07:54:53 (pid:4070) Lost connection to shadow, waiting 86300 secs for reconnect 04/10/17 07:55:41 (pid:4070) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 22.214.171.124:9618?addrs=126.96.36.199-9618+[2001-1458-301-98--100-99]-9618#6718020
And in StartLog:
04/10/17 06:51:10 Changing activity: Idle -> Busy 04/10/17 07:54:28 CCBListener: no activity from CCB server in 3798s; assuming connection is dead. 04/10/17 07:54:28 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds. 04/10/17 07:54:40 PERMISSION DENIED to condor@486149-10452516-20901 from host 10.0.2.15 for command 448 (GIVE_STATE), access level READ: reason: READ authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15 04/10/17 07:54:40 DC_AUTHENTICATE: Command not authorized, done! 04/10/17 07:55:29 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 188.8.131.52:9618?addrs=184.108.40.206-9618+[2001-1458-301-98--100-99]-9618#6718006
BOINC had pre-empted the task for an hour during that time:
2017-04-10 06:51:12 (11184): Guest Log: [INFO] New Job Starting in slot1 2017-04-10 06:51:12 (11184): Guest Log: [INFO] Condor JobID: 2567929.0 in slot1 2017-04-10 06:51:22 (11184): Guest Log: [INFO] MCPlots JobID: 36306186 in slot1 2017-04-10 06:53:42 (11184): VM state change detected. (old = 'running', new = 'paused') 2017-04-10 07:54:16 (11184): VM state change detected. (old = 'paused', new = 'running') 2017-04-10 07:54:56 (11184): Guest Log: [INFO] Job finished in slot1 with 0.
I don't know if the task being pre-empted is significant or not. I've had a similar problem at least once before where it seemed that the problem with connecting to shadow appeared shortly after the task was resumed but then again, I have had lots of tasks being pre-empted and resumed without any problems.
Anyway, the real killer seems to be the 86300 reconnect interval. That's a bit much in BOINC environment. I'd have it try to reconnect once per minute a few times. The host may have been powered down or suspended and might not have network available just yet. If it can't reconnect after, say, ten minutes just give up and power down the VM.
|ID: 29892 · Reply Quote|