Message boards : Theory Application : Latest versions do not resume after longer suspend
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
@Laurence: Reported this at LHC-dev as well on 8 Sep 2018. Any chance the suspend/resume will be fixed in a next version, as reported in https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=418&postid=5459 ?

Tested this again here and noticed that after the resume the events are happily processed until the processes are killed after 40 minutes.

Snip of MasterLog:

09/24/18 16:40:19 Setting ready state 'Ready' for STARTD
09/24/18 17:40:14 Preen pid is 11497
09/24/18 17:40:15 DefaultReaper unexpectedly called on pid 11497, status 0.
09/25/18 07:29:11 CCBListener: no activity from CCB server in 32031s; assuming connection is dead.
09/25/18 07:29:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:42 condor_write(): Socket closed when trying to write 2319 bytes to collector vccondor01.cern.ch, fd is 12, errno=104 Connection reset by peer
09/25/18 07:29:42 Buf::write(): condor_write() failed
09/25/18 07:30:11 WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/25/18 07:30:12 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43666166

Snip of StartLog:

09/24/18 21:28:19 Got universe "VANILLA" (5) from request classad
09/24/18 21:28:19 State change: claim-activation protocol successful
09/24/18 21:28:19 Changing activity: Idle -> Busy
09/25/18 07:29:11 CCBListener: no activity from CCB server in 32151s; assuming connection is dead.
09/25/18 07:29:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:14 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 8, errno=104 Connection reset by peer
09/25/18 07:29:14 Buf::write(): condor_write() failed
09/25/18 07:30:11 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43666169

Snip of StarterLog:

09/24/18 21:28:22 (pid:7977) Starting a VANILLA universe job with ID: 472806.1380
09/24/18 21:28:22 (pid:7977) IWD: /var/lib/condor/execute/dir_7977
09/24/18 21:28:22 (pid:7977) Renice expr "10" evaluated to 10
09/24/18 21:28:22 (pid:7977) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_7977/condor_exec.exe
09/24/18 21:28:22 (pid:7977) Running job as user nobody
09/24/18 21:28:22 (pid:7977) Create_Process succeeded, pid=7981
09/25/18 07:29:40 (pid:7977) CCBListener: no activity from CCB server in 32174s; assuming connection is dead.
09/25/18 07:29:40 (pid:7977) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:40 (pid:7977) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/25/18 07:29:40 (pid:7977) IO: Failed to read packet header
09/25/18 07:29:40 (pid:7977) Lost connection to shadow, waiting 2400 secs for reconnect
09/25/18 07:30:40 (pid:7977) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/25/18 07:30:42 (pid:7977) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43672198
09/25/18 08:09:40 (pid:7977) No reconnect from shadow for 2400 seconds, aborting job execution!
09/25/18 08:09:40 (pid:7977) ShutdownFast all jobs.
09/25/18 08:09:40 (pid:7977) Process exited, pid=7981, signal=9
09/25/18 08:09:40 (pid:7977) About to exec Post script: /var/lib/condor/execute/dir_7977/post.sh 2110-625008-582
09/25/18 08:09:40 (pid:7977) Create_Process succeeded, pid=18022
09/25/18 08:09:41 (pid:7977) Process exited, pid=18022, status=0
09/25/18 08:09:42 (pid:7977) Failed to send job exit status to shadow

Note that the job (pid 7981) is killed with signal 9 exactly 2400 seconds, i.e. 40 minutes, after the connection to the shadow was lost at 07:29:40.

Last lines of running.log after happily processing for 40 minutes after the resume:

1000 events processed
dumping histograms...
Event 1100 ( 1m 34s elapsed / 2h 21m 13s left ) -> ETA: Tue Sep 25 10:30
1100 events processed
Event 1200 ( 1m 43s elapsed / 2h 22m 1s left ) -> ETA: Tue Sep 25 10:31
1200 events processed
Event 1300 ( 1m 53s elapsed / 2h 23m 15s left ) -> ETA: Tue Sep 25 10:32
1300 events processed

Thereafter nothing happens: no re-connections, no new job, no proper VM shutdown, no VM kill, only an idle VM until BOINC's 18-hour limit kicks in and kills the VM.
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
> Snip of StarterLog:
> 09/25/18 08:09:40 (pid:7977) No reconnect from shadow for 2400 seconds, aborting job execution!

I believe this is the smoking gun. The JobLeaseDuration was too short at the default of 2400s. New jobs should have this value set to 48 hours, but we need to get through the old jobs before we can test again.
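For reference, JobLeaseDuration is a per-job ClassAd attribute that can be set in the HTCondor submit description file; HTCondor's shipped default is 2400 s, i.e. 40 minutes. A minimal sketch of what the fix could look like on the submit side - everything here except job_lease_duration is an illustrative placeholder, not the project's actual submit file:

    # Hypothetical Theory submit description file (illustrative sketch only).
    universe           = vanilla
    executable         = run_theory_job.sh   # placeholder name, not confirmed above
    # Default lease is 2400 s (40 min): the starter aborts the job if the
    # shadow does not reconnect within that window after a suspend.
    # Raise it to 48 hours so an overnight/one-day suspend survives:
    job_lease_duration = 172800               # 48 * 3600 seconds
    queue

With this in place the starter-side log line becomes "waiting 172800 secs for reconnect" instead of "waiting 2400 secs", which is exactly what the later tests in this thread show.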
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
Just 48 hours and then the job lease expires? With a 30-day deadline it's possible for a Theory task to be preempted by other projects for more than 48 hours. It seems like the job lease duration should be 30 days.
Joined: 18 Dec 15 Posts: 1691 Credit: 104,607,009 RAC: 101,315
> Just 48 hours and then the job lease expires?

In fact, the 2400 seconds cited above is actually only 40 minutes - so even worse.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
Don't mix up task duration and job duration. BOINC's task expiration time is indeed 30 days, and normally a Theory task needs between 12 and 18 hours of elapsed time. Within the VM the real jobs are running, and their run-times vary a lot, but 80.8% of the jobs are ready within 100 minutes and 95% have finished within 200 minutes, so a JobLeaseDuration of 48 hours would be sufficient for most suspension periods.
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
> Don't mix up task duration and job duration.

I don't think I have mixed them up, but it's difficult to know whether you are responding to me or to Erich56, since your response "header" does not have an "in response to Message xxxxxxxxx" element.

> BOINC's task expiration time is indeed 30 days, and normally a Theory task needs between 12 and 18 hours of elapsed time.

I have studied the finished.log and running.log files extensively, so I understand that.

> but 80.8% of the jobs are ready within 100 minutes and 95% have finished within 200 minutes

That I was not aware of, thank you.

> so a JobLeaseDuration of 48 hours would be sufficient for most suspension periods

Why make the number sufficient for 95% when it could be made sufficient for 99.9% with no additional effort? Is there a rational reason for cutting off that 5%?
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> Why make the number sufficient for 95% when it could be made sufficient for 99.9% with no additional effort? Is there a rational reason for cutting off that 5%?

You misunderstand: 48 hours will not cut off 5%. 48 hours will be sufficient for over 99.99% of the jobs - I suppose so, though without hard evidence. The 95% is the percentage of all jobs that are ready within 200 minutes. Of 5.7 million jobs, only 7 have a run-time longer than 1 day (between 1470 minutes and 2055 minutes), and 5.63 million jobs are ready within 6 hours. 48 hours will not be sufficient for those crunchers suspending a Theory task for longer than 2 days, but I think they are very, very rare.
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
> 48 hours will not be sufficient for those crunchers suspending a Theory task for longer than 2 days, but I think they are very, very rare.

Why cut off any job that isn't looping/stalled and has not run into the 18-hour run limit or the 30-day deadline? Why? Does it make the database appreciably smaller? The fact that it cuts off very, very few doesn't make any sense unless there is some benefit to be derived from doing so.
Joined: 15 Jun 08 Posts: 2425 Credit: 227,475,238 RAC: 129,792
> Why cut off any job that isn't looping/stalled ...

Because it may not be possible to detect all kinds of endless loops. You may treat it like a car airbag: most people will never need it, but some are very happy that they had one. Well, I guess I know your answer: "I never needed an airbag since I got my driving licence on one of Pharaoh Ramses' brand-new chariots."
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
> Why cut off any job that isn't looping/stalled ...

If a job is looping, it will most certainly end at the 18-hour task limit when the VM is destroyed. We all know that, so why even bother stating the obvious? To deflect from the fact that you cannot answer my question and give me a REAL reason to cut off a job that may be viable, simply because it has been suspended for more than 2 days?

> You may treat it like a car airbag.

An airbag is a good analogy. You would deploy the airbag even when there is no collision detected.

> Most people will never need it, but some are very happy that they had one.

Happy when the airbag deploys in a collision. Not so happy when it deploys when there is no collision.

> Well, I guess I know your answer: ...

ROFLMAO. Wrong again.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> ... give me a REAL reason to cut off a job that may be viable, simply because it has been suspended for more than 2 days?

I can't give the REAL reason, because I don't know all the ins and outs of CERN's processing flows, but I can imagine that if one had to wait for 30 days, too many connections would stay open. 48 hours is arbitrary, but 40 minutes was really too short for us BOINC crunchers. Keep in mind that most Theory jobs are not done by BOINC.
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
Yes, 48 hours is much better than 40 minutes, and too many connections staying open makes sense to me as a reason. Thanks.
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
Yes, 48 hours is arbitrary. The use case is supporting an overnight suspend/resume, and we can extend this to one day, so 48 hours should be long enough to support a 24-hour suspend/resume. In general we do want the results of a specific job, so there needs to be a cutoff period after which we consider the job lost.
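Jobs that don't set a lease themselves fall back to the schedd's JOB_DEFAULT_LEASE_DURATION configuration macro (which ships as 2400 s). A hedged sketch of how the same 48-hour cutoff could be applied globally on the submit side instead of per job - this assumes the project would rather change the condor_config than every submit file, which is not confirmed in the thread:

    # Hypothetical condor_config fragment on the CERN submit side (sketch only).
    # Any job without an explicit job_lease_duration picks up this default;
    # the stock value of 2400 s (40 min) is what killed the suspended jobs above.
    JOB_DEFAULT_LEASE_DURATION = 172800   # 48 hours, matching the stated use case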
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
Am testing the suspend/resume now.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> Am testing the suspend/resume now.

The jobs with the old setting are gone:

09/26/18 11:21:57 (pid:4348) CCBListener: no activity from CCB server in 4949s; assuming connection is dead.
09/26/18 11:21:57 (pid:4348) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 11:21:57 (pid:4348) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 11:21:57 (pid:4348) IO: Failed to read packet header
09/26/18 11:21:57 (pid:4348) Lost connection to shadow, waiting 172800 secs for reconnect

The interesting question, of course, is whether jobs finished after the resume arrive and are used.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> The interesting question, of course, is whether jobs finished after the resume arrive and are used.

The first job after the resume has finished, but so far I haven't received a new job, and after 10 minutes the VM was killed.

09/26/18 09:39:29 (pid:4348) Running job as user nobody
09/26/18 09:39:29 (pid:4348) Create_Process succeeded, pid=4352
09/26/18 11:21:57 (pid:4348) CCBListener: no activity from CCB server in 4949s; assuming connection is dead.
09/26/18 11:21:57 (pid:4348) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 11:21:57 (pid:4348) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 11:21:57 (pid:4348) IO: Failed to read packet header
09/26/18 11:21:57 (pid:4348) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 11:22:58 (pid:4348) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/26/18 11:22:58 (pid:4348) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43717470
09/26/18 11:23:03 (pid:4348) Accepted request to reconnect from <:0>
09/26/18 11:23:03 (pid:4348) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219360>
09/26/18 11:23:03 (pid:4348) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219360>
09/26/18 11:23:03 (pid:4348) Recovered connection to shadow after 66 seconds
09/26/18 11:41:01 (pid:4348) Process exited, pid=4352, status=0
09/26/18 11:41:01 (pid:4348) About to exec Post script: /var/lib/condor/execute/dir_4348/post.sh 2110-611195-584
09/26/18 11:41:01 (pid:4348) Create_Process succeeded, pid=12979
09/26/18 11:41:02 (pid:4348) Process exited, pid=12979, status=0
09/26/18 11:41:03 (pid:4348) Got SIGQUIT. Performing fast shutdown.
09/26/18 11:41:03 (pid:4348) ShutdownFast all jobs.
09/26/18 11:41:03 (pid:4348) **** condor_starter (condor_STARTER) pid 4348 EXITING WITH STATUS 0
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
I got a new job.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
With the next try after a longer suspend and resume, a new job arrived after condor_starter was started up again.

09/26/18 10:09:26 (pid:4349) Create_Process succeeded, pid=4353
09/26/18 12:59:44 (pid:4349) CCBListener: no activity from CCB server in 5718s; assuming connection is dead.
09/26/18 12:59:44 (pid:4349) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 12:59:44 (pid:4349) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 12:59:44 (pid:4349) IO: Failed to read packet header
09/26/18 12:59:44 (pid:4349) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 13:00:45 (pid:4349) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/26/18 13:00:45 (pid:4349) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43718333
09/26/18 13:01:27 (pid:4349) Accepted request to reconnect from <:0>
09/26/18 13:01:27 (pid:4349) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219914>
09/26/18 13:01:27 (pid:4349) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219914>
09/26/18 13:01:27 (pid:4349) Recovered connection to shadow after 103 seconds
09/26/18 13:24:27 (pid:4349) Process exited, pid=4353, status=0
09/26/18 13:24:27 (pid:4349) About to exec Post script: /var/lib/condor/execute/dir_4349/post.sh 2110-617475-584
09/26/18 13:24:27 (pid:4349) Create_Process succeeded, pid=14955
09/26/18 13:24:27 (pid:4349) Process exited, pid=14955, status=0
09/26/18 13:24:29 (pid:4349) Got SIGQUIT. Performing fast shutdown.
09/26/18 13:24:29 (pid:4349) ShutdownFast all jobs.
09/26/18 13:24:29 (pid:4349) **** condor_starter (condor_STARTER) pid 4349 EXITING WITH STATUS 0
09/26/18 13:24:29 (pid:14959) ******************************************************
09/26/18 13:24:29 (pid:14959) ** condor_starter (CONDOR_STARTER) STARTING UP
09/26/18 13:24:29 (pid:14959) ** /usr/sbin/condor_starter
09/26/18 13:24:29 (pid:14959) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
09/26/18 13:24:29 (pid:14959) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
09/26/18 13:24:29 (pid:14959) ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
09/26/18 13:24:29 (pid:14959) ** $CondorPlatform: x86_64_RedHat6 $
09/26/18 13:24:29 (pid:14959) ** PID = 14959
09/26/18 13:24:29 (pid:14959) ** Log last touched 9/26 13:24:29
09/26/18 13:24:29 (pid:14959) ******************************************************
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
I received a new Theory task at about 2:59 this morning. I allowed it to run for several minutes so it would be assigned a Condor ID, then I suspended it. At approximately 9:10 I resumed the task, and it seems to have resumed the same job that was suspended at 2:59. See the snippet from StarterLog below.

09/26/18 02:59:40 (pid:4260) IWD: /var/lib/condor/execute/dir_4260
09/26/18 02:59:40 (pid:4260) Renice expr "10" evaluated to 10
09/26/18 02:59:40 (pid:4260) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_4260/condor_exec.exe
09/26/18 02:59:40 (pid:4260) Running job as user nobody
09/26/18 02:59:41 (pid:4260) Create_Process succeeded, pid=4264
09/26/18 09:12:59 (pid:4260) CCBListener: no activity from CCB server in 21202s; assuming connection is dead.
09/26/18 09:12:59 (pid:4260) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 09:12:59 (pid:4260) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 09:12:59 (pid:4260) IO: Failed to read packet header
09/26/18 09:12:59 (pid:4260) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 09:14:02 (pid:4260) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43719876
09/26/18 09:15:09 (pid:4260) Accepted request to reconnect from <:0>
09/26/18 09:15:09 (pid:4260) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_220981>
09/26/18 09:15:09 (pid:4260) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_220981>
09/26/18 09:15:09 (pid:4260) Recovered connection to shadow after 130 seconds

edit: I didn't think to record the event number before suspending the task, but the job seems to have resumed at about where I estimate the event number would have been when I suspended it. I'll try another one.
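One way to make that "resumed where it left off" check repeatable is to note the last event counter in the job's running.log just before suspending and compare it after the resume. A hypothetical check from the VM console - the location of running.log under the execute directory is an assumption here, not something confirmed in this thread:

    # Hypothetical session inside the Theory VM; running.log path is assumed.
    grep "events processed" /var/lib/condor/execute/dir_*/running.log | tail -n 3
    # e.g. before suspend:  1300 events processed
    #      after resume:    1400 events processed   (counter continues, no restart)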