Message boards : Theory Application : Latest versions do not resume after longer suspend
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
@Laurence: Reported this at LHC-dev as well on 8 Sep 2018. Any chance the suspend/resume will be fixed in a next version, as reported in https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=418&postid=5459 ?

Tested this again here and noticed that after the resume the events are happily processed until the processes are killed after 40 minutes.

Snip of MasterLog:

09/24/18 16:40:19 Setting ready state 'Ready' for STARTD
09/24/18 17:40:14 Preen pid is 11497
09/24/18 17:40:15 DefaultReaper unexpectedly called on pid 11497, status 0.
09/25/18 07:29:11 CCBListener: no activity from CCB server in 32031s; assuming connection is dead.
09/25/18 07:29:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:42 condor_write(): Socket closed when trying to write 2319 bytes to collector vccondor01.cern.ch, fd is 12, errno=104 Connection reset by peer
09/25/18 07:29:42 Buf::write(): condor_write() failed
09/25/18 07:30:11 WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/25/18 07:30:12 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43666166

Snip of StartLog:

09/24/18 21:28:19 Got universe "VANILLA" (5) from request classad
09/24/18 21:28:19 State change: claim-activation protocol successful
09/24/18 21:28:19 Changing activity: Idle -> Busy
09/25/18 07:29:11 CCBListener: no activity from CCB server in 32151s; assuming connection is dead.
09/25/18 07:29:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:14 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 8, errno=104 Connection reset by peer
09/25/18 07:29:14 Buf::write(): condor_write() failed
09/25/18 07:30:11 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43666169

Snip of StarterLog:

09/24/18 21:28:22 (pid:7977) Starting a VANILLA universe job with ID: 472806.1380
09/24/18 21:28:22 (pid:7977) IWD: /var/lib/condor/execute/dir_7977
09/24/18 21:28:22 (pid:7977) Renice expr "10" evaluated to 10
09/24/18 21:28:22 (pid:7977) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_7977/condor_exec.exe
09/24/18 21:28:22 (pid:7977) Running job as user nobody
09/24/18 21:28:22 (pid:7977) Create_Process succeeded, pid=7981
09/25/18 07:29:40 (pid:7977) CCBListener: no activity from CCB server in 32174s; assuming connection is dead.
09/25/18 07:29:40 (pid:7977) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:40 (pid:7977) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/25/18 07:29:40 (pid:7977) IO: Failed to read packet header
09/25/18 07:29:40 (pid:7977) Lost connection to shadow, waiting 2400 secs for reconnect
09/25/18 07:30:40 (pid:7977) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/25/18 07:30:42 (pid:7977) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43672198
09/25/18 08:09:40 (pid:7977) No reconnect from shadow for 2400 seconds, aborting job execution!
09/25/18 08:09:40 (pid:7977) ShutdownFast all jobs.
09/25/18 08:09:40 (pid:7977) Process exited, pid=7981, signal=9
09/25/18 08:09:40 (pid:7977) About to exec Post script: /var/lib/condor/execute/dir_7977/post.sh 2110-625008-582
09/25/18 08:09:40 (pid:7977) Create_Process succeeded, pid=18022
09/25/18 08:09:41 (pid:7977) Process exited, pid=18022, status=0
09/25/18 08:09:42 (pid:7977) Failed to send job exit status to shadow

Note that the job (pid 7981) is killed with signal 9 exactly 2400 seconds, i.e. 40 minutes, after the connection to the shadow was lost at 07:29:40.

Last lines of running.log after happily processing for 40 minutes after the resume:

1000 events processed
dumping histograms...
Event 1100 ( 1m 34s elapsed / 2h 21m 13s left ) -> ETA: Tue Sep 25 10:30
1100 events processed
Event 1200 ( 1m 43s elapsed / 2h 22m 1s left ) -> ETA: Tue Sep 25 10:31
1200 events processed
Event 1300 ( 1m 53s elapsed / 2h 23m 15s left ) -> ETA: Tue Sep 25 10:32
1300 events processed

Thereafter nothing happens: no re-connections, no new job, no proper VM shutdown, no VM kill, only an idle VM until BOINC's 18-hour limit kicks in and kills the VM.
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
> Snip of StarterLog:
> 09/25/18 08:09:40 (pid:7977) No reconnect from shadow for 2400 seconds, aborting job execution!

I believe this is the smoking gun. The JobLeaseDuration was too short at the default of 2400s. New jobs should have this value set to 48 hours, but we need to get through the old jobs before we can test again.
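For reference, JobLeaseDuration is a per-job ClassAd attribute that can be set in the HTCondor submit description file; HTCondor's shipped default is 2400 s, i.e. 40 minutes. A minimal sketch of what the fix could look like on the submit side - everything here except job_lease_duration is an illustrative placeholder, not the project's actual submit file:

    # Hypothetical Theory submit description file (illustrative sketch only).
    universe           = vanilla
    executable         = run_theory_job.sh   # placeholder name, not confirmed above
    # Default lease is 2400 s (40 min): the starter aborts the job if the
    # shadow does not reconnect within that window after a suspend.
    # Raise it to 48 hours so an overnight/one-day suspend survives:
    job_lease_duration = 172800               # 48 * 3600 seconds
    queue

With this in place the starter-side log line becomes "waiting 172800 secs for reconnect" instead of "waiting 2400 secs", which is exactly what the later tests in this thread show.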
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
Just 48 hours and then the job lease expires? With a 30-day deadline it's possible for a Theory task to be preempted by other projects for more than 48 hours. It seems like the job lease duration should be 30 days.
Joined: 18 Dec 15 Posts: 1691 Credit: 104,607,009 RAC: 101,315
> Just 48 hours and then the job lease expires?

In fact, the 2400 seconds cited above is actually only 40 minutes - so even worse.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
Don't mix up task duration and job duration. BOINC's task expiration time is indeed 30 days, and normally a Theory task needs between 12 and 18 hours of elapsed time. Within the VM the real jobs are running, and their run-times vary a lot, but 80.8% of the jobs are ready within 100 minutes and 95% have finished within 200 minutes, so a JobLeaseDuration of 48 hours would be sufficient for most suspension periods.
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
> Don't mix up task duration and job duration.

I don't think I have mixed them up, but it's difficult to know whether you are responding to me or to Erich56, since your response "header" does not have an "in response to Message xxxxxxxxx" element.

> BOINC's task expiration time is indeed 30 days, and normally a Theory task needs between 12 and 18 hours of elapsed time.

I have studied the finished.log and running.log files extensively, so I understand that.

> but 80.8% of the jobs are ready within 100 minutes and 95% have finished within 200 minutes

That I was not aware of, thank you.

> so a JobLeaseDuration of 48 hours would be sufficient for most suspension periods

Why make the number sufficient for 95% when it could be made sufficient for 99.9% with no additional effort? Is there a rational reason for cutting off that 5%?
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> Why make the number sufficient for 95% when it could be made sufficient for 99.9% with no additional effort? Is there a rational reason for cutting off that 5%?

You misunderstand: 48 hours will not cut off 5%. 48 hours will be sufficient for over 99.99% of the jobs - I suppose so, though without hard evidence. The 95% is the percentage of all jobs that are ready within 200 minutes. Of 5.7 million jobs, only 7 have a run-time longer than 1 day (between 1470 minutes and 2055 minutes), and 5.63 million jobs are ready within 6 hours. 48 hours will not be sufficient for those crunchers suspending a Theory task for longer than 2 days, but I think they are very, very rare.
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
> 48 hours will not be sufficient for those crunchers suspending a Theory task for longer than 2 days, but I think they are very, very rare.

Why cut off any job that isn't looping/stalled and has not run into the 18-hour run limit or the 30-day deadline? Why? Does it make the database appreciably smaller? The fact that it cuts off very, very few doesn't make any sense unless there is some benefit to be derived from doing so.
Joined: 15 Jun 08 Posts: 2425 Credit: 227,475,238 RAC: 129,792
> Why cut off any job that isn't looping/stalled ...

Because it may not be possible to detect all kinds of endless loops. You may treat it like a car airbag: most people will never need it, but some are very happy that they had one. Well, I guess I know your answer: "I never needed an airbag since I got my driving licence on one of Pharaoh Ramses' brand-new chariots."
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
> Why cut off any job that isn't looping/stalled ...

If a job is looping, it will most certainly end at the 18-hour task limit when the VM is destroyed. We all know that, so why even bother stating the obvious? To deflect from the fact that you cannot answer my question and give me a REAL reason to cut off a job that may be viable, simply because it has been suspended for more than 2 days?

> You may treat it like a car airbag.

An airbag is a good analogy. You would deploy the airbag even when there is no collision detected.

> Most people will never need it, but some are very happy that they had one.

Happy when the airbag deploys in a collision. Not so happy when it deploys when there is no collision.

> Well, I guess I know your answer: ...

ROFLMAO. Wrong again.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> ... give me a REAL reason to cut off a job that may be viable, simply because it has been suspended for more than 2 days?

I can't give the REAL reason, because I don't know all the ins and outs of CERN's processing flows, but I can imagine that if one had to wait for 30 days, too many connections would stay open. 48 hours is arbitrary, but 40 minutes was really too short for us BOINC crunchers. Keep in mind that most Theory jobs are not done by BOINC.
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
Yes, 48 hours is much better than 40 minutes, and too many connections staying open makes sense to me as a reason. Thanks.
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
Yes, 48 hours is arbitrary. The use case is supporting an overnight suspend/resume, and we can extend this to one day, so 48 hours should be long enough to support a 24-hour suspend/resume. In general we do want the results of a specific job, so there needs to be a cutoff period after which we consider the job lost.
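Jobs that don't set a lease themselves fall back to the schedd's JOB_DEFAULT_LEASE_DURATION configuration macro (which ships as 2400 s). A hedged sketch of how the same 48-hour cutoff could be applied globally on the submit side instead of per job - this assumes the project would rather change the condor_config than every submit file, which is not confirmed in the thread:

    # Hypothetical condor_config fragment on the CERN submit side (sketch only).
    # Any job without an explicit job_lease_duration picks up this default;
    # the stock value of 2400 s (40 min) is what killed the suspended jobs above.
    JOB_DEFAULT_LEASE_DURATION = 172800   # 48 hours, matching the stated use case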
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
Am testing the suspend/resume now.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> Am testing the suspend/resume now.

The jobs with the old setting are gone:

09/26/18 11:21:57 (pid:4348) CCBListener: no activity from CCB server in 4949s; assuming connection is dead.
09/26/18 11:21:57 (pid:4348) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 11:21:57 (pid:4348) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 11:21:57 (pid:4348) IO: Failed to read packet header
09/26/18 11:21:57 (pid:4348) Lost connection to shadow, waiting 172800 secs for reconnect

The interesting question, of course, is whether jobs finished after the resume arrive and are used.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
> The interesting question, of course, is whether jobs finished after the resume arrive and are used.

The first job after the resume has finished, but so far I haven't received a new job, and after 10 minutes the VM was killed.

09/26/18 09:39:29 (pid:4348) Running job as user nobody
09/26/18 09:39:29 (pid:4348) Create_Process succeeded, pid=4352
09/26/18 11:21:57 (pid:4348) CCBListener: no activity from CCB server in 4949s; assuming connection is dead.
09/26/18 11:21:57 (pid:4348) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 11:21:57 (pid:4348) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 11:21:57 (pid:4348) IO: Failed to read packet header
09/26/18 11:21:57 (pid:4348) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 11:22:58 (pid:4348) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/26/18 11:22:58 (pid:4348) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43717470
09/26/18 11:23:03 (pid:4348) Accepted request to reconnect from <:0>
09/26/18 11:23:03 (pid:4348) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219360>
09/26/18 11:23:03 (pid:4348) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219360>
09/26/18 11:23:03 (pid:4348) Recovered connection to shadow after 66 seconds
09/26/18 11:41:01 (pid:4348) Process exited, pid=4352, status=0
09/26/18 11:41:01 (pid:4348) About to exec Post script: /var/lib/condor/execute/dir_4348/post.sh 2110-611195-584
09/26/18 11:41:01 (pid:4348) Create_Process succeeded, pid=12979
09/26/18 11:41:02 (pid:4348) Process exited, pid=12979, status=0
09/26/18 11:41:03 (pid:4348) Got SIGQUIT. Performing fast shutdown.
09/26/18 11:41:03 (pid:4348) ShutdownFast all jobs.
09/26/18 11:41:03 (pid:4348) **** condor_starter (condor_STARTER) pid 4348 EXITING WITH STATUS 0
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0
I got a new job.
Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,990 RAC: 2,442
With the next try after a longer suspend and resume, a new job arrived after condor_starter was started up again.

09/26/18 10:09:26 (pid:4349) Create_Process succeeded, pid=4353
09/26/18 12:59:44 (pid:4349) CCBListener: no activity from CCB server in 5718s; assuming connection is dead.
09/26/18 12:59:44 (pid:4349) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 12:59:44 (pid:4349) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 12:59:44 (pid:4349) IO: Failed to read packet header
09/26/18 12:59:44 (pid:4349) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 13:00:45 (pid:4349) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/26/18 13:00:45 (pid:4349) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43718333
09/26/18 13:01:27 (pid:4349) Accepted request to reconnect from <:0>
09/26/18 13:01:27 (pid:4349) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219914>
09/26/18 13:01:27 (pid:4349) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219914>
09/26/18 13:01:27 (pid:4349) Recovered connection to shadow after 103 seconds
09/26/18 13:24:27 (pid:4349) Process exited, pid=4353, status=0
09/26/18 13:24:27 (pid:4349) About to exec Post script: /var/lib/condor/execute/dir_4349/post.sh 2110-617475-584
09/26/18 13:24:27 (pid:4349) Create_Process succeeded, pid=14955
09/26/18 13:24:27 (pid:4349) Process exited, pid=14955, status=0
09/26/18 13:24:29 (pid:4349) Got SIGQUIT. Performing fast shutdown.
09/26/18 13:24:29 (pid:4349) ShutdownFast all jobs.
09/26/18 13:24:29 (pid:4349) **** condor_starter (condor_STARTER) pid 4349 EXITING WITH STATUS 0
09/26/18 13:24:29 (pid:14959) ******************************************************
09/26/18 13:24:29 (pid:14959) ** condor_starter (CONDOR_STARTER) STARTING UP
09/26/18 13:24:29 (pid:14959) ** /usr/sbin/condor_starter
09/26/18 13:24:29 (pid:14959) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
09/26/18 13:24:29 (pid:14959) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
09/26/18 13:24:29 (pid:14959) ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
09/26/18 13:24:29 (pid:14959) ** $CondorPlatform: x86_64_RedHat6 $
09/26/18 13:24:29 (pid:14959) ** PID = 14959
09/26/18 13:24:29 (pid:14959) ** Log last touched 9/26 13:24:29
09/26/18 13:24:29 (pid:14959) ******************************************************
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
I received a new Theory task at about 2:59 this morning. I allowed it to run for several minutes so it would be assigned a Condor ID, then I suspended it. At approximately 9:10 I resumed the task, and it seems to have resumed the same job that was suspended at 2:59. See the snippet from StarterLog below.

09/26/18 02:59:40 (pid:4260) IWD: /var/lib/condor/execute/dir_4260
09/26/18 02:59:40 (pid:4260) Renice expr "10" evaluated to 10
09/26/18 02:59:40 (pid:4260) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_4260/condor_exec.exe
09/26/18 02:59:40 (pid:4260) Running job as user nobody
09/26/18 02:59:41 (pid:4260) Create_Process succeeded, pid=4264
09/26/18 09:12:59 (pid:4260) CCBListener: no activity from CCB server in 21202s; assuming connection is dead.
09/26/18 09:12:59 (pid:4260) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 09:12:59 (pid:4260) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 09:12:59 (pid:4260) IO: Failed to read packet header
09/26/18 09:12:59 (pid:4260) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 09:14:02 (pid:4260) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43719876
09/26/18 09:15:09 (pid:4260) Accepted request to reconnect from <:0>
09/26/18 09:15:09 (pid:4260) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_220981>
09/26/18 09:15:09 (pid:4260) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_220981>
09/26/18 09:15:09 (pid:4260) Recovered connection to shadow after 130 seconds

edit: I didn't think to record the event number before suspending the task, but the job seems to have resumed at about where I estimate the event number would have been when I suspended it. I'll try another one.
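One way to make that "resumed where it left off" check repeatable is to note the last event counter in the job's running.log just before suspending and compare it after the resume. A hypothetical check from the VM console - the location of running.log under the execute directory is an assumption here, not something confirmed in this thread:

    # Hypothetical session inside the Theory VM; running.log path is assumed.
    grep "events processed" /var/lib/condor/execute/dir_*/running.log | tail -n 3
    # e.g. before suspend:  1300 events processed
    #      after resume:    1400 events processed   (counter continues, no restart)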