Message boards : Theory Application : Latest versions do not resume after longer suspend
Crystal Pellet
Volunteer moderator
Volunteer tester

Message 36870 - Posted: 25 Sep 2018, 6:55:13 UTC
Last modified: 25 Sep 2018, 14:28:32 UTC

@Laurence:

I reported this at LHC-dev as well on 8 Sep 2018.
Is there any chance the suspend/resume will be fixed in a next version? (Also asked in https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=418&postid=5459)

I tested this again here and noticed that after the resume the events are happily processed, until the processes are killed after 40 minutes.

Snip of MasterLog:
09/24/18 16:40:19 Setting ready state 'Ready' for STARTD
09/24/18 17:40:14 Preen pid is 11497
09/24/18 17:40:15 DefaultReaper unexpectedly called on pid 11497, status 0.
09/25/18 07:29:11 CCBListener: no activity from CCB server in 32031s; assuming connection is dead.
09/25/18 07:29:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:42 condor_write(): Socket closed when trying to write 2319 bytes to collector vccondor01.cern.ch, fd is 12, errno=104 Connection reset by peer
09/25/18 07:29:42 Buf::write(): condor_write() failed
09/25/18 07:30:11 WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/25/18 07:30:12 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43666166


Snip of StartLog:
09/24/18 21:28:19 Got universe "VANILLA" (5) from request classad
09/24/18 21:28:19 State change: claim-activation protocol successful
09/24/18 21:28:19 Changing activity: Idle -> Busy
09/25/18 07:29:11 CCBListener: no activity from CCB server in 32151s; assuming connection is dead.
09/25/18 07:29:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:14 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 8, errno=104 Connection reset by peer
09/25/18 07:29:14 Buf::write(): condor_write() failed
09/25/18 07:30:11 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43666169


Snip of StarterLog:
09/24/18 21:28:22 (pid:7977) Starting a VANILLA universe job with ID: 472806.1380
09/24/18 21:28:22 (pid:7977) IWD: /var/lib/condor/execute/dir_7977
09/24/18 21:28:22 (pid:7977) Renice expr "10" evaluated to 10
09/24/18 21:28:22 (pid:7977) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_7977/condor_exec.exe
09/24/18 21:28:22 (pid:7977) Running job as user nobody
09/24/18 21:28:22 (pid:7977) Create_Process succeeded, pid=7981
09/25/18 07:29:40 (pid:7977) CCBListener: no activity from CCB server in 32174s; assuming connection is dead.
09/25/18 07:29:40 (pid:7977) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/25/18 07:29:40 (pid:7977) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/25/18 07:29:40 (pid:7977) IO: Failed to read packet header
09/25/18 07:29:40 (pid:7977) Lost connection to shadow, waiting 2400 secs for reconnect
09/25/18 07:30:40 (pid:7977) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/25/18 07:30:42 (pid:7977) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43672198
09/25/18 08:09:40 (pid:7977) No reconnect from shadow for 2400 seconds, aborting job execution!
09/25/18 08:09:40 (pid:7977) ShutdownFast all jobs.
09/25/18 08:09:40 (pid:7977) Process exited, pid=7981, signal=9
09/25/18 08:09:40 (pid:7977) About to exec Post script: /var/lib/condor/execute/dir_7977/post.sh 2110-625008-582
09/25/18 08:09:40 (pid:7977) Create_Process succeeded, pid=18022
09/25/18 08:09:41 (pid:7977) Process exited, pid=18022, status=0
09/25/18 08:09:42 (pid:7977) Failed to send job exit status to shadow


Last lines of running.log, after 40 minutes of happy processing following the resume:
1000 events processed
dumping histograms...
  Event 1100 ( 1m 34s elapsed / 2h 21m 13s left ) -> ETA: Tue Sep 25 10:30  
1100 events processed
  Event 1200 ( 1m 43s elapsed / 2h 22m 1s left ) -> ETA: Tue Sep 25 10:31  
1200 events processed
  Event 1300 ( 1m 53s elapsed / 2h 23m 15s left ) -> ETA: Tue Sep 25 10:32  
1300 events processed


Thereafter nothing happens: no reconnections, no new job, no proper VM shutdown, no VM kill; just an idle VM until BOINC's 18-hour limit kicks in and kills the VM.
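(For reference, a minimal sketch of how one could check which lease a running job is actually carrying, from inside the VM. It assumes the HTCondor starter drops the job ClassAd as a .job.ad file into the job's scratch directory, matching the /var/lib/condor/execute/dir_* paths in the StarterLog above; this is only an illustration, not part of the Theory image.)

```python
# Sketch only: print the JobLeaseDuration (in seconds) from the job ClassAd
# files the HTCondor starter writes into each job's scratch directory.
# The path pattern is assumed from the StarterLog above; adjust if it differs.
import glob
import re

for ad_file in glob.glob("/var/lib/condor/execute/dir_*/.job.ad"):
    with open(ad_file) as f:
        for line in f:
            if re.match(r"(?i)JobLeaseDuration\s*=", line):
                # e.g. "JobLeaseDuration = 2400" -> the 40-minute default seen here
                print(ad_file, "->", line.strip())
```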
Laurence
Project administrator
Project developer

Message 36871 - Posted: 25 Sep 2018, 11:57:47 UTC - in response to Message 36870.  
Last modified: 25 Sep 2018, 11:58:52 UTC

Snip of StarterLog:
09/25/18 08:09:40 (pid:7977) No reconnect from shadow for 2400 seconds, aborting job execution!

I believe this is the smoking gun. The JobLeaseDuration was too short at the default of 2400 s (40 minutes). New jobs should have this value set to 48 hours, but we need to get through the old jobs before we can test again.
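(A minimal sketch of what that change amounts to on the submit side, using the htcondor Python bindings; the executable name is invented and this is not the project's actual submission code.)

```python
# Sketch only: the submit command "job_lease_duration" (seconds) populates the
# JobLeaseDuration job attribute, i.e. how long the starter waits for the
# shadow to reconnect after a suspend/resume before aborting the job.
import htcondor

sub = htcondor.Submit({
    "universe": "vanilla",
    "executable": "run_job.sh",       # invented name, for illustration only
    "job_lease_duration": "172800",   # 48 hours, instead of the 2400 s default
})
print(sub)                            # shows the resulting submit description
# htcondor.Schedd().submit(sub)       # actual submission would need a schedd
```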
bronco

Message 36872 - Posted: 25 Sep 2018, 16:29:36 UTC - in response to Message 36871.  

Just 48 hours and then the job lease expires? With a 30-day deadline it's possible for a Theory task to be preempted by other projects for more than 48 hours. It seems like the job lease duration should be 30 days.
Erich56

Message 36873 - Posted: 25 Sep 2018, 16:50:49 UTC - in response to Message 36872.  

Just 48 hours and then the job lease expires?
In fact, the 2400 seconds cited above is actually only 40 minutes - so even worse.
Crystal Pellet
Volunteer moderator
Volunteer tester

Message 36874 - Posted: 25 Sep 2018, 17:33:15 UTC
Last modified: 25 Sep 2018, 17:35:33 UTC

Don't mix up task duration and job duration.
BOINC's task expiration time is indeed 30 days, and normally a Theory task needs between 12 and 18 hours of elapsed time.
Within the VM the real jobs are running, and their run-times vary a lot, but 80.8% of the jobs are ready within 100 minutes
and 95% of the jobs have finished within 200 minutes, so a JobLeaseDuration of 48 hours would be sufficient for most suspension periods.
bronco

Message 36875 - Posted: 25 Sep 2018, 18:27:59 UTC - in response to Message 36874.  

Don't mix up task duration and job duration.


I don't think I have mixed them up, but it's difficult to know whether you are responding to me or to Erich56, since your response "header" does not have an "in response to Message xxxxxxxxx" element.

BOINC's task expiration time is indeed 30 days, and normally a Theory task needs between 12 and 18 hours of elapsed time.
Within the VM the real jobs are running, and their run-times vary a lot


I have studied the finished.log and running.log files extensively so I understand that

but 80.8% of the jobs are ready within 100 minutes and 95% of the jobs have finished within 200 minutes


That I was not aware of, thank you.

so a JobLeaseDuration of 48 hours would be sufficient for most suspension periods


Why make the number sufficient for 95% when it could be made sufficient for 99.9% with no additional effort? Is there a rational reason for cutting off that 5%?
Crystal Pellet
Volunteer moderator
Volunteer tester

Message 36876 - Posted: 25 Sep 2018, 19:42:23 UTC - in response to Message 36875.  

Why make the number sufficient for 95% when it could be made sufficient for 99.9% with no additional effort? Is there a rational reason for cutting off that 5%?
You misunderstand; 48 hours will not cut off 5%.
I suppose, though without hard evidence, that 48 hours will be sufficient for over 99.99% of the jobs.
The 95% is the percentage of all jobs ready within 200 minutes.
Of 5.7 million jobs, only 7 (roughly 0.0001%) have a run-time longer than 1 day (between 1470 and 2055 minutes), and 5.63 million jobs are ready within 6 hours.
48 hours will not be sufficient for those crunchers suspending a Theory task for longer than 2 days. I think they are very very rare.
bronco

Message 36877 - Posted: 25 Sep 2018, 22:14:37 UTC - in response to Message 36876.  

48 hours will not be sufficient for those crunchers suspending a Theory task for longer than 2 days. I think they are very very rare.


Why cut off any job that isn't looping/stalled and has not run into the 18-hour run limit or the 30-day deadline? Why? Does it make the database appreciably smaller? The fact that it cuts off very, very few doesn't make any sense unless there is some benefit to be derived from doing so.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Message 36879 - Posted: 26 Sep 2018, 6:26:41 UTC - in response to Message 36877.  

Why cut off any job that isn't looping/stalled ...

Because it may not be possible to detect all kinds of endless loops.

You may treat it like a car airbag.
Most people will never need it, but some are very happy that they had one.

Well, I guess I know your answer:
"I never needed an airbag since I made my driving licence on one of Pharaoh Ramses' brand-new chariots."
bronco

Message 36881 - Posted: 26 Sep 2018, 7:37:30 UTC - in response to Message 36879.  

Why cut off any job that isn't looping/stalled ...

Because it may not be possible to detect all kinds of endless loops.


If a job is looping it will most certainly end at the 18-hour task limit when the VM is destroyed. We all know that, so why even bother stating the obvious? To deflect from the fact that you cannot answer my question and give me a REAL reason to cut off a job that may be viable simply because it has been suspended for more than 2 days?

You may treat it like a car airbag.


An airbag is a good analogy. But you would deploy the airbag even when no collision is detected.

Most people will never need it, but some are very happy that they had one.

Happy when the airbag deploys in a collision. Not so happy when it deploys when there is no collision.

Well, I guess I know your answer:
"I never needed an airbag since I made my driving licence on one of Pharaoh Ramses' brand-new chariots."


ROFLMAO Wrong again.
Crystal Pellet
Volunteer moderator
Volunteer tester

Message 36882 - Posted: 26 Sep 2018, 7:50:16 UTC - in response to Message 36881.  
Last modified: 26 Sep 2018, 7:52:07 UTC

... give me a REAL reason to cut off a job that may be viable simply because it has been suspended for more than 2 days?

I can't give a/the REAL reason, because I don't know all the ins and outs of CERN's processing flows,
but I can imagine that if one had to wait for 30 days, too many connections would stay open.
48 hours is arbitrary, but 40 minutes was really too short for us BOINC crunchers.
Keep in mind that most Theory jobs are not done by BOINC.
bronco

Message 36884 - Posted: 26 Sep 2018, 8:19:23 UTC - in response to Message 36882.  

Yes, 48 hours is much better than 40 minutes.
Too many connections open makes sense to me. Thanks.
Laurence
Project administrator
Project developer

Message 36885 - Posted: 26 Sep 2018, 8:22:04 UTC - in response to Message 36882.  
Last modified: 26 Sep 2018, 8:22:43 UTC

Yes, 48 hours is arbitrary. The use case is supporting an overnight suspend/resume, and we can extend this to one day. So 48 hours should be long enough to support a 24-hour suspend/resume. In general we do want the results of a specific job, so there needs to be a cutoff period after which we consider the job lost.
Laurence
Project administrator
Project developer

Message 36886 - Posted: 26 Sep 2018, 8:22:27 UTC - in response to Message 36885.  

Am testing the suspend/resume now.
Crystal Pellet
Volunteer moderator
Volunteer tester

Message 36887 - Posted: 26 Sep 2018, 9:29:36 UTC - in response to Message 36886.  

Am testing the suspend/resume now.

The jobs with the old setting are gone; the StarterLog now shows the new 48-hour lease (172800 s):
09/26/18 11:21:57 (pid:4348) CCBListener: no activity from CCB server in 4949s; assuming connection is dead.
09/26/18 11:21:57 (pid:4348) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 11:21:57 (pid:4348) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 11:21:57 (pid:4348) IO: Failed to read packet header
09/26/18 11:21:57 (pid:4348) Lost connection to shadow, waiting 172800 secs for reconnect
The interesting question, of course, is whether the jobs finished after the resume arrive and are used.
Crystal Pellet
Volunteer moderator
Volunteer tester

Message 36888 - Posted: 26 Sep 2018, 9:54:10 UTC - in response to Message 36887.  

The interesting question, of course, is whether the jobs finished after the resume arrive and are used.
The first job after the resume has finished, but so far I have not gotten a new job, and after 10 minutes the VM was killed.

09/26/18 09:39:29 (pid:4348) Running job as user nobody
09/26/18 09:39:29 (pid:4348) Create_Process succeeded, pid=4352
09/26/18 11:21:57 (pid:4348) CCBListener: no activity from CCB server in 4949s; assuming connection is dead.
09/26/18 11:21:57 (pid:4348) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 11:21:57 (pid:4348) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 11:21:57 (pid:4348) IO: Failed to read packet header
09/26/18 11:21:57 (pid:4348) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 11:22:58 (pid:4348) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/26/18 11:22:58 (pid:4348) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43717470
09/26/18 11:23:03 (pid:4348) Accepted request to reconnect from <:0>
09/26/18 11:23:03 (pid:4348) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219360>
09/26/18 11:23:03 (pid:4348) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219360>
09/26/18 11:23:03 (pid:4348) Recovered connection to shadow after 66 seconds
09/26/18 11:41:01 (pid:4348) Process exited, pid=4352, status=0
09/26/18 11:41:01 (pid:4348) About to exec Post script: /var/lib/condor/execute/dir_4348/post.sh 2110-611195-584
09/26/18 11:41:01 (pid:4348) Create_Process succeeded, pid=12979
09/26/18 11:41:02 (pid:4348) Process exited, pid=12979, status=0
09/26/18 11:41:03 (pid:4348) Got SIGQUIT.  Performing fast shutdown.
09/26/18 11:41:03 (pid:4348) ShutdownFast all jobs.
09/26/18 11:41:03 (pid:4348) **** condor_starter (condor_STARTER) pid 4348 EXITING WITH STATUS 0
Laurence
Project administrator
Project developer

Message 36890 - Posted: 26 Sep 2018, 11:27:35 UTC - in response to Message 36888.  

I got a new job.
Crystal Pellet
Volunteer moderator
Volunteer tester

Message 36891 - Posted: 26 Sep 2018, 11:32:58 UTC

On the next try, after a longer suspend and resume, a new job arrived once condor_starter had started up again.

09/26/18 10:09:26 (pid:4349) Create_Process succeeded, pid=4353
09/26/18 12:59:44 (pid:4349) CCBListener: no activity from CCB server in 5718s; assuming connection is dead.
09/26/18 12:59:44 (pid:4349) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 12:59:44 (pid:4349) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 12:59:44 (pid:4349) IO: Failed to read packet header
09/26/18 12:59:44 (pid:4349) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 13:00:45 (pid:4349) WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
09/26/18 13:00:45 (pid:4349) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43718333
09/26/18 13:01:27 (pid:4349) Accepted request to reconnect from <:0>
09/26/18 13:01:27 (pid:4349) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219914>
09/26/18 13:01:27 (pid:4349) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_219914>
09/26/18 13:01:27 (pid:4349) Recovered connection to shadow after 103 seconds
09/26/18 13:24:27 (pid:4349) Process exited, pid=4353, status=0
09/26/18 13:24:27 (pid:4349) About to exec Post script: /var/lib/condor/execute/dir_4349/post.sh 2110-617475-584
09/26/18 13:24:27 (pid:4349) Create_Process succeeded, pid=14955
09/26/18 13:24:27 (pid:4349) Process exited, pid=14955, status=0
09/26/18 13:24:29 (pid:4349) Got SIGQUIT.  Performing fast shutdown.
09/26/18 13:24:29 (pid:4349) ShutdownFast all jobs.
09/26/18 13:24:29 (pid:4349) **** condor_starter (condor_STARTER) pid 4349 EXITING WITH STATUS 0
09/26/18 13:24:29 (pid:14959) ******************************************************
09/26/18 13:24:29 (pid:14959) ** condor_starter (CONDOR_STARTER) STARTING UP
09/26/18 13:24:29 (pid:14959) ** /usr/sbin/condor_starter
09/26/18 13:24:29 (pid:14959) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
09/26/18 13:24:29 (pid:14959) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
09/26/18 13:24:29 (pid:14959) ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
09/26/18 13:24:29 (pid:14959) ** $CondorPlatform: x86_64_RedHat6 $
09/26/18 13:24:29 (pid:14959) ** PID = 14959
09/26/18 13:24:29 (pid:14959) ** Log last touched 9/26 13:24:29
09/26/18 13:24:29 (pid:14959) ******************************************************
bronco

Message 36892 - Posted: 26 Sep 2018, 15:35:55 UTC
Last modified: 26 Sep 2018, 15:49:47 UTC

I received a new Theory task at about 2:59 this morning. I let it run for several minutes so that it would be assigned a Condor ID, then I suspended it.
At approximately 9:10 I resumed the task, and it seems to have resumed the same job that was suspended at 2:59. See the snippet from the StarterLog below.

  
09/26/18 02:59:40 (pid:4260) IWD: /var/lib/condor/execute/dir_4260
09/26/18 02:59:40 (pid:4260) Renice expr "10" evaluated to 10
09/26/18 02:59:40 (pid:4260) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_4260/condor_exec.exe
09/26/18 02:59:40 (pid:4260) Running job as user nobody
09/26/18 02:59:41 (pid:4260) Create_Process succeeded, pid=4264
09/26/18 09:12:59 (pid:4260) CCBListener: no activity from CCB server in 21202s; assuming connection is dead.
09/26/18 09:12:59 (pid:4260) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/26/18 09:12:59 (pid:4260) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.94.254:9618>.
09/26/18 09:12:59 (pid:4260) IO: Failed to read packet header
09/26/18 09:12:59 (pid:4260) Lost connection to shadow, waiting 172800 secs for reconnect
09/26/18 09:14:02 (pid:4260) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43719876
09/26/18 09:15:09 (pid:4260) Accepted request to reconnect from <:0>
09/26/18 09:15:09 (pid:4260) Ignoring old shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_220981>
09/26/18 09:15:09 (pid:4260) Communicating with shadow <188.184.94.254:9618?addrs=188.184.94.254-9618+[2001-1458-201-e4--100-6f8]-9618&noUDP&sock=31213_21ce_220981>
09/26/18 09:15:09 (pid:4260) Recovered connection to shadow after 130 seconds


edit:
I didn't think to record the event number before suspending the task, but the job seems to have resumed at roughly the event number I estimate it had reached when I suspended it.
I'll try another one.
