Message boards : Theory Application : Theory's endless looping

bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38657 - Posted: 27 Apr 2019, 23:24:25 UTC - in response to Message 38656.  

The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed.


OK. But I think you meant to say "slot folders" rather than "slots folder", yes? In other words, monitor the size of the numbered slot folders inside the folder named "slots", not the size of the folder named "slots" itself.

Never mind, I figured it out.
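For anyone scripting this, the two values discussed above can be read programmatically. A minimal sketch (not an official LHC@home script; the function names and sample path are illustrative) that extracts `<computation_deadline>` and `<rsc_disk_bound>` from a slot's init_data.xml and measures a numbered slot folder for comparison against the bound:

```python
# Sketch only: read the deadline and disk bound a watchdog script would use,
# assuming the usual BOINC layout .../slots/<n>/init_data.xml.
import os
import xml.etree.ElementTree as ET

def read_limits(init_data_path):
    """Return (computation_deadline, rsc_disk_bound in bytes) from init_data.xml."""
    root = ET.parse(init_data_path).getroot()
    deadline = float(root.findtext("computation_deadline"))
    disk_bound = float(root.findtext("rsc_disk_bound"))  # bytes
    return deadline, disk_bound

def slot_dir_size(slot_dir):
    """Total size in bytes of all files under one numbered slot folder."""
    total = 0
    for dirpath, _dirs, files in os.walk(slot_dir):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Usage would be along the lines of comparing `slot_dir_size("/var/lib/boinc/slots/4")` against the bound returned by `read_limits` on the same folder's init_data.xml (paths vary by installation).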
ID: 38657
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38669 - Posted: 30 Apr 2019, 5:13:00 UTC - in response to Message 38643.  

... I'm not yet sure if there is a #3. "Condor runtime limit".

This VM was shut down when the last job reached a runtime of a bit more than 36 h.
I wonder if there is an additional watchdog that normally doesn't become active because the 18 h limit is usually reached first.

Does anyone know?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=221858696
2019-04-28 17:42:59 (98673): Guest Log: [INFO] New Job Starting in slot1
2019-04-28 17:42:59 (98673): Guest Log: [INFO] Condor JobID:  495583.12 in slot1
2019-04-28 17:43:04 (98673): Guest Log: [INFO] MCPlots JobID: 49683949 in slot1
2019-04-28 17:44:23 Volunteer's Extension: Running Job: ===> [runRivet] Sun Apr 28 17:42:55 CEST 2019 [boinc ee zhad 91.2 - - sherpa 2.2.5 default 3000 48]
.
.
.
2019-04-30 04:28:01 (98673): Status Report: Job Duration: '300000.000000'
2019-04-30 04:28:01 (98673): Status Report: Elapsed Time: '132045.354213'
2019-04-30 04:28:01 (98673): Status Report: CPU Time: '91213.620000'

<edit>Expected a "Job finished in slot..." here but that's missing</edit>

2019-04-30 05:55:30 (98673): Guest Log: [INFO] Condor exited with return value N/A.
2019-04-30 05:55:30 (98673): Guest Log: [INFO] Shutting Down.
ID: 38669
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38670 - Posted: 30 Apr 2019, 10:11:08 UTC - in response to Message 38669.  
Last modified: 30 Apr 2019, 10:11:45 UTC

... I'm not yet sure if there is a #3. "Condor runtime limit".

This VM was shut down when the last job reached a runtime of a bit more than 36 h.
I wonder if there is an additional watchdog that normally doesn't become active as the 18 h limit is usually reached before.

Does anyone know?

See https://lhcathome.cern.ch/lhcathome/result.php?resultid=221836220
2019-04-29 23:02:03 (13231): Status Report: Job Duration: '864000.000000'
2019-04-29 23:02:03 (13231): Status Report: Elapsed Time: '151661.307370'
2019-04-29 23:02:03 (13231): Status Report: CPU Time: '145470.420000'
2019-04-29 23:02:34 (13231): Guest Log: [INFO] Job finished in slot1 with 0.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Condor exited with return value N/A.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Shutting Down.
2019-04-29 23:12:45 (13231): VM Completion File Detected.
2019-04-29 23:12:45 (13231): VM Completion Message: Condor exited with return value N/A.
.
2019-04-29 23:12:45 (13231): Powering off VM.
2019-04-29 23:12:46 (13231): Successfully stopped VM.
2019-04-29 23:12:46 (13231): Deregistering VM. (boinc_ccb51ef676d1e747, slot#2)
2019-04-29 23:12:46 (13231): Removing network bandwidth throttle group from VM.
2019-04-29 23:12:46 (13231): Removing storage controller(s) from VM.
2019-04-29 23:12:46 (13231): Removing VM from VirtualBox.
2019-04-29 23:12:46 (13231): Removing virtual disk drive from VirtualBox.
23:12:51 (13231): called boinc_finish(0)

If there is such an additional watchdog then it appears to be inconsistent. The above ran for 42 hours and got the expected "Job finished in slot" at 2019-04-29 23:02:34.
ID: 38670
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38672 - Posted: 30 Apr 2019, 17:33:55 UTC - in response to Message 38670.  

If there is such an additional watchdog then it appears to be inconsistent.

Not necessarily.
Your example shows a "VM state change" from running to paused and later back to running.
This may have reset the shutdown timer.

The runtime between the last state change and "Job finished" was less than 27 h.
My example has been running for more than 36 h without a break.

2019-04-28 20:02:27 (25942): VM state change detected. (old = 'running', new = 'paused')
.
.
.
2019-04-28 20:22:55 (13231): VM state change detected. (old = 'poweroff', new = 'running')
.
.
.
2019-04-29 23:02:34 (13231): Guest Log: [INFO] Job finished in slot1 with 0.
ID: 38672
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38673 - Posted: 30 Apr 2019, 17:46:09 UTC - in response to Message 38672.  

If there is such an additional watchdog then it appears to be inconsistent.

Not necessarily.
Your example shows a "VM state change" from running to paused and later back to running.
This may have reset the shutdown timer.

I missed that line, and yes, it may have reset the timer if there is one (yours may have shut down for some reason other than a Condor-imposed limit).
ID: 38673
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38680 - Posted: 1 May 2019, 12:05:46 UTC

Again a longrunner where the last job failed after just a bit more than 36 h runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569
ID: 38680
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38681 - Posted: 1 May 2019, 13:12:33 UTC

Another task that was stopped at the 36 h job limit:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221909367
ID: 38681
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38682 - Posted: 1 May 2019, 13:53:35 UTC - in response to Message 38680.  

Again a longrunner where the last job failed after just a bit more than 36 h runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569

OK, I'm convinced Condor imposes a 36.6 hour limit. If suspending the task causes Condor to reset the elapsed time counter, then it would seem a watchdog could extend a job beyond 36.6 hours by:
1) suspending the task when the current job reaches 36 hours
2) waiting until VirtualBox reports (via VBoxManage) that the VM has properly suspended
3) resuming the task
ID: 38682
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38694 - Posted: 4 May 2019, 15:06:51 UTC - in response to Message 38682.  

Again a longrunner where the last job failed after just a bit more than 36 h runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569

OK, I'm convinced Condor imposes a 36.6 hour limit. If suspending the task causes Condor to reset the elapsed time counter, then it would seem a watchdog could extend a job beyond 36.6 hours by:
1) suspending the task when the current job reaches 36 hours
2) waiting until VirtualBox reports (via VBoxManage) that the VM has properly suspended
3) resuming the task

I modified the above 3 steps into something easier to code and observe. The steps are now:
1) suspend the task at integer multiples of 35 hours of elapsed task time (so at 35, 70, 105, ...)
2) the user manually resumes the task

It seems to work. I have:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=222336270 with tet (task elapsed time) 2 days 9 hours crunching a Sherpa 2.2.4 with jet (job elapsed time) 50 hours
https://lhcathome.cern.ch/lhcathome/result.php?resultid=222334188 with tet 2 days 8 hours crunching a Sherpa 1.2.3 with jet 41 hours.
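A rough sketch of that suspend-at-35h workaround, assuming the stock `boinccmd` remote-control tool that ships with the BOINC client. This is hypothetical, untested code, not bronco's actual script: the function names and the project URL argument are placeholders, and whether suspending actually resets the timer is exactly what the experiment above is probing.

```python
# Hypothetical watchdog fragment for the workaround described above.
# Assumption: suspending the BOINC task pauses the VirtualBox VM and
# (possibly) resets Condor's shutdown timer. Not an official tool.
import subprocess

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"  # example project URL
SUSPEND_INTERVAL_S = 35 * 3600  # 35 h, as in the modified steps

def next_suspend_point(elapsed_s, interval_s=SUSPEND_INTERVAL_S):
    """Elapsed task time (seconds) at which the next suspend is due:
    the next integer multiple of 35 h after elapsed_s."""
    return (int(elapsed_s) // interval_s + 1) * interval_s

def suspend_task(task_name):
    """Suspend one task via boinccmd (the user resumes it manually)."""
    subprocess.run(["boinccmd", "--task", PROJECT_URL, task_name, "suspend"],
                   check=True)

def resume_task(task_name):
    subprocess.run(["boinccmd", "--task", PROJECT_URL, task_name, "resume"],
                   check=True)
```

`next_suspend_point(0)` is 35 h (126000 s), `next_suspend_point(126000)` is 70 h, and so on, matching the 35/70/105... schedule described above.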
ID: 38694
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38700 - Posted: 5 May 2019, 15:43:03 UTC - in response to Message 38694.  

You would think that, being the smart people that they are, they would eventually figure out that impossible-to-run tasks drive crunchers away.
Then, by removing those tasks, they could get more work done.

Just the fact that they would like to get them done does not automatically accomplish it.
ID: 38700
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,341,407
RAC: 101,788
Message 38701 - Posted: 5 May 2019, 15:50:09 UTC - in response to Message 38700.  

You would think that, being the smart people that they are, they would eventually figure out that impossible-to-run tasks drive crunchers away.
Then, by removing those tasks, they could get more work done.

Just the fact that they would like to get them done does not automatically accomplish it.

Jim, full agreement from my side!

But as we can see, the reality is unfortunately a different one :-(
ID: 38701
[AF>Le_Pommier] Jerome_C2005

Joined: 12 Jul 11
Posts: 93
Credit: 1,129,876
RAC: 7
Message 38752 - Posted: 9 May 2019, 10:04:49 UTC
Last modified: 9 May 2019, 10:06:29 UTC

Hi

I realized yesterday that I had a Native Theory task that had been running for the past 12 days!

18:31:11 2019-04-25: cranky-0.0.28: [INFO] Running Container 'runc'.
19:46:36 2019-04-26: cranky-0.0.28: [INFO] Pausing container TheoryN_2279-771890-48_0.
19:46:36 2019-04-26: cranky-0.0.28: [WARNING] Cannot pause container as /sys/fs/cgroup/freezer/boinc/freezer.state not exists.
20:22:43 2019-04-26: cranky-0.0.28: [INFO] Resuming container TheoryN_2279-771890-48_0.
container not paused


The status of the task in BoincTui was "dead" (I had never seen that before), but it was still eating a full CPU!

I killed it.

I hadn't read this thread, but I liked the title.
ID: 38752
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 39353 - Posted: 13 Jul 2019, 16:19:01 UTC

08:18:47 +0200 2019-07-13 [INFO] New Job Starting in slot1
08:18:47 +0200 2019-07-13 [INFO] Condor JobID: 503065.34 in slot1
08:18:52 +0200 2019-07-13 [INFO] MCPlots JobID: 50643654 in slot1

===> [runRivet] Sat Jul 13 08:18:47 CEST 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.2 default 2000 78]

Display update finished (0 histograms, 0 events).
2.47481e-12 pb +- ( 6.98515e-12 pb = 282.25 % ) 840000 ( 18224016 -> 8 % )
integration time: ( 7h 44m 28s elapsed / 24252d 19h 43m 44s left ) [17:59:12]
Updating display...

Euthanised humanely
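Log lines like the `integration time` one above lend themselves to automatic screening before manual euthanasia is needed. A hypothetical helper (not an official script; the regex is tailored to the Sherpa log format quoted above, and the 7-day threshold is an arbitrary choice) that flags a job as hopeless when the reported remaining integration time is absurd:

```python
# Sketch: detect hopeless Sherpa integrations from their log output.
# Matches lines like:
#   integration time: ( 7h 44m 28s elapsed / 24252d 19h 43m 44s left ) [17:59:12]
import re

LINE = re.compile(r"integration time:\s*\(.*?/\s*(?:(\d+)d\s+)?(\d+)h.*?left")

def days_left(log_line):
    """Remaining integration time in days, or None if the line doesn't match."""
    m = LINE.search(log_line)
    if not m:
        return None
    days = int(m.group(1) or 0)       # optional "NNNNd" part
    return days + int(m.group(2)) / 24.0

def is_hopeless(log_line, max_days=7):
    """True when Sherpa itself predicts more remaining time than max_days."""
    d = days_left(log_line)
    return d is not None and d > max_days
```

On the line quoted above this reports roughly 24252 days remaining, which any sane threshold would flag.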
ID: 39353
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 39354 - Posted: 13 Jul 2019, 17:38:03 UTC - in response to Message 39353.  

===> [runRivet] Sat Jul 13 08:18:47 CEST 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.2 default 2000 78]
.......
Euthanised humanely
That's not patient for the patient ;)

I dug into the recent MC Production and found 18000 successful events coming from 5 successful runs for that job description, out of 40 attempts: 4 attempts failed and 31 were lost.
ID: 39354
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 39459 - Posted: 29 Jul 2019, 19:39:17 UTC
Last modified: 29 Jul 2019, 19:44:53 UTC

===> [runRivet] Mon Jul 29 04:48:43 CEST 2019 [boinc pp jets 13000 250,-,4160 - sherpa 2.2.5 default 1000 81]

A few weeks ago, only 1 attempt out of 79 with the above job description succeeded.
In the last 2 weeks, 2 more jobs out of 6 new attempts succeeded, so even with

integration time: ( 9h 14m 7s elapsed / 21044d 22h 50m 51s left ) [18:23:04]

I will give it a try.
ID: 39459
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,902
RAC: 104,657
Message 39462 - Posted: 29 Jul 2019, 20:19:41 UTC

Hi Crystal,
this one finished today in -native
06:57:03 CEST +02:00 2019-07-29: cranky-0.0.29: [INFO] ===> [runRivet] Mon Jul 29 04:57:03 UTC 2019 [boinc pp jets 7000 80,-,1460 - sherpa 2.2.5 default 100000 84]
17:43:58 CEST +02:00 2019-07-29: cranky-0.0.29: [INFO] Container 'runc' finished with status code 0.
ID: 39462
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 39471 - Posted: 30 Jul 2019, 17:07:06 UTC - in response to Message 39459.  

===> [runRivet] Mon Jul 29 04:48:43 CEST 2019 [boinc pp jets 13000 250,-,4160 - sherpa 2.2.5 default 1000 81]

A few weeks ago, only 1 attempt out of 79 with the above job description succeeded.
In the last 2 weeks, 2 more jobs out of 6 new attempts succeeded, so even with

integration time: ( 9h 14m 7s elapsed / 21044d 22h 50m 51s left ) [18:23:04]

I will give it a try.
Unfortunately, the VM was shut down prematurely without my intervention, even though I had set the VM's lifetime to 10 days.
I don't know where that completion file came from so early; the job surely wasn't finished.

Run time 1 days 18 hours 40 min 28 sec
CPU time 1 days 15 hours 34 min 46 sec
ID: 39471


©2024 CERN