Message boards : Theory Application : Theory's endless looping

bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38657 - Posted: 27 Apr 2019, 23:24:25 UTC - in response to Message 38656.  

The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed.


OK. But I think you meant to say "slot folders" rather than "slots folder", yes? In other words, monitor the size of the numbered slot folders inside the folder named "slots", not the size of the folder named "slots" itself.

Never mind, I figured it out.
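For anyone scripting this, the two values discussed above can be read programmatically. A minimal sketch (not an official LHC@home script; the function names and sample path are illustrative) that extracts `<computation_deadline>` and `<rsc_disk_bound>` from a slot's init_data.xml and measures a numbered slot folder for comparison against the bound:

```python
# Sketch only: read the deadline and disk bound a watchdog script would use,
# assuming the usual BOINC layout .../slots/<n>/init_data.xml.
import os
import xml.etree.ElementTree as ET

def read_limits(init_data_path):
    """Return (computation_deadline, rsc_disk_bound in bytes) from init_data.xml."""
    root = ET.parse(init_data_path).getroot()
    deadline = float(root.findtext("computation_deadline"))
    disk_bound = float(root.findtext("rsc_disk_bound"))  # bytes
    return deadline, disk_bound

def slot_dir_size(slot_dir):
    """Total size in bytes of all files under one numbered slot folder."""
    total = 0
    for dirpath, _dirs, files in os.walk(slot_dir):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Usage would be along the lines of comparing `slot_dir_size("/var/lib/boinc/slots/4")` against the bound returned by `read_limits` on the same folder's init_data.xml (paths vary by installation).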
ID: 38657
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38669 - Posted: 30 Apr 2019, 5:13:00 UTC - in response to Message 38643.  

... I'm not yet sure if there is a #3. "Condor runtime limit".

This VM was shut down when the last job reached a runtime of a bit more than 36 h.
I wonder if there is an additional watchdog that normally doesn't become active because the 18 h limit is usually reached first.

Does anyone know?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=221858696
2019-04-28 17:42:59 (98673): Guest Log: [INFO] New Job Starting in slot1
2019-04-28 17:42:59 (98673): Guest Log: [INFO] Condor JobID:  495583.12 in slot1
2019-04-28 17:43:04 (98673): Guest Log: [INFO] MCPlots JobID: 49683949 in slot1
2019-04-28 17:44:23 Volunteer's Extension: Running Job: ===> [runRivet] Sun Apr 28 17:42:55 CEST 2019 [boinc ee zhad 91.2 - - sherpa 2.2.5 default 3000 48]
.
.
.
2019-04-30 04:28:01 (98673): Status Report: Job Duration: '300000.000000'
2019-04-30 04:28:01 (98673): Status Report: Elapsed Time: '132045.354213'
2019-04-30 04:28:01 (98673): Status Report: CPU Time: '91213.620000'

<edit>Expected a "Job finished in slot..." here but that's missing</edit>

2019-04-30 05:55:30 (98673): Guest Log: [INFO] Condor exited with return value N/A.
2019-04-30 05:55:30 (98673): Guest Log: [INFO] Shutting Down.
ID: 38669
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38670 - Posted: 30 Apr 2019, 10:11:08 UTC - in response to Message 38669.  
Last modified: 30 Apr 2019, 10:11:45 UTC

... I'm not yet sure if there is a #3. "Condor runtime limit".

This VM was shut down when the last job reached a runtime of a bit more than 36 h.
I wonder if there is an additional watchdog that normally doesn't become active as the 18 h limit is usually reached before.

Does anyone know?

See https://lhcathome.cern.ch/lhcathome/result.php?resultid=221836220
2019-04-29 23:02:03 (13231): Status Report: Job Duration: '864000.000000'
2019-04-29 23:02:03 (13231): Status Report: Elapsed Time: '151661.307370'
2019-04-29 23:02:03 (13231): Status Report: CPU Time: '145470.420000'
2019-04-29 23:02:34 (13231): Guest Log: [INFO] Job finished in slot1 with 0.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Condor exited with return value N/A.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Shutting Down.
2019-04-29 23:12:45 (13231): VM Completion File Detected.
2019-04-29 23:12:45 (13231): VM Completion Message: Condor exited with return value N/A.
.
2019-04-29 23:12:45 (13231): Powering off VM.
2019-04-29 23:12:46 (13231): Successfully stopped VM.
2019-04-29 23:12:46 (13231): Deregistering VM. (boinc_ccb51ef676d1e747, slot#2)
2019-04-29 23:12:46 (13231): Removing network bandwidth throttle group from VM.
2019-04-29 23:12:46 (13231): Removing storage controller(s) from VM.
2019-04-29 23:12:46 (13231): Removing VM from VirtualBox.
2019-04-29 23:12:46 (13231): Removing virtual disk drive from VirtualBox.
23:12:51 (13231): called boinc_finish(0)

If there is such an additional watchdog then it appears to be inconsistent. The above ran for 42 hours and got the expected "Job finished in slot" at 2019-04-29 23:02:34.
ID: 38670
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38672 - Posted: 30 Apr 2019, 17:33:55 UTC - in response to Message 38670.  

If there is such an additional watchdog then it appears to be inconsistent.

Not necessarily.
Your example shows a "VM state change" from running to paused and later back to running.
This may have reset the shutdown timer.

The runtime between the last state change and "Job finished" was less than 27 h.
My example has been running for more than 36 h without a break.

2019-04-28 20:02:27 (25942): VM state change detected. (old = 'running', new = 'paused')
.
.
.
2019-04-28 20:22:55 (13231): VM state change detected. (old = 'poweroff', new = 'running')
.
.
.
2019-04-29 23:02:34 (13231): Guest Log: [INFO] Job finished in slot1 with 0.
ID: 38672
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38673 - Posted: 30 Apr 2019, 17:46:09 UTC - in response to Message 38672.  

If there is such an additional watchdog then it appears to be inconsistent.

Not necessarily.
Your example shows a "VM state change" from running to paused and later back to running.
This may have reset the shutdown timer.

I missed that line, and yes, it may have reset the timer if there is one (yours may have shut down for some reason other than a Condor-imposed limit).
ID: 38673
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38680 - Posted: 1 May 2019, 12:05:46 UTC

Again a longrunner where the last job failed after just a bit more than 36 h runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569
ID: 38680
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,899,033
RAC: 138,178
Message 38681 - Posted: 1 May 2019, 13:12:33 UTC

Another task that was stopped at the 36 h job limit:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221909367
ID: 38681
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38682 - Posted: 1 May 2019, 13:53:35 UTC - in response to Message 38680.  

Again a longrunner where the last job failed after just a bit more than 36 h runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569

OK, I'm convinced Condor imposes a 36.6 hour limit. If suspending the task causes Condor to reset the elapsed time counter, then it would seem a watchdog could extend a job beyond 36.6 hours by:
1) suspending the task when the current job reaches 36 hours
2) waiting until VirtualBox reports (via VBoxManage) that the VM has properly suspended
3) resuming the task
ID: 38682
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38694 - Posted: 4 May 2019, 15:06:51 UTC - in response to Message 38682.  

Again a longrunner where the last job failed after just a bit more than 36 h runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569

OK, I'm convinced Condor imposes a 36.6 hour limit. If suspending the task causes Condor to reset the elapsed time counter, then it would seem a watchdog could extend a job beyond 36.6 hours by:
1) suspending the task when the current job reaches 36 hours
2) waiting until VirtualBox reports (via VBoxManage) that the VM has properly suspended
3) resuming the task

I modified the above 3 steps into something easier to code and observe. The steps are now:
1) suspend the task at integer multiples of 35 hours of elapsed task time (so at 35, 70, 105, ...)
2) the user manually resumes the task

It seems to work. I have:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=222336270 with tet (task elapsed time) 2 days 9 hours crunching a Sherpa 2.2.4 with jet (job elapsed time) 50 hours
https://lhcathome.cern.ch/lhcathome/result.php?resultid=222334188 with tet 2 days 8 hours crunching a Sherpa 1.2.3 with jet 41 hours.
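A rough sketch of that suspend-at-35h workaround, assuming the stock `boinccmd` remote-control tool that ships with the BOINC client. This is hypothetical, untested code, not bronco's actual script: the function names and the project URL argument are placeholders, and whether suspending actually resets the timer is exactly what the experiment above is probing.

```python
# Hypothetical watchdog fragment for the workaround described above.
# Assumption: suspending the BOINC task pauses the VirtualBox VM and
# (possibly) resets Condor's shutdown timer. Not an official tool.
import subprocess

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"  # example project URL
SUSPEND_INTERVAL_S = 35 * 3600  # 35 h, as in the modified steps

def next_suspend_point(elapsed_s, interval_s=SUSPEND_INTERVAL_S):
    """Elapsed task time (seconds) at which the next suspend is due:
    the next integer multiple of 35 h after elapsed_s."""
    return (int(elapsed_s) // interval_s + 1) * interval_s

def suspend_task(task_name):
    """Suspend one task via boinccmd (the user resumes it manually)."""
    subprocess.run(["boinccmd", "--task", PROJECT_URL, task_name, "suspend"],
                   check=True)

def resume_task(task_name):
    subprocess.run(["boinccmd", "--task", PROJECT_URL, task_name, "resume"],
                   check=True)
```

`next_suspend_point(0)` is 35 h (126000 s), `next_suspend_point(126000)` is 70 h, and so on, matching the 35/70/105... schedule described above.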
ID: 38694
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38700 - Posted: 5 May 2019, 15:43:03 UTC - in response to Message 38694.  

You would think that, being the smart people that they are, they would eventually figure out that impossible-to-run tasks drive crunchers away.
Then, by removing those tasks, they could get more work done.

Just the fact that they would like to get them done does not automatically accomplish it.
ID: 38700
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,341,407
RAC: 101,788
Message 38701 - Posted: 5 May 2019, 15:50:09 UTC - in response to Message 38700.  

You would think that, being the smart people that they are, they would eventually figure out that impossible-to-run tasks drive crunchers away.
Then, by removing those tasks, they could get more work done.

Just the fact that they would like to get them done does not automatically accomplish it.

Jim, full agreement from my side!

But as we can see, the reality is unfortunately a different one :-(
ID: 38701
[AF>Le_Pommier] Jerome_C2005

Joined: 12 Jul 11
Posts: 93
Credit: 1,129,876
RAC: 7
Message 38752 - Posted: 9 May 2019, 10:04:49 UTC
Last modified: 9 May 2019, 10:06:29 UTC

Hi

I realized yesterday that I had a Native Theory task that had been running for the past 12 days!

18:31:11 2019-04-25: cranky-0.0.28: [INFO] Running Container 'runc'.
19:46:36 2019-04-26: cranky-0.0.28: [INFO] Pausing container TheoryN_2279-771890-48_0.
19:46:36 2019-04-26: cranky-0.0.28: [WARNING] Cannot pause container as /sys/fs/cgroup/freezer/boinc/freezer.state not exists.
20:22:43 2019-04-26: cranky-0.0.28: [INFO] Resuming container TheoryN_2279-771890-48_0.
container not paused


The status of the task in BoincTui was "dead" (I had never seen that before), but it was still eating a full CPU!

I killed it.

I hadn't read this thread, but I liked the title.
ID: 38752
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 39353 - Posted: 13 Jul 2019, 16:19:01 UTC

08:18:47 +0200 2019-07-13 [INFO] New Job Starting in slot1
08:18:47 +0200 2019-07-13 [INFO] Condor JobID: 503065.34 in slot1
08:18:52 +0200 2019-07-13 [INFO] MCPlots JobID: 50643654 in slot1

===> [runRivet] Sat Jul 13 08:18:47 CEST 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.2 default 2000 78]

Display update finished (0 histograms, 0 events).
2.47481e-12 pb +- ( 6.98515e-12 pb = 282.25 % ) 840000 ( 18224016 -> 8 % )
integration time: ( 7h 44m 28s elapsed / 24252d 19h 43m 44s left ) [17:59:12]
Updating display...

Euthanised humanely
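Log lines like the `integration time` one above lend themselves to automatic screening before manual euthanasia is needed. A hypothetical helper (not an official script; the regex is tailored to the Sherpa log format quoted above, and the 7-day threshold is an arbitrary choice) that flags a job as hopeless when the reported remaining integration time is absurd:

```python
# Sketch: detect hopeless Sherpa integrations from their log output.
# Matches lines like:
#   integration time: ( 7h 44m 28s elapsed / 24252d 19h 43m 44s left ) [17:59:12]
import re

LINE = re.compile(r"integration time:\s*\(.*?/\s*(?:(\d+)d\s+)?(\d+)h.*?left")

def days_left(log_line):
    """Remaining integration time in days, or None if the line doesn't match."""
    m = LINE.search(log_line)
    if not m:
        return None
    days = int(m.group(1) or 0)       # optional "NNNNd" part
    return days + int(m.group(2)) / 24.0

def is_hopeless(log_line, max_days=7):
    """True when Sherpa itself predicts more remaining time than max_days."""
    d = days_left(log_line)
    return d is not None and d > max_days
```

On the line quoted above this reports roughly 24252 days remaining, which any sane threshold would flag.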
ID: 39353
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 39354 - Posted: 13 Jul 2019, 17:38:03 UTC - in response to Message 39353.  

===> [runRivet] Sat Jul 13 08:18:47 CEST 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.2 default 2000 78]
.......
Euthanised humanely
That's not patient for the patient ;)

I dug into the recent MC Production and found 18000 successful events coming from 5 successful runs for that job description, out of 40 attempts: 4 attempts failed and 31 were lost.
ID: 39354
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 39459 - Posted: 29 Jul 2019, 19:39:17 UTC
Last modified: 29 Jul 2019, 19:44:53 UTC

===> [runRivet] Mon Jul 29 04:48:43 CEST 2019 [boinc pp jets 13000 250,-,4160 - sherpa 2.2.5 default 1000 81]

A few weeks ago, only 1 attempt out of 79 with the above job description succeeded.
In the last 2 weeks, 2 more jobs out of 6 new attempts succeeded, so even with

integration time: ( 9h 14m 7s elapsed / 21044d 22h 50m 51s left ) [18:23:04]

I will give it a try.
ID: 39459
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,902
RAC: 104,657
Message 39462 - Posted: 29 Jul 2019, 20:19:41 UTC

Hi Crystal,
this one finished today in -native
06:57:03 CEST +02:00 2019-07-29: cranky-0.0.29: [INFO] ===> [runRivet] Mon Jul 29 04:57:03 UTC 2019 [boinc pp jets 7000 80,-,1460 - sherpa 2.2.5 default 100000 84]
17:43:58 CEST +02:00 2019-07-29: cranky-0.0.29: [INFO] Container 'runc' finished with status code 0.
ID: 39462
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 39471 - Posted: 30 Jul 2019, 17:07:06 UTC - in response to Message 39459.  

===> [runRivet] Mon Jul 29 04:48:43 CEST 2019 [boinc pp jets 13000 250,-,4160 - sherpa 2.2.5 default 1000 81]

A few weeks ago, only 1 attempt out of 79 with the above job description succeeded.
In the last 2 weeks, 2 more jobs out of 6 new attempts succeeded, so even with

integration time: ( 9h 14m 7s elapsed / 21044d 22h 50m 51s left ) [18:23:04]

I will give it a try.
Unfortunately, the VM was shut down prematurely without my intervention, even though I had set the VM's lifetime to 10 days.
I don't know where that completion file came from so early; the job surely wasn't finished.

Run time 1 days 18 hours 40 min 28 sec
CPU time 1 days 15 hours 34 min 46 sec
ID: 39471


©2024 CERN