Message boards : Theory Application : Theory's endless looping
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
Never mind, I figured it out. The due date is easy: the script finds it in .../slots/x/init_data.xml as <computation_deadline>.
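For anyone who wants to script the same lookup, here is a minimal sketch, assuming a typical Linux slot path (the path below is a hypothetical example) and that the tag holds the usual Unix-epoch timestamp:

```python
# Minimal sketch: read the task deadline from a BOINC slot's init_data.xml.
# The slot path is a hypothetical example; adjust it to your own setup.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SLOT_INIT = "/var/lib/boinc-client/slots/0/init_data.xml"  # hypothetical path

root = ET.parse(SLOT_INIT).getroot()
deadline = float(root.findtext("computation_deadline"))  # seconds since the Unix epoch

print("deadline:", datetime.fromtimestamp(deadline, tz=timezone.utc).isoformat())
```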
Joined: 15 Jun 08 · Posts: 2628 · Credit: 267,218,564 · RAC: 128,995
... I'm not yet sure if there is a #3, "Condor runtime limit". This VM was shut down when the last job reached a runtime of a bit more than 36 h. I wonder if there is an additional watchdog that normally doesn't become active because the 18 h limit is usually reached first. Does anyone know?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=221858696

2019-04-28 17:42:59 (98673): Guest Log: [INFO] New Job Starting in slot1
2019-04-28 17:42:59 (98673): Guest Log: [INFO] Condor JobID: 495583.12 in slot1
2019-04-28 17:43:04 (98673): Guest Log: [INFO] MCPlots JobID: 49683949 in slot1
2019-04-28 17:44:23 Volunteer's Extension: Running Job: ===> [runRivet] Sun Apr 28 17:42:55 CEST 2019 [boinc ee zhad 91.2 - - sherpa 2.2.5 default 3000 48]
. . .
2019-04-30 04:28:01 (98673): Status Report: Job Duration: '300000.000000'
2019-04-30 04:28:01 (98673): Status Report: Elapsed Time: '132045.354213'
2019-04-30 04:28:01 (98673): Status Report: CPU Time: '91213.620000'
<edit>Expected a "Job finished in slot..." here but that's missing</edit>
2019-04-30 05:55:30 (98673): Guest Log: [INFO] Condor exited with return value N/A.
2019-04-30 05:55:30 (98673): Guest Log: [INFO] Shutting Down.
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> ... I'm not yet sure if there is a #3, "Condor runtime limit".

See https://lhcathome.cern.ch/lhcathome/result.php?resultid=221836220

2019-04-29 23:02:03 (13231): Status Report: Job Duration: '864000.000000'
2019-04-29 23:02:03 (13231): Status Report: Elapsed Time: '151661.307370'
2019-04-29 23:02:03 (13231): Status Report: CPU Time: '145470.420000'
2019-04-29 23:02:34 (13231): Guest Log: [INFO] Job finished in slot1 with 0.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Condor exited with return value N/A.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Shutting Down.
2019-04-29 23:12:45 (13231): VM Completion File Detected.
2019-04-29 23:12:45 (13231): VM Completion Message: Condor exited with return value N/A.
2019-04-29 23:12:45 (13231): Powering off VM.
2019-04-29 23:12:46 (13231): Successfully stopped VM.
2019-04-29 23:12:46 (13231): Deregistering VM. (boinc_ccb51ef676d1e747, slot#2)
2019-04-29 23:12:46 (13231): Removing network bandwidth throttle group from VM.
2019-04-29 23:12:46 (13231): Removing storage controller(s) from VM.
2019-04-29 23:12:46 (13231): Removing VM from VirtualBox.
2019-04-29 23:12:46 (13231): Removing virtual disk drive from VirtualBox.
23:12:51 (13231): called boinc_finish(0)

If there is such an additional watchdog then it appears to be inconsistent. The above ran for 42 hours and got the expected "Job finished in slot" at 2019-04-29 23:02:34.
Joined: 15 Jun 08 · Posts: 2628 · Credit: 267,218,564 · RAC: 128,995
> If there is such an additional watchdog then it appears to be inconsistent.

Not necessarily. Your example shows a "VM state change" from running to paused and later back to running. This may have reset the shutdown timer. The runtime between the last state change and "Job finished" was less than 27 h. My example had been running for more than 36 h without a break.

2019-04-28 20:02:27 (25942): VM state change detected. (old = 'running', new = 'paused')
. . .
2019-04-28 20:22:55 (13231): VM state change detected. (old = 'poweroff', new = 'running')
. . .
2019-04-29 23:02:34 (13231): Guest Log: [INFO] Job finished in slot1 with 0.
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> If there is such an additional watchdog then it appears to be inconsistent.

I missed that line, and yes, it may have reset the timer if there is one (yours may have shut down for some reason other than a Condor-imposed limit).
Joined: 15 Jun 08 · Posts: 2628 · Credit: 267,218,564 · RAC: 128,995
Again a longrunner where the last job failed after just a bit more than 36 h runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569
Joined: 15 Jun 08 · Posts: 2628 · Credit: 267,218,564 · RAC: 128,995
Another task that was stopped at the 36 h job limit:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221909367
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> Again a longrunner where the last job failed after just a bit more than 36 h runtime:

OK, I'm convinced Condor imposes a 36.6 hour limit. If suspending the task causes Condor to reset the elapsed time counter, then it would seem a watchdog could extend the job beyond 36.6 hours by:

1) suspending the task when the current job reaches 36 hours
2) waiting until VirtualBox reports (via VBoxManage) that the VM has properly suspended
3) resuming the task
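A minimal sketch of those three steps, assuming boinccmd and VBoxManage are on the PATH; the project URL matches this thread and the VM name is the one from the log above, but the task name is a hypothetical placeholder. This is an untested outline, not a finished watchdog:

```python
# Sketch of the suspend -> wait -> resume idea; placeholders are marked.
import subprocess
import time

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"
TASK_NAME = "Theory_2379-123456-78_0"   # hypothetical task name
VM_NAME = "boinc_ccb51ef676d1e747"      # VM name as it appears in the BOINC log

def vm_state(name):
    # 'VBoxManage showvminfo --machinereadable' prints lines like: VMState="paused"
    out = subprocess.run(
        ["VBoxManage", "showvminfo", name, "--machinereadable"],
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith("VMState="):
            return line.split("=", 1)[1].strip('"')
    return "unknown"

# 1) suspend the task when the current job reaches 36 hours
subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_NAME, "suspend"])

# 2) wait until VirtualBox reports the VM has properly paused or saved its state
while vm_state(VM_NAME) not in ("paused", "saved"):
    time.sleep(10)

# 3) resume the task
subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_NAME, "resume"])
```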
Joined: 13 Apr 18 · Posts: 443 · Credit: 8,438,885 · RAC: 0
> Again a longrunner where the last job failed after just a bit more than 36 h runtime:

I modified the above 3 steps into something easier to code and observe. The steps are now:

1) suspend the task at integer multiples of 35 hours of elapsed task time (so 35, 70, 105...)
2) the user manually resumes the task

It seems to work. I have:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=222336270 with tet (task elapsed time) 2 days 9 hours, crunching a Sherpa 2.2.4 with jet (job elapsed time) 50 hours
https://lhcathome.cern.ch/lhcathome/result.php?resultid=222334188 with tet 2 days 8 hours, crunching a Sherpa 1.2.3 with jet 41 hours
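Step 1 could be automated with a periodic check; a rough sketch, assuming the "elapsed task time" field (in seconds) that boinccmd --get_tasks prints for running tasks, with the task name again a hypothetical placeholder:

```python
# Rough sketch of step 1: suspend a task in the first poll window after each
# 35 h multiple of elapsed time. Meant to run periodically (e.g. from cron
# every 10 minutes); resuming is left to the user, as in the post.
import subprocess

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"
TASK_NAME = "Theory_2379-123456-78_0"   # hypothetical task name
INTERVAL = 35 * 3600                    # 35 hours in seconds
POLL = 600                              # matches a 10-minute cron schedule

out = subprocess.run(["boinccmd", "--get_tasks"],
                     capture_output=True, text=True).stdout

elapsed, in_task = None, False
for line in out.splitlines():
    line = line.strip()
    if line.startswith("name:"):
        in_task = line.split(":", 1)[1].strip() == TASK_NAME
    elif in_task and line.startswith("elapsed task time:"):
        elapsed = float(line.split(":", 1)[1])
        break

# Fires once shortly after each 35 h multiple (35, 70, 105...).
if elapsed is not None and elapsed % INTERVAL < POLL:
    subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_NAME, "suspend"])
```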
Joined: 15 Nov 14 · Posts: 602 · Credit: 24,371,321 · RAC: 0
You would think that, being the smart people that they are, they would eventually figure out that impossible-to-run tasks drive crunchers away. Then, by removing those tasks, they could get more work done. Just the fact that they would like to get them done does not automatically accomplish it.
Joined: 18 Dec 15 · Posts: 1862 · Credit: 130,811,655 · RAC: 102,426
> You would think that, being the smart people that they are, they would eventually figure out that impossible-to-run tasks drive crunchers away.

Jim, full agreement from my side!!! But, as we can see, the reality is unfortunately a different one :-(
Joined: 12 Jul 11 · Posts: 100 · Credit: 1,162,161 · RAC: 1,479
Hi,

I realized yesterday that I had a Native Theory task that had been running for the past 12 days!

18:31:11 2019-04-25: cranky-0.0.28: [INFO] Running Container 'runc'.

The status of the task in BoincTui was "dead" (I had never seen that before), but it was eating a full CPU! I killed it.

I didn't read that thread, but I liked the title.
Joined: 29 Sep 04 · Posts: 281 · Credit: 11,866,264 · RAC: 0
08:18:47 +0200 2019-07-13 [INFO] New Job Starting in slot1
08:18:47 +0200 2019-07-13 [INFO] Condor JobID: 503065.34 in slot1
08:18:52 +0200 2019-07-13 [INFO] MCPlots JobID: 50643654 in slot1
===> [runRivet] Sat Jul 13 08:18:47 CEST 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.2 default 2000 78]

Display update finished (0 histograms, 0 events).
2.47481e-12 pb +- ( 6.98515e-12 pb = 282.25 % ) 840000 ( 18224016 -> 8 % )
integration time: ( 7h 44m 28s elapsed / 24252d 19h 43m 44s left )
[17:59:12] Updating display...

Euthanised humanely
Joined: 14 Jan 10 · Posts: 1446 · Credit: 9,708,961 · RAC: 766
> ===> [runRivet] Sat Jul 13 08:18:47 CEST 2019 [boinc pp jets 7000 150,-,2160 - sherpa 2.2.2 default 2000 78]

That's not patient for the patient ;) I dug into the recent MC Production and found 18000 successful events coming from 5 successful runs for that job description, out of 40 attempts. 4 attempts failed and 31 were lost.
Joined: 14 Jan 10 · Posts: 1446 · Credit: 9,708,961 · RAC: 766
> ===> [runRivet] Mon Jul 29 04:48:43 CEST 2019 [boinc pp jets 13000 250,-,4160 - sherpa 2.2.5 default 1000 81]

A few weeks ago only 1 attempt out of 79 with the above job description succeeded. In the last 2 weeks, 2 more jobs out of 6 new attempts succeeded, so even with

integration time: ( 9h 14m 7s elapsed / 21044d 22h 50m 51s left ) [18:23:04]

I will give it a try.
Joined: 2 May 07 · Posts: 2262 · Credit: 175,581,097 · RAC: 326
Hi Crystal, this one finished today in -native:

06:57:03 CEST +02:00 2019-07-29: cranky-0.0.29: [INFO] ===> [runRivet] Mon Jul 29 04:57:03 UTC 2019 [boinc pp jets 7000 80,-,1460 - sherpa 2.2.5 default 100000 84]
17:43:58 CEST +02:00 2019-07-29: cranky-0.0.29: [INFO] Container 'runc' finished with status code 0.
Joined: 14 Jan 10 · Posts: 1446 · Credit: 9,708,961 · RAC: 766
> ===> [runRivet] Mon Jul 29 04:48:43 CEST 2019 [boinc pp jets 13000 250,-,4160 - sherpa 2.2.5 default 1000 81]

Unfortunately the VM was closed prematurely without my intervention, even though I had set the VM's lifetime to 10 days. I don't know where that completion file came from so early; the job surely wasn't finished.

Run time: 1 days 18 hours 40 min 28 sec
CPU time: 1 days 15 hours 34 min 46 sec