Thread 'How extend Theory VBox tasks?'

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 38625 - Posted: 24 Apr 2019, 19:23:29 UTC
Reposting because I put the original in an inappropriate thread; moderator, please delete https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979&postid=38624#38624
Task elapsed time: 17:35:00
Job elapsed time: ~6 hours
Log size: 49K and increasing slowly
The full optimization time left is decreasing
There are no "trigger phrases" (e.g. param out of bounds) that indicate the job is not viable, so my watchdog script has suspended this one to allow me to manually extend the task duration beyond the 18 hour limit. The plan for the future is to have the script repeatedly auto-extend the duration until the log shows "trigger phrases", the log size exceeds an arbitrary maximum, or the task deadline is near, at which point the script shuts down the task gracefully.
===> [runRivet] Wed Apr 24 03:33:38 MDT 2019 [boinc pp jets 7000 150,-,1860 - sherpa 2.2.5 default 24000 48]
. . .
5.33737e-09 pb +- ( 3.08232e-10 pb = 5.77497 % ) 140000 ( 4513183 -> 4.2 % )
full optimization: ( 1h 10m 2s elapsed / 1h 32m 34s left ) [05:54:18]
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
5.33599e-09 pb +- ( 2.94039e-10 pb = 5.51048 % ) 150000 ( 4735937 -> 4.4 % )
full optimization: ( 1h 14m 51s elapsed / 1h 27m 19s left ) [05:59:15]
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Problem is I don't know how to extend the duration manually, never mind programmatically. I have successfully extended 2 tasks but 3 other attempts failed. So how does one extend the task duration?
ID: 38625
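
The plan in the post above (keep extending until a trigger phrase appears, the log grows too large, or the deadline is near) could be sketched roughly as follows. This is only an illustration in Python: the function names, the log size limit and the deadline margin are hypothetical, and the only trigger phrase listed is the one example given in the post.

```python
import re

# Hypothetical values: the post names one example phrase and no thresholds.
TRIGGER_PHRASES = ("param out of bounds",)
MAX_LOG_BYTES = 10 * 1024 * 1024   # arbitrary maximum log size
DEADLINE_MARGIN_S = 6 * 3600       # what counts as "deadline is near"

# Matches the "... elapsed / 1h 32m 34s left" part of a Sherpa progress line.
LEFT_RE = re.compile(r"elapsed\s*/\s*(?:(\d+)h\s*)?(?:(\d+)m\s*)?(\d+)s\s+left")


def seconds_left(log_text):
    """Return every 'time left' value found in the log, in seconds, in order."""
    values = []
    for hours, minutes, seconds in LEFT_RE.findall(log_text):
        values.append(int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds))
    return values


def decide(log_text, log_bytes, seconds_to_deadline):
    """Return 'shutdown' or 'extend' according to the three stop conditions."""
    lowered = log_text.lower()
    if any(phrase in lowered for phrase in TRIGGER_PHRASES):
        return "shutdown"
    if log_bytes > MAX_LOG_BYTES:
        return "shutdown"
    if seconds_to_deadline < DEADLINE_MARGIN_S:
        return "shutdown"
    return "extend"
```

Comparing successive values from `seconds_left` is one way to check the "time left is decreasing" observation; how the task is actually extended or shut down is not covered here, since that is the open question of the thread.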

Ray Murray Volunteer moderator Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 38626 - Posted: 24 Apr 2019, 19:49:47 UTC - in response to Message 38625. Last modified: 24 Apr 2019, 19:55:06 UTC
Hi Bronco (hidden original post as requested)
Don't know how to extend an individual Task but I have managed a global extension by changing the stock job_duration from 64800s to 72000s or even 90000s (it doesn't like 100000)
Suspend each individual Task
Watch them pause in VBox
Exit Boinc and watch the VMs save
Find and edit, in Notepad or similar, Program Data / Boinc / projects / lhcathome / Theory_2017_05_29 XML Document
Save (as xml if prompted)
Restart Boinc
Just looked and see you are all Linux. Don't know where you would find the corresponding file but this method works in Windows so hopefully points you in the right direction.
ID: 38626
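
The edit step in the procedure above could be done with a small helper instead of Notepad. The following is a minimal Python sketch, assuming the file contains exactly one `<job_duration>` element holding a whole number of seconds; the function name is hypothetical and nothing else about the file's layout is assumed. It works on the text rather than re-serialising the XML, so every other byte of the file stays as it was.

```python
import re

# Captures the opening and closing tags so only the number between them changes.
JOB_DURATION_RE = re.compile(r"(<job_duration>)\s*\d+\s*(</job_duration>)")


def set_job_duration(xml_text, seconds):
    """Return xml_text with the <job_duration> value replaced by `seconds`."""
    new_text, count = JOB_DURATION_RE.subn(
        lambda m: f"{m.group(1)}{int(seconds)}{m.group(2)}", xml_text
    )
    if count != 1:
        raise ValueError(f"expected one <job_duration> element, found {count}")
    return new_text
```

The suspend, exit, save and restart steps around the edit still apply; the helper only replaces the manual text change.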

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 38627 - Posted: 24 Apr 2019, 20:02:28 UTC - in response to Message 38626.
Hi Bronco (hidden original post as requested) Don't know how to extend an individual Task but I have managed a global extension by changing the stock job_duration from 64800s to 72000s or even 90000s (it doesn't like 100000)
Suddenly the post I was responding to was hidden ;) So no response there.
The sherpa bronco wants to survive: pp jets 7000 150,-,1860 - sherpa 2.2.5 default - events done 758000 attemps 49 success 30 failure 1 lost 18
I've <job_duration>864000</job_duration>
Not liking 100000 is probably because the filesize is checked by BOINC. Suppressing that can be done by adding an option line in cc_config.xml
<dont_check_file_sizes>1</dont_check_file_sizes>
ID: 38627
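
For anyone who prefers to script the cc_config.xml change, here is a rough Python sketch. It assumes the option belongs inside the `<options>` section of `<cc_config>`, as the BOINC client configuration documentation describes; the post itself only says "an option line in cc_config.xml". The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def enable_dont_check_file_sizes(cc_config_text):
    """Return cc_config.xml text with <dont_check_file_sizes> set to 1."""
    root = ET.fromstring(cc_config_text)
    if root.tag != "cc_config":
        raise ValueError("not a cc_config document")
    # Create the <options> section if the file does not have one yet.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    # Reuse an existing flag element so the option is never duplicated.
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")
```

The client normally has to re-read its config files (or be restarted) before a change like this takes effect.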

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 38628 - Posted: 25 Apr 2019, 0:40:38 UTC - in response to Message 38627.
Thanks, Ray and Crystal, for the hints. Now I think programmatically extending individual tasks is more trouble than it's worth. Fortunately there seems to be a better way, maybe. More on that later, first a few questions.
The sherpa bronco wants to survive: pp jets 7000 150,-,1860 - sherpa 2.2.5 default - events done 758000 attemps 49 success 30 failure 1 lost 18
Where do you get the "events done 758000 attemps 49 success 30 failure 1 lost 18" data? What does it mean? You mentioned it for some reason that I don't understand, please explain.
I've <job_duration>864000</job_duration>
10 days... that's not a typo? Actually I can see a good strategy developing out of that but I want to verify it's not a typo.
Not liking 100000 is probably because the filesize is checked by BOINC.
I get it. Changing the value from 64800 (a 5 digit string) to 100000 (a 6 digit string) adds 1 byte to the file size, which makes BOINC think the file has been tampered with.
ID: 38628
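
The arithmetic in the post above is easy to check. A small Python sketch (the helper names are made up for illustration): 864000 seconds is exactly 10 days, the stock 64800 seconds is 18 hours, and replacing a 5 digit value with a 6 digit one lengthens the file by one byte, whereas 72000 and 90000 leave its length unchanged.

```python
def split_duration(seconds):
    """Split a duration in seconds into whole days and remaining hours."""
    days, remainder = divmod(seconds, 86400)
    return days, remainder / 3600


def size_delta(old_value, new_value):
    """Bytes added to a text file when one decimal value replaces another."""
    return len(str(new_value)) - len(str(old_value))


for value in (64800, 72000, 90000, 100000, 864000):
    days, hours = split_duration(value)
    delta = size_delta(64800, value)
    print(f"{value:>6} s = {days} d {hours:g} h, size change {delta:+d} byte(s)")
```

That the unchanged length is why 72000 and 90000 were accepted is bronco's reading of Crystal's hint, not something confirmed in the thread.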

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 38634 - Posted: 25 Apr 2019, 8:16:12 UTC - in response to Message 38628. Last modified: 25 Apr 2019, 8:21:17 UTC
The sherpa bronco wants to survive: pp jets 7000 150,-,1860 - sherpa 2.2.5 default - events done 758000 attemps 49 success 30 failure 1 lost 18
Where do you get the "events done 758000 attemps 49 success 30 failure 1 lost 18" data? What does it mean? You mentioned it for some reason that I don't understand, please explain.
It's coming from the MC Production site. In detail the list of all jobs of batch 2279: http://mcplots-dev.cern.ch/production.php?view=runs&rev=2279&display=all Huge page! Cause we mostly have problems with sherpa, I filter that page for sherpa's. When your running job has at least one success there is the possibility that your job could finish OK (time unpredictable).
I've <job_duration>864000</job_duration>
10 days... that's not a typo? Actually I can see a good strategy developing out of that but I want to verify it's not a typo.
No, it's not a typo. Normally a Theory VBox task will end shortly after 12 hours runtime when the last job has finished. The 18 hours is OK for killing error tasks, but not for possible successes. While I've the time to watch the jobs (overnight my main PC mostly is shutdown), I can investigate what's wrong (maybe a looper) or is it a long runner? From my investigation I can make the decision to end the task gracefully or give the job a chance to be a success.
ID: 38634
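
The summary string quoted above is regular enough to parse. A minimal Python sketch, assuming the text has exactly the form quoted in this thread (the real production page may be laid out differently, and the function names are hypothetical). It also encodes the rule of thumb from the post: at least one earlier success means the job could still finish OK.

```python
import re

# Accepts both the spelling quoted in the thread ("attemps") and "attempts".
SUMMARY_RE = re.compile(
    r"events done (\d+) attemp?t?s (\d+) success (\d+) failure (\d+) lost (\d+)"
)
FIELDS = ("events", "attempts", "success", "failure", "lost")


def parse_summary(text):
    """Return the five counters as a dict, or None if the text does not match."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return dict(zip(FIELDS, (int(g) for g in match.groups())))


def might_finish(history):
    """Rule of thumb from the post: any earlier success means it may finish."""
    return history["success"] >= 1


def success_rate(history):
    """Fraction of attempts that succeeded (0.0 when nothing was attempted)."""
    if history["attempts"] == 0:
        return 0.0
    return history["success"] / history["attempts"]
```

For the job discussed here the counters add up (30 + 1 + 18 = 49 attempts), and the success rate is 30 out of 49.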

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 38635 - Posted: 25 Apr 2019, 9:42:48 UTC - in response to Message 38626.
Hi Bronco (hidden original post as requested) Don't know how to extend an individual Task but I have managed a global extension by changing the stock job_duration from 64800s to 72000s or even 90000s (it doesn't like 100000) Suspend each individual Task Watch them pause in VBox Exit Boinc and watch the VMs save Find and edit, in Notepad or similar, Program Data / Boinc / projects / lhcathome / Theory_2017_05_29 XML Document Save (as xml if prompted) Restart Boinc
That extends the task but unfortunately the job (sub-task) that was running disappears and gets replaced by a new job. I thought I was just confused the first time it happened, so I tried extending a second task and again the job that was running vanished, meaning it didn't just stop and show as a finished_xx.log along with an accompanying entry in the stdout.log. There was no mention of it in any of the finished_xx logs and even the stdout.log was deleted and restarted from 0 bytes. Oh well, not the first job I've messed up and it won't be the last. All part of the learning process :)
It's looking like the way to reduce to near zero the number of viable (not looping) jobs that bump up against the 18 hour limit is to set <job_duration> very high (e.g. 10 days). But then there needs to be a good watchdog script that detects loopers and shuts down the task gracefully to prevent the dreaded 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED error. I think maybe I have that script and I think it's near ready for release.
ID: 38635

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 38638 - Posted: 25 Apr 2019, 10:10:15 UTC - in response to Message 38634.
The sherpa bronco wants to survive: pp jets 7000 150,-,1860 - sherpa 2.2.5 default - events done 758000 attemps 49 success 30 failure 1 lost 18
Where do you get the "events done 758000 attemps 49 success 30 failure 1 lost 18" data? What does it mean? You mentioned it for some reason that I don't understand, please explain.
It's coming from the MC Production site. In detail the list of all jobs of batch 2279: http://mcplots-dev.cern.ch/production.php?view=runs&rev=2279&display=all Huge page! Cause we mostly have problems with sherpa, I filter that page for sherpa's. When your running job has at least one success there is the possibility that your job could finish OK (time unpredictable).
OK, that's what I thought you meant. Thanks for the info. So a watchdog script could get the job's history and use that history to decide whether to kill a long runner or allow it to continue. That might be a complicated decision to code but I'm gonna give it a try. If I can't make it work then the alternative (just let it proceed if it's not looping) is OK too.
I've <job_duration>864000</job_duration>
10 days... that's not a typo? Actually I can see a good strategy developing out of that but I want to verify it's not a typo.
No, it's not a typo. Normally a Theory VBox task will end shortly after 12 hours runtime when the last job has finished. The 18 hours is OK for killing error tasks, but not for possible successes.
Yes, overall efficiency of the system suffers due to the 18 hour limit.
While I've the time to watch the jobs (overnight my main PC mostly is shutdown), I can investigate what's wrong (maybe a looper) or is it a long runner? From my investigation I can make the decision to end the task gracefully or give the job a chance to be a success.
As I mentioned to Ray upthread, I think I have a watchdog script that will do a very good job of detecting and killing the loopers while allowing viable long runners to proceed. It needs a little more testing and polishing but it's near ready for release.
ID: 38638

Ray Murray Volunteer moderator Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 38639 - Posted: 25 Apr 2019, 10:19:58 UTC - in response to Message 38635.
Sometimes the VM fails to save correctly and reboots on restart, losing the running Job, but by checking that the Tasks Pause before shutting down Boinc, and then Save in VBox on shutdown, I usually find that the Job resumes on Boinc restart.
Good luck with your script.
ID: 38639

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 38640 - Posted: 25 Apr 2019, 12:15:08 UTC - in response to Message 38639.
Sometimes the VM fails to save correctly and reboots on restart, losing the running Job but by checking that the Tasks Pause before shutting down Boinc, and then Save in VBox on shutdown, I usually find that the Job resumes on Boinc restart.
To tell the truth I didn't open VBox and confirm, I just gave it several minutes. Obviously I didn't give it enough time. Obviously then, for a watchdog to do a proper job it would have to confirm (via the VBoxManage utility and other means) each step of the procedure. Add to that the fact that other projects might be sensitive to how/when BOINC is restarted and it becomes a nightmare. Again, the simpler way to give looooong runners ample time is to just give all Theory VBox tasks a huge job_duration, then do a good job of detecting and killing loopers.
Good luck with your script.
Thanks for these discussions. They have helped me clarify what the script needs to do and how to do it.
ID: 38640
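
Confirming each step via VBoxManage, as mentioned above, could look roughly like this in Python. The sketch assumes `VBoxManage showvminfo <vm> --machinereadable` prints a line of the form `VMState="..."` with values such as "running", "paused" or "saved"; the function names, the polling interval and the number of attempts are hypothetical.

```python
import re
import subprocess
import time

# The anchored pattern skips similar keys such as VMStateChangeTime.
VMSTATE_RE = re.compile(r'^VMState="([^"]+)"', re.MULTILINE)


def parse_vm_state(showvminfo_output):
    """Extract the VMState value from machine-readable showvminfo output."""
    match = VMSTATE_RE.search(showvminfo_output)
    return match.group(1) if match else None


def vm_state(vm_name):
    """Ask VBoxManage for the current state of one VM (needs VirtualBox)."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_state(result.stdout)


def wait_until(get_state, wanted, attempts=60, poll_s=5, pause=time.sleep):
    """Poll get_state() until it returns `wanted`; False if it never does."""
    for _ in range(attempts):
        if get_state() == wanted:
            return True
        pause(poll_s)
    return False
```

A watchdog could call `wait_until(lambda: vm_state(name), "paused")` after suspending a task and `wait_until(lambda: vm_state(name), "saved")` after exiting BOINC, rather than waiting a fixed number of minutes.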

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 38644 - Posted: 26 Apr 2019, 15:20:47 UTC - in response to Message 38634.
Where do you get the "events done 758000 attemps 49 success 30 failure 1 lost 18" data? What does it mean? You mentioned it for some reason that I don't understand, please explain.
It's coming from the MC Production site. In detail the list of all jobs of batch 2279: http://mcplots-dev.cern.ch/production.php?view=runs&rev=2279&display=all Huge page! Cause we mostly have problems with sherpa, I filter that page for sherpa's.
Just looking at that by eye, the distribution is clearly bimodal: jobs (here, code and parameter combinations) almost always either succeed (>80% success) or fail dismally (<5% success). I don't know how they are submitted behind the scenes, but to me it looks as though it would be better to submit the first, say, 12 sub-jobs and only continue if at least 8 are a success.
ID: 38644
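
The pilot-batch idea in the post above can be written down in a few lines. A minimal Python sketch with the numbers from the post (12 pilot sub-jobs, continue if at least 8 succeed); the function name is hypothetical and nothing here reflects how the project actually submits jobs.

```python
def continue_after_pilot(results, pilot_size=12, min_successes=8):
    """Decide whether to keep submitting after the pilot batch.

    `results` holds one boolean per finished sub-job, in submission order.
    Returns None while the pilot batch is still incomplete.
    """
    pilot = list(results)[:pilot_size]
    if len(pilot) < pilot_size:
        return None
    return sum(pilot) >= min_successes
```

With a bimodal success distribution like the one described, a combination that succeeds more than 80% of the time would be very likely to pass this check, and one that succeeds less than 5% of the time very unlikely to.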

m Joined: 6 Sep 08 Posts: 119 Credit: 15,002,862 RAC: 5,398	Message 38653 - Posted: 27 Apr 2019, 12:47:35 UTC - in response to Message 38628. Last modified: 27 Apr 2019, 13:42:54 UTC
Where do you get the "events done 758000 attemps 49 success 30 failure 1 lost 18" data?
You can get info like this...
From one of the MC pages, such as the status page or the one that shows the results for your user id (you may already have a shortcut to this), http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=xxxxxx click "Control".
Then, on the highlighted line, click the entry in the "coverage" column.
The "Runs summary" lets you pick the results category you want.
I don't know why results are "Masked"; maybe it's not as simple as it looks, but the rest seem self-explanatory. Try "Unsuccessful".
I think that all these are run on BOINC, maybe not all by us volunteers. I'm sure someone from the project will explain further.
ID: 38653