41) Message boards : Theory Application : Issues Native Theory application (Message 38671)
Posted 30 Apr 2019 by bronco
Post:
That way, if Native Theory hangs up, I can at least limit the number of cores affected (until bronco comes up with a fix).

What do you mean by "if Native Theory hangs up"? If you mean the problem where the task runs into the deadline and doesn't stop, the latest version of my watchdog handles that by aborting the task 1 hour before the deadline. It would be nice if there were a way to do a graceful shutdown, but native Theory doesn't have that facility. Aborting the task isn't really what most volunteers will regard as a solution, but it's better than letting the task run until the server cancels it (which the server doesn't seem to be doing ATM).
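For anyone curious, the deadline check itself boils down to a few lines of shell. This is only a sketch with placeholder names (SLOT, PROJECT_URL, TASK_NAME); the real script discovers those by scanning the slots directory:

# Sketch: abort a task 1 hour before its BOINC deadline.
# SLOT, PROJECT_URL and TASK_NAME are placeholders here.
SLOT=/var/lib/boinc-client/slots/2
DEADLINE=$(grep -oP '(?<=<computation_deadline>)[0-9.]+' "$SLOT/init_data.xml")
NOW=$(date +%s)
# computation_deadline is a Unix timestamp; strip the fractional part
if [ $(( ${DEADLINE%.*} - NOW )) -lt 3600 ]; then
    boinccmd --task "$PROJECT_URL" "$TASK_NAME" abort
fi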

The only other problem I have noticed with native Theory is tasks ending with the 195 EXIT_CHILD_FAILED error, which I don't understand and don't yet have a way to handle.

The 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED problem doesn't seem to affect native Theory, which is fortunate because I see no way for a watchdog running on the user's account to detect the condition.

So I think my watchdog does everything it can possibly do for both native and VBox Theory. Unless somebody has any further suggestions, I believe it's ready for beta test :)
42) Message boards : Theory Application : Theory's endless looping (Message 38670)
Posted 30 Apr 2019 by bronco
Post:
... I'm not yet sure if there is a #3. "Condor runtime limit".

This VM was shut down when the last job reached a runtime of a bit more than 36 h.
I wonder if there is an additional watchdog that normally doesn't become active as the 18 h limit is usually reached before.

Does anyone know?

See https://lhcathome.cern.ch/lhcathome/result.php?resultid=221836220
2019-04-29 23:02:03 (13231): Status Report: Job Duration: '864000.000000'
2019-04-29 23:02:03 (13231): Status Report: Elapsed Time: '151661.307370'
2019-04-29 23:02:03 (13231): Status Report: CPU Time: '145470.420000'
2019-04-29 23:02:34 (13231): Guest Log: [INFO] Job finished in slot1 with 0.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Condor exited with return value N/A.
2019-04-29 23:12:45 (13231): Guest Log: [INFO] Shutting Down.
2019-04-29 23:12:45 (13231): VM Completion File Detected.
2019-04-29 23:12:45 (13231): VM Completion Message: Condor exited with return value N/A.
.
2019-04-29 23:12:45 (13231): Powering off VM.
2019-04-29 23:12:46 (13231): Successfully stopped VM.
2019-04-29 23:12:46 (13231): Deregistering VM. (boinc_ccb51ef676d1e747, slot#2)
2019-04-29 23:12:46 (13231): Removing network bandwidth throttle group from VM.
2019-04-29 23:12:46 (13231): Removing storage controller(s) from VM.
2019-04-29 23:12:46 (13231): Removing VM from VirtualBox.
2019-04-29 23:12:46 (13231): Removing virtual disk drive from VirtualBox.
23:12:51 (13231): called boinc_finish(0)

If there is such an additional watchdog then it appears to be inconsistent. The task above ran for 42 hours and still got the expected "Job finished in slot" at 2019-04-29 23:02:34.
43) Message boards : Theory Application : Theory's endless looping (Message 38657)
Posted 27 Apr 2019 by bronco
Post:
The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed.


OK. But I think you meant to say "slot folders" rather than "slots folder", yes? In other words, monitor the size of the numbered slot folders inside the folder named "slots", not the size of the "slots" folder itself.

Never mind, I figured it out.
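For the record, the check I settled on is roughly this (a rough sketch with a hard-coded slot number, not the exact code):

# Sketch: compare one numbered slot folder's size against the
# rsc_disk_bound value from that slot's init_data.xml.
SLOT=/var/lib/boinc-client/slots/4
BOUND=$(grep -oP '(?<=<rsc_disk_bound>)[0-9]+' "$SLOT/init_data.xml")
USED=$(du -sb "$SLOT" | cut -f1)   # bytes used by this slot folder
echo "slot uses $USED of $BOUND bytes"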
44) Message boards : Theory Application : Theory's endless looping (Message 38656)
Posted 27 Apr 2019 by bronco
Post:
The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.

Right.

In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound>

This is the value (in bytes) the slots folder must not exceed.


OK. But I think you meant to say "slot folders" rather than "slots folder", yes? In other words, monitor the size of the numbered slot folders inside the folder named "slots", not the size of the "slots" folder itself.
45) Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come? (Message 38650)
Posted 27 Apr 2019 by bronco
Post:
the next one:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221744502
after almost 14 hours :-(((

Can someone from LHC explain the rationale behind submitting all these faulty sherpa tasks?


Someone named Peter Skands who claims to be from LHC explained it in this thread on April 3. It appears you're not buying his explanation. Why? Because he's an impostor and not really from LHC? Or maybe because he confirms what I said the rationale is, reproducibility, which was in turn based on what Massimo explained last year? Or maybe you think Massimo is an impostor too? Fake news, all of them?

Edit update:

OMG! I just got off the phone with Toddler Tyrant (Trump) and he confirms it.... they all graduated from a real university so they're all fake news. Who could have known?
46) Message boards : Theory Application : Theory's endless looping (Message 38649)
Posted 27 Apr 2019 by bronco
Post:
The ETA may not tell the truth.
Hence there's no real reason to shutdown those tasks.

If a babysitter service is active beside BOINC, a task should only be shut down shortly before it hits one of the hard limits, which are:
1. EXIT_DISK_LIMIT_EXCEEDED
2. BOINC due date reached

I'm not yet sure if there is a #3. "Condor runtime limit".
If so it would replace (2.).


The due date is easy. The script finds it in .../slots/x/init_data.xml as <computation_deadline>.
The disk limit is what I'm not sure about. In ../slots/4/init_data.xml I see <rsc_disk_bound>8000000000.000000</rsc_disk_bound> but VBoxManage -q showhdinfo "/var/lib/boinc-client/slots/4/vm_image.vdi" says "Capacity: 20480 MBytes"... 8 GB versus 20 GB. Which figure should I use?
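Edit: for reference, here is how I pull the two numbers side by side; the 20480 MB reported by showhdinfo is only the virtual disk's maximum capacity, not a limit BOINC enforces (a sketch, assuming the standard Linux paths):

# Sketch: show the BOINC disk bound and the VDI capacity for one slot.
SLOT=/var/lib/boinc-client/slots/4
grep -oP '(?<=<rsc_disk_bound>)[0-9.]+' "$SLOT/init_data.xml"
VBoxManage -q showhdinfo "$SLOT/vm_image.vdi" | grep -i capacity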
47) Message boards : Theory Application : Theory's endless looping (Message 38642)
Posted 26 Apr 2019 by bronco
Post:
Integration time left is 750 days and the log filesize is 2.2 MB, so I let the script gracefully terminate it.
48) Message boards : Theory Application : Theory's endless looping (Message 38641)
Posted 26 Apr 2019 by bronco
Post:
===> [runRivet] Thu Apr 25 22:33:11 MDT 2019 [boinc ee zhad 22 - - sherpa 1.3.0 default 1000 48]

An hour ago my watchdog script flagged this one for graceful shutdown due to integration time > 1 day. The script hasn't actually shut it down because it's running in "flag but don't shut down" mode. Now, an hour later, the integration time has increased to 90 days. Log filesize is 360 KB and it's processed 0 events.

The MC Production site has this report: ee zhad 22 - - sherpa 1.3.0 default events 0, attempts 26, success 0, failure 1, lost 25

Does anybody see any hope for this one? I don't, but I'm going to give it another 20 hours (task duration is set for 10 days) to see how large the filesize and integration time become. I'm wondering if the integration time will peak and then decrease to 0, at which point it might start processing events. ATM it looks like it's heading for a 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED shutdown.
49) Message boards : Theory Application : How extend Theory VBox tasks? (Message 38640)
Posted 25 Apr 2019 by bronco
Post:
Sometimes the VM fails to save correctly and reboots on restart, losing the running Job, but by checking that the Tasks Pause before shutting down Boinc, and then Save in VBox on shutdown, I usually find that the Job resumes on Boinc restart.


To tell the truth I didn't open VBox and confirm; I just gave it several minutes. Obviously I didn't give it enough time. Obviously then, for a watchdog to do a proper job it would have to confirm (via the VBoxManage utility and other means) each step of the procedure. Add to that the fact that other projects might be sensitive to how/when BOINC is restarted and it becomes a nightmare. Again, the simpler way to give looooong runners ample time is to just give all Theory VBox tasks a huge job_duration and then do a good job of detecting and killing loopers.
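If I ever do automate it, the per-step confirmation would look something like this (sketch only; the VM name is just an example taken from one of the logs):

# Sketch: confirm a VM actually reached the "saved" state before the
# script lets BOINC be restarted. VM name below is a placeholder.
VM="boinc_ccb51ef676d1e747"
STATE=$(VBoxManage showvminfo "$VM" --machinereadable | grep -oP '(?<=^VMState=")[a-z]+')
if [ "$STATE" != "saved" ]; then
    echo "VM $VM is still '$STATE', waiting..."
fi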

Good luck with your script.

Thanks for these discussions. They have helped me clarify what the script needs to do and how to do it.
50) Message boards : Theory Application : How extend Theory VBox tasks? (Message 38638)
Posted 25 Apr 2019 by bronco
Post:
The sherpa bronco wants to survive: pp jets 7000 150,-,1860 - sherpa 2.2.5 default - events done 758000 attempts 49 success 30 failure 1 lost 18
Where do you get the "events done 758000 attempts 49 success 30 failure 1 lost 18" data?
What does it mean? You mentioned it for some reason that I don't understand, please explain.
It's coming from the MC Production site.
In detail the list of all jobs of batch 2279: http://mcplots-dev.cern.ch/production.php?view=runs&rev=2279&display=all Huge page!
Because we mostly have problems with sherpa, I filter that page for sherpas.
When your running job has at least one success, there is the possibility that your job could finish OK (time unpredictable).

OK, that's what I thought you meant. Thanks for the info. So a watchdog script could get the job's history and use that history to decide whether to kill a long runner or allow it to continue. That might be a complicated decision to code but I'm gonna give it a try. If I can't make it work then the alternative (just let it proceed if it's not looping) is OK too.
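The history lookup will probably be the flaky part. Roughly what I have in mind (a very rough sketch; the grep assumes the mcplots page can be filtered as plain text, which still needs verifying, and the batch number is just the one linked above):

# Very rough sketch: fetch the batch page and count the rows that
# mention one job's parameter string.
URL="http://mcplots-dev.cern.ch/production.php?view=runs&rev=2279&display=all"
JOB="pp jets 7000 150,-,1860 - sherpa 2.2.5"
curl -s "$URL" | grep -c "$JOB"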

I've <job_duration>864000</job_duration>
10 days... that's not a typo? Actually I can see a good strategy developing out of that but I want to verify it's not a typo.

No, it's not a typo. Normally a Theory VBox task will end shortly after 12 hours runtime when the last job has finished.
The 18 hours is OK for killing error tasks, but not for possible successes.
Yes, overall efficiency of the system suffers due to the 18 hour limit.

While I have the time to watch the jobs (overnight my main PC is mostly shut down), I can investigate what's wrong: maybe a looper, or is it a long runner?
From my investigation I can make the decision to end the task gracefully or give the job a chance to be a success.

As I mentioned to Ray upthread, I think I have a watchdog script that will do a very good job of detecting and killing the loopers while allowing viable long runners to proceed. It needs a little more testing and polishing but it's nearly ready for release.
51) Message boards : Theory Application : How extend Theory VBox tasks? (Message 38635)
Posted 25 Apr 2019 by bronco
Post:
Hi Bronco (hidden original post as requested)
Don't know how to extend an individual Task but I have managed a global extension by changing the stock job_duration from 64800s to 72000s or even 90000s (it doesn't like 100000)

Suspend each individual Task
Watch them pause in VBox
Exit Boinc and watch the VMs save
Find and edit, in Notepad or similar, Program Data / Boinc / projects / lhcathome / Theory_2017_05_29 XML Document
Save (as xml if prompted)
Restart Boinc


That extends the task but unfortunately the job (sub-task) that was running disappears and gets replaced by a new job. I thought I was just confused the first time it happened, so I tried extending a second task and again the job that was running vanished, meaning it didn't just stop and show as a finished_xx.log along with an accompanying entry in the stdout.log. There was no mention of it in any of the finished_xx logs and even the stdout.log was deleted and restarted from 0 bytes. Oh well, not the first job I've messed up and it won't be the last. All part of the learning process :)

It's looking like the way to reduce to near zero the number of viable (not looping) jobs that bump up against the 18 hour limit is to set <job_duration> very high (e.g. 10 days). But then there needs to be a good watchdog script that detects loopers and shuts down the task gracefully to prevent the dreaded 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED error. I think maybe I have that script and I think it's nearly ready for release.
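For a Linux host, the manual procedure Ray described upthread boils down to something like this. Sketch only: the project directory and file name are my best guess at what "Theory_2017_05_29 XML Document" maps to and must be checked on your own machine, and note Ray's warning that values with more digits than the original (e.g. 100000) get rejected:

# Sketch: bump job_duration on a Linux host. Suspend the Theory tasks
# and confirm the VMs saved first (per Ray's steps), then stop the client
# so the edit isn't overwritten. 90000 keeps the same digit count as 64800,
# so the file size doesn't change.
sudo systemctl stop boinc-client
sudo sed -i 's#<job_duration>64800</job_duration>#<job_duration>90000</job_duration>#' \
    /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome/Theory_2017_05_29.xml
sudo systemctl start boinc-client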
52) Message boards : Theory Application : How extend Theory VBox tasks? (Message 38628)
Posted 25 Apr 2019 by bronco
Post:
Thanks, Ray and Crystal, for the hints. Now I think programmatically extending individual tasks is more trouble than it's worth. Fortunately there seems to be a better way, maybe. More on that later; first a few questions.

The sherpa bronco wants to survive: pp jets 7000 150,-,1860 - sherpa 2.2.5 default - events done 758000 attempts 49 success 30 failure 1 lost 18

Where do you get the "events done 758000 attempts 49 success 30 failure 1 lost 18" data?
What does it mean? You mentioned it for some reason that I don't understand, please explain.


I've <job_duration>864000</job_duration>

10 days... that's not a typo? Actually I can see a good strategy developing out of that but I want to verify it's not a typo.

Not liking 100000 is probably because the filesize is checked by BOINC.

I get it. Changing the value from 64800 (a 5-digit string) to 100000 (a 6-digit string) adds 1 byte to the file size, which makes BOINC think the file has been tampered with.
53) Message boards : Theory Application : How extend Theory VBox tasks? (Message 38625)
Posted 24 Apr 2019 by bronco
Post:
Reposting because I put the original in an inappropriate thread, moderator please delete https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979&postid=38624#38624
Task elapsed time: 17:35:00
Job elapsed time: ~6 hours
Log size: 49K and increasing slowly
The full optimization time left is decreasing
There are no "trigger phrases" (e.g. param out of bounds) that indicate the job is not viable, so my watchdog script has suspended this one to allow me to manually extend the task duration beyond the 18 hour limit. The plan for the future is to have the script repeatedly auto-extend the duration until the log either shows "trigger phrases", the log size exceeds an arbitrary maximum, or the task deadline is near, at which point the script shuts down the task gracefully.

===> [runRivet] Wed Apr 24 03:33:38 MDT 2019 [boinc pp jets 7000 150,-,1860 - sherpa 2.2.5 default 24000 48]
.
.
.
5.33737e-09 pb +- ( 3.08232e-10 pb = 5.77497 % ) 140000 ( 4513183 -> 4.2 % )
full optimization:  ( 1h 10m 2s elapsed / 1h 32m 34s left ) [05:54:18]   
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
5.33599e-09 pb +- ( 2.94039e-10 pb = 5.51048 % ) 150000 ( 4735937 -> 4.4 % )
full optimization:  ( 1h 14m 51s elapsed / 1h 27m 19s left ) [05:59:15]   
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).


Problem is, I don't know how to extend the duration manually, never mind programmatically. I have successfully extended 2 tasks but 3 other attempts failed. So how does one extend the task duration?
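While I work that out, the auto-extend decision logic I have in mind is roughly this (sketch only; SLOT, the trigger phrase list and the log size cap are all placeholders):

# Sketch: decide what to do with one slot's job.
# SLOT is a placeholder for the slot directory.
LOG="$SLOT/cernvm/shared/runRivet.log"
MAX_LOG_BYTES=$((50*1024*1024))                 # arbitrary cap, placeholder
if grep -q 'param out of bounds' "$LOG"; then   # trigger phrase list is illustrative
    echo "trigger phrase found -> shut down gracefully"
elif [ "$(stat -c%s "$LOG")" -gt "$MAX_LOG_BYTES" ]; then
    echo "log too large -> shut down gracefully"
else
    echo "looks viable -> extend the duration / leave it running"
fi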
54) Message boards : Theory Application : Theory's endless looping (Message 38542)
Posted 6 Apr 2019 by bronco
Post:
Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55
Added 1 success to pp jets 8000 350 - sherpa 1.4.3 default

Cool! If you can extend the 18 hour limit manually then I can do it in my watchdog script too. I wish I had thought of this solution for Theory VBox.
55) Message boards : Theory Application : Theory's endless looping (Message 38496)
Posted 30 Mar 2019 by bronco
Post:
Unlike Theory (vbox) Theory native has no hard 18h limit.
What remains is the due date set by the BOINC server.
This can't be set to infinite to catch other issues like non-responding hosts.
The task will continue on the host but will be treated as invalid when the client reports it.

Ray Murray wrote:
For the huge logfile ones, could they be set to terminate the JOB on reaching the limit rather killing the TASK?

+1
It would require a watchdog inside the VM that uploads the logs (maybe not the extra large ones) and initiates a graceful shutdown.


CP posted some numbers in the Theory nativ thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979&postid=38451
It shows that most of the work should be able to finish within the standard time limits.

Nonetheless those failing task are annoying.


Best would of course be to analyse the reason for longrunners and huge logs.

In a post from almost a year ago https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4028&postid=35142#35142, Peter Skands suggests that a number of bugs in older versions of sherpa have been fixed, but the older versions have not been deprecated and are still being used occasionally, for reasons that make sense to me. Hopefully when native Theory transitions from this test phase to production, those older versions will be dropped. A watchdog running inside the container/VM could easily manage problems born of newer versions. A watchdog like mine runs outside the container/VM, on the volunteer's account, so it's limited in what it can do.
56) Message boards : Theory Application : Theory's endless looping (Message 38493)
Posted 30 Mar 2019 by bronco
Post:
I have seen some long ones, and have not been quite sure whether to abort them or not.
(How do I check?)

If yours is a standard BOINC for Linux install then the task's running log is at /var/lib/boinc-client/slots/<slot number>/cernvm/shared/runRivet.log. You can open runRivet.log in a text editor if you want to view the entire log, or you can use Linux's tail command to view the last x lines of the log once. Combine the tail and watch commands to view the tail end repeatedly every x seconds.
Interpreting what you see in the log is not easy, but if you follow the clues posted by Crystal Pellet and others in this thread you'll eventually get the hang of it. After the novelty wears off you'll long for a script or app to do the work for you, as it is tedious and monotonous, and if you don't stay on task 24/7/365, loopers and long runners will slip by you.
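For example, on a stock install something like this keeps the tail of the log on screen (adjust the slot number to match the task you're watching):

# Refresh the last 20 lines of the running job's log every 10 seconds.
watch -n 10 tail -n 20 /var/lib/boinc-client/slots/2/cernvm/shared/runRivet.log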

I've been developing my watchdog script for over a year now in order to automate both the data gathering and the analysis. Trust me when I tell you there is no cut-and-dried, sure-fire way to determine if the job is viable. I have found that sometimes a job that shows every indication of being either a looper, or a long runner progressing so slowly it cannot possibly complete in the time remaining, will suddenly stop looping and proceed, or else speed up and complete. It's a judgement call based on more or less arbitrary criteria, as much an art as it is a science.

The nice thing about Theory VBox is that you (or a watchdog script) can shut down a looper or long runner gracefully so it still earns credits even if it does no useful work. Native Theory and native ATLAS tasks cannot be shut down gracefully. They can only be aborted (which means no credits), so the temptation is to use more forgiving criteria and maybe let them run a little longer (in for a penny, in for a pound, as they say), but the question becomes: how long is the average volunteer prepared to let it run when it might run for several days and yield 0 credits? So there is the temptation to go NIMBY on it and just abort all sherpa jobs immediately. I don't like that approach but the option is there in my watchdog script anyway.

Another approach is to abort sherpas on the basis of the number of events the job is configured to process (the target events number). I adjusted my script to do that, but as Crystal pointed out there is no correlation between the target events number and the job's likelihood of failure. So I think I'll be removing that part of the code.

I think the sherpa version number is a better (say more reliable, though again not 100% reliable) criterion for estimating the chance of success, so I'm in the process of adding that mechanism to my script. The script has grown from a relatively easy-to-program/modify text-mode app into a GUI app so users don't have to struggle with config files, command line options, etc. The problem is GUIs are much harder to program and debug. Recently I've taken to the notion that it should work across a LAN... a single GUI client that communicates with servers running on individual hosts, similar to the BOINC manager and client. BIG job for me.
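Pulling the generator, version and target events out of the log header is the easy part. A sketch of that piece (field positions are inferred from the header lines quoted in these threads and may not hold for every job type):

# Sketch: parse generator, version and target events from the
# "===> [runRivet] ... [boinc ...]" header line.
# LOG points at the slot's runRivet.log (placeholder).
HEADER=$(grep -m1 '===> \[runRivet\]' "$LOG" | sed 's/.*\[boinc //; s/\]$//')
GEN=$(echo "$HEADER" | awk '{print $6}')
VER=$(echo "$HEADER" | awk '{print $7}')
EVENTS=$(echo "$HEADER" | awk '{print $9}')
echo "generator=$GEN version=$VER target_events=$EVENTS"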
57) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 38466)
Posted 27 Mar 2019 by bronco
Post:
2 days 5 hours runtime - yes, we need some time for these "small" tasks.
Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events?
There is no one-to-one relationship between the number of events and the run time. I've noticed that the combination of beam "ee" and "sherpa" rather often fails.
If you want to avoid sherpa long-runners with your watchdog, just abort all sherpas at the beginning. It's up to you. It's your time, your machine and your electricity.

Someone else will do the job.

And lots of those someone elses will run into the task deadline. What will be the response to their complaints... "you may deselect native Theory and select Theory VBox instead"?

Or maybe allow the user to select the limit.
This is impossible. You cannot choose the kind of Theory generator, let alone the number of events.

I know we cannot choose the generator and number of events. I know those parameters are set by the server and they are immutable.
I meant that when the task starts, the script compares the job's target events to the user's selected limit, and if the target events exceed the user's limit then the script aborts the task. It does not try to substitute the user's events limit for the target events value sent by the server.
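In script terms the comparison is nothing more than this (sketch; USER_LIMIT comes from the watchdog's own config, TARGET_EVENTS from the log header, and the task identifiers from the slot's files):

# Sketch: abort only when the job's own target exceeds the user's limit.
if [ "$TARGET_EVENTS" -gt "$USER_LIMIT" ]; then
    boinccmd --task "$PROJECT_URL" "$TASK_NAME" abort
fi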
58) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 38464)
Posted 27 Mar 2019 by bronco
Post:
The native version gives you the opportunity to check the app's logfile and see if there is any progress.
tail -fn500 /your_local_boinc_dir/slots/x/cernvm/shared/runRivet.log

That's manual work most volunteers won't want to do, especially if they run native Theory on more than 1 or 2 hosts. Hence the need for a watchdog script to automate the chore of checking for progress.
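The sort of thing a script can automate is, for instance, recording each slot's log size on every pass so stalled or ballooning logs stand out (sketch; the history file path is a placeholder):

# Sketch: one pass over all slots, appending a timestamped size for each
# runRivet.log so later passes can flag logs that stopped growing or ballooned.
for LOG in /var/lib/boinc-client/slots/*/cernvm/shared/runRivet.log; do
    [ -f "$LOG" ] || continue
    echo "$(date +%s) $LOG $(stat -c%s "$LOG")"
done >> /tmp/theory_watchdog.sizes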
59) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 38453)
Posted 26 Mar 2019 by bronco
Post:
Bronco,
stop this writing from your sight about Sherpa. You stay alone with this theory!!

I don't think you really understand my theory. Until you understand you should stop telling me when/how to post. If that doesn't suit you then take a long hard suck on my ass.
60) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 38450)
Posted 26 Mar 2019 by bronco
Post:
2 days 5 hours runtime - yes, we need some time for these "small" tasks.

Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events? Or maybe allow the user to select the limit.

