Message boards : Theory Application : Theory's endless looping

bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35518 - Posted: 13 Jun 2018, 19:50:08 UTC - in response to Message 35142.  

Hi all,

Thanks as always for your dedication and patience. As Crystal mentions, the looping issue only affects a few Sherpa jobs, and as I think I've stated elsewhere, we still want to run those Sherpa versions to be able to display comparisons with them - as long as they are being used actively by researchers in the community. In the major new update of the jobs we are planning to start sending out shortly, some of the oldest Sherpa versions that we don't deem are being used any more will be deprecated, and those also appear to be the most frequent loopers.


Thanks for your feedback and info regarding upcoming plans for Sherpa. Here is an update regarding icebergs, the monitoring script I've been working on and the data it has gathered.

Of icebergs it is frequently said that we see only 10% because 90% is submerged. I was thinking (and maybe other volunteers as well) that *maybe* we see only 10% of the Sherpa problem. As mere mortals we cannot watch Sherpa 24/7/365 and so the solution is a script that can do what we cannot do. It took some time to work the kinks out of the script and test its reliability but I am confident the data below is a 100% accurate reflection of what has been happening on my system. It goes without saying the data is a very small sample of the superset.

So far the script has tracked 1398 jobs as follows:
  Generator          Started          Failed
  =========          =======          ======
  epos                  13              0
  herwig++              99              0
  herwig++powheg         5              0
  phojet                 6              0
  pythia              1232              0
  sherpa                43              6

            Totals:   1398              6


So of the 1398 jobs tracked, only 3% were Sherpa jobs, only 14% of those Sherpas failed, and only 0.4% of all jobs failed. The obvious question is, "How does the script define 'failed'?" I will attempt to answer that question, but you need to know a little about the algorithm to understand the answer. Remember, the script gathers all of its observations and makes its decisions based on the content of the running.log; that is all it has to work with. From that log it seems the jobs have 2 phases: the first phase is initialization, the second phase is event generation. After watching jobs for weeks it became apparent that Sherpa jobs can hang in the first phase as well as the second. The only way I can think of to deal with the lack of verbosity in the log during the first phase is to put an arbitrary time limit on it. 3 of the 6 failed Sherpa jobs failed simply because they exceeded that time limit. For the data presented above the time limit was 4 hours and the host was this host. Maybe 4 hours is too strict? Maybe too lenient? I don't know; as I said, it's just an arbitrary number. Maybe there should not be a time limit on phase 1 at all, but then is there a more sensible way to trap a hang in phase 1?

Phase 2 is easier to treat. The algorithm fails (terminates gracefully) a job for either of 2 reasons:

Phase 2 fail mode 1 is based on the "estimated time to completion" that some (not all) Sherpa tasks provide in phase 2. Maybe I am wrong, but my tests and observations indicate the estimate is quite accurate. If the estimate is greater than the BOINC task's remaining time plus 1 hour, the job fails; thus it gets 1 hour of grace.

Phase 2 fail mode 2 is based on what I refer to as DUF lines. DUF stands for "display update finished". DUF lines are of the form
Display update finished (x histograms, y events)

The job fails if the last 40 DUFs are exactly the same, with no increases to x or y. In other words, if the algorithm counts 39 identical DUFs but the next DUF is different, the count reverts to 0.
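For the curious, here is a minimal sketch in Python of how those three checks (the phase 1 time limit, the phase 2 ETA comparison and the DUF repeat count) could be expressed. It is an illustration only, not the actual watchdog script: the function names, the way phase 2 is detected, and the caller-supplied elapsed/remaining times are all assumptions on my part.

import re

DUF_RE = re.compile(r"Display update finished \((\d+) histograms, (\d+) events\)")
ETA_RE = re.compile(r"integration time:\s*\(.*?/\s*(.*?)\s*left\s*\)")

def parse_duration(text):
    # Convert a Sherpa-style duration such as '632d 1h 25m' or '1m 15s' into seconds.
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(v) * units[u] for v, u in re.findall(r"(\d+)\s*([dhms])", text))

def check_job(log_lines, phase1_elapsed_s, task_remaining_s,
              phase1_limit_s=4 * 3600, grace_s=3600, duf_limit=40):
    # Return a reason string if the job should be failed (terminated gracefully), else None.
    # The thresholds are the arbitrary ones described above.

    # Assumption: the job is in phase 2 once ETA or DUF lines appear in the log.
    in_phase2 = any(ETA_RE.search(l) or DUF_RE.search(l) for l in log_lines)

    # Check 1: phase 1 (initialization) has exceeded the arbitrary time limit.
    if not in_phase2 and phase1_elapsed_s > phase1_limit_s:
        return "phase 1 exceeded the time limit"

    # Check 2: Sherpa's own estimate exceeds the BOINC task's remaining time plus grace.
    for line in reversed(log_lines):
        m = ETA_RE.search(line)
        if m:
            if parse_duration(m.group(1)) > task_remaining_s + grace_s:
                return "estimated time to completion exceeds remaining task time"
            break

    # Check 3: the last 40 DUF lines are identical, i.e. x and y never increase.
    dufs = []
    for line in log_lines:
        m = DUF_RE.search(line)
        if m:
            dufs.append(m.groups())
    if len(dufs) >= duf_limit and len(set(dufs[-duf_limit:])) == 1:
        return "last %d display updates show no progress" % duf_limit

    return None

Check 2 keeps the 1 hour of grace described above, and check 3 resets automatically because any DUF with different numbers breaks the run of identical entries.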

I cannot promise that there will not still be *some* looping jobs happening in the newer versions. At least a partial consolation is that, as far as I know, you are at least still getting credits for them, even though I totally understand that it is frustrating that your CPU is basically idling during those jobs and not contributing to science.


Others might disagree, but in my opinion the current situation is not as bad as it might seem. Even with a very stringent, maybe even "paranoid", definition of what constitutes a failed job, as my script uses, the failure rate is less than 1% of all jobs. That's very acceptable IMHO.

I keep being impressed and amazed at what the volunteer community is capable of; the development of a script that some of you guys have been talking about, that automatically detects the looping run condition and gracefully shuts down the job, is extremely nice. Although we don't have a lot of manpower currently for upgrading the run software, I will try hard to see if we can incorporate such a trigger in our default job setups, so that this could be done automatically. This is really great work!

All the best,
Peter


Thanks for acknowledging my efforts but do realize I am retired and looking for projects to pass the time. It's not like I work a regular job and then do this in my limited spare time.

So what will become of my script? Well, the data gathered by the script and presented above suggests to me there is little need for such a script. If I see some requests I'll make it available, free and open source of course. Maybe it can grow into something more useful; I am open to suggestions.

BTW, it started out as a simple Python script running in a terminal, no GUI, log to a disk file, etc. Over time it has evolved into a GUI app of 1400 lines in Python and wxPython.
ID: 35518 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2402
Credit: 225,585,384
RAC: 121,146
Message 35519 - Posted: 13 Jun 2018, 20:10:35 UTC - in response to Message 35518.  

Very nice!

Failure rate < 1% is indeed not very much.
It would be interesting to know whether the generator histogram is specific to your host, to a distinct type of host, or whether it reflects the general distribution.
ID: 35519 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35540 - Posted: 17 Jun 2018, 1:05:17 UTC - in response to Message 35519.  

Very nice!

Failure rate < 1% is indeed not very much.


Yes, about 0.4%, but it might even be lower than that. Keep in mind that in addition to failing a job for looping in phase 2, my script also fails jobs if they:

1) appear to be taking too long in phase 1,
OR
2) have a time-to-completion estimate in phase 2 that exceeds the remaining task time plus 1 hour.

I didn't report this in my previous post, but the script has an option to simply report (rather than terminate) a job that exceeds the phase 1 time limit. With that option turned on I found that some jobs that exceeded the time limit continued on to normal completion. It seems reasonable to assume, then, that some of the failures I reported in my last post were not genuine failures, that with a little more time in phase 1 they might have completed, and that the real failure rate would be even less than 0.4%.

The same argument applies to point 2 above. The script gives a job 1 hour of grace in phase 2 in case the job's estimate of time to completion is wrong, but maybe with 2 hours of grace the job might complete. That's the problem with imposing arbitrary limits when one is only looking in from the outside.

It would be interesting to know whether the generator histogram is specific to your host, to a distinct type of host, or whether it reflects the general distribution.


That would be interesting. In the VM console for every Theory task I see "Running the fast benchmark" followed by "Machine performance x HEPSEC06", which suggests they are testing the host's compute power. Perhaps it's just to make sure the host meets a minimum requirement, or perhaps the result is used to direct longer jobs to faster machines. If the latter, they might be directing Sherpa jobs to faster hosts, because it's fairly evident to me that Sherpa jobs take MUCH longer than Pythia jobs.
ID: 35540 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2090
Credit: 158,958,761
RAC: 124,814
Message 35541 - Posted: 17 Jun 2018, 2:06:53 UTC
Last modified: 17 Jun 2018, 2:07:36 UTC

On the lhcathome page, under:
Jobs -> Theory Jobs -> MC Production Control -> Revision coverage (with the current date/time)
coverage -> unsuccessful

there is a statistic showing failed pythia, sherpa, herwig++ and other Theory jobs in real time across all users.
It is not useful to post the link here, because the Control/Revision name in the LHC system changes constantly.
ID: 35541 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 37145 - Posted: 31 Oct 2018, 21:45:24 UTC

Another looping Sherpa!

===> [runRivet] Wed Oct 31 09:33:20 CET 2018 [boinc ppbar uemb-soft 53 - - sherpa 2.1.1 default 1000 636]
.
.
.
7.62561e+08 pb +- ( 2.56383e+06 pb = 0.336214 % ) 280000 ( 584382 -> 48.5 % )
integration time:  ( 11m 39s elapsed / 1m 15s left ) [10:01:22]   
Updating display...
Display update finished (0 histograms, 0 events).
7.62766e+08 pb +- ( 2.46376e+06 pb = 0.323004 % ) 300000 ( 625437 -> 48.6 % )
integration time:  ( 12m 32s elapsed / 25s left ) [10:02:18]   
7.62401e+08 pb +- ( 2.41357e+06 pb = 0.316575 % ) 310000 ( 646032 -> 48.6 % )
integration time:  ( 12m 59s elapsed / 0s left ) [10:02:46]   
2_2__j__j__j__j : 7.62401e+08 pb +- ( 2.41357e+06 pb = 0.316575 % )  exp. eff: 0.328321 %
  reduce max for 2_2__j__j__j__j to 0.62967 ( eps = 0.001 ) 
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative       : Signal_Processes
Perturbative       : Hard_Decays
Perturbative       : Jet_Evolution:CSS
Perturbative       : Lepton_FS_QED_Corrections:Photons
Perturbative       : Multiple_Interactions:Amisic
Perturbative       : Minimum_Bias:Off
Hadronization      : Beam_Remnants
Hadronization      : Hadronization:Ahadic
Hadronization      : Hadron_Decays
Analysis           : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...

etc etc etc
ID: 37145 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 37421 - Posted: 23 Nov 2018, 20:37:22 UTC
Last modified: 23 Nov 2018, 20:47:10 UTC

Not looping (yet). If not endless, then running for a very long time:
===> [runRivet] Fri Nov 23 17:46:52 CET 2018 [boinc ee zhad 197 - - sherpa 2.2.5 default 26000 2]
.
.
.
Starting the calculation at 17:47:38. Lean back and enjoy ... .
.
.
.
Starting the calculation at 17:47:59. Lean back and enjoy ... .
.
.
.
Starting the calculation at 17:50:51. Lean back and enjoy ... .
.
.
.
Starting the calculation at 18:12:57. Lean back and enjoy ... .
.
.
.
Starting the calculation at 18:13:11. Lean back and enjoy ... .
.
.
.
Starting the calculation at 18:24:20. Lean back and enjoy ... .
.
.
.

After that, the time left keeps increasing and the end time shifts further into the future.
Last lines before a graceful shutdown:
Updating display...
Display update finished (0 histograms, 0 events).
3.0231e+13 pb +- ( 3.0231e+13 pb = 100 % ) 10360000 ( 10360002 -> 99.9 % )
integration time:  ( 1h 29m 10s elapsed / 632d 1h 25m left ) [21:19:55]   
3.01727e+13 pb +- ( 3.01727e+13 pb = 100 % ) 10380000 ( 10380002 -> 99.9 % )
integration time:  ( 1h 29m 21s elapsed / 633d 9h 13m 38s left ) [21:20:08]   
3.01147e+13 pb +- ( 3.01147e+13 pb = 100 % ) 10400000 ( 10400002 -> 99.9 % )
integration time:  ( 1h 29m 30s elapsed / 634d 10h 36m 6s left ) [21:20:17]   
3.00569e+13 pb +- ( 3.00569e+13 pb = 100 % ) 10420000 ( 10420002 -> 99.9 % )
integration time:  ( 1h 29m 40s elapsed / 635d 14h 16m left ) [21:20:28]   
2.99993e+13 pb +- ( 2.99993e+13 pb = 100 % ) 10440000 ( 10440002 -> 99.9 % )
integration time:  ( 1h 29m 50s elapsed / 636d 18h 8m 5s left ) [21:20:39]   
Updating display...
Display update finished (0 histograms, 0 events).
2.99419e+13 pb +- ( 2.99419e+13 pb = 100 % ) 10460000 ( 10460002 -> 99.9 % )
integration time:  ( 1h 30m 2s elapsed / 638d 2h 57m 32s left ) [21:20:51]   
2.98848e+13 pb +- ( 2.98848e+13 pb = 100 % ) 10480000 ( 10480002 -> 99.9 % )
integration time:  ( 1h 30m 13s elapsed / 639d 9h 50m 27s left ) [21:21:03]   
2.98279e+13 pb +- ( 2.98279e+13 pb = 100 % ) 10500000 ( 10500002 -> 99.9 % )
integration time:  ( 1h 30m 25s elapsed / 640d 19h 21m 33s left ) [21:21:16]
ID: 37421 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 37430 - Posted: 24 Nov 2018, 11:51:41 UTC

Another monster Sherpa:

===> [runRivet] Sat Nov 24 08:29:45 CET 2018 [boinc pp jets 8000 350 - sherpa 1.4.5 default 33000 4]
.
.
.
Event 900 ( 1h 37m 31s elapsed / 2d 9h 58m 21s left ) -> ETA: Mon Nov 26 22:36
ID: 37430 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 37446 - Posted: 28 Nov 2018, 10:24:37 UTC
Last modified: 29 Mar 2019, 18:25:01 UTC

Another sherpa almost endless (increasing time left):

===> [runRivet] Wed Nov 28 07:35:09 CET 2018 [boinc ee zhad 91.2 - - sherpa 1.4.1 default 100000 2]
.
.
.
Updating display...
Display update finished (0 histograms, 0 events).
1.82691e+18 pb +- ( 1.70885e+18 pb = 93.5376 % ) 42050000 ( 42050099 -> 99.9 % )
integration time: ( 3h 3m 4s elapsed / 1077d 10h 44m left )
1.82647e+18 pb +- ( 1.70844e+18 pb = 93.5376 % ) 42060000 ( 42060099 -> 99.9 % )
integration time: ( 3h 3m 6s elapsed / 1077d 15h 36m 36s left )
1.82604e+18 pb +- ( 1.70803e+18 pb = 93.5376 % ) 42070000 ( 42070099 -> 99.9 % )
integration time: ( 3h 3m 8s elapsed / 1077d 20h 26m 22s left )
1.8256e+18 pb +- ( 1.70763e+18 pb = 93.5376 % ) 42080000 ( 42080099 -> 99.9 % )
integration time: ( 3h 3m 10s elapsed / 1078d 1h 31m 46s left )
ID: 37446 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 37590 - Posted: 11 Dec 2018, 8:03:34 UTC

Another sherpa with increasing days left. Processing jumps from 'full optimization' to 'integration'. Beam = ee

===> [runRivet] Mon Dec 10 15:38:08 CET 2018 [boinc ee zhad 133 - - sherpa 1.3.0 default 100000 0]
.
.
full optimization:  ( 48s elapsed / 26s left )   
546.759 pb +- ( 148.265 pb = 27.1171 % ) 210000 ( 210000 -> 100 % )
full optimization:  ( 50s elapsed / 24s left )   
543.655 pb +- ( 145.086 pb = 26.6871 % ) 220000 ( 220000 -> 100 % )
full optimization:  ( 53s elapsed / 21s left )   
539.299 pb +- ( 140.03 pb = 25.9651 % ) 240000 ( 240000 -> 100 % )
integration time:  ( 58s elapsed / 3h 32m 55s left )   
534.016 pb +- ( 135.076 pb = 25.2943 % ) 260000 ( 260000 -> 100 % )
integration time:  ( 1m 3s elapsed / 10h 18m 21s left )   
530.843 pb +- ( 131.425 pb = 24.7578 % ) 280000 ( 280000 -> 100 % )
integration time:  ( 1m 8s elapsed / 16h 3m 22s left )   
534.5 pb +- ( 130.827 pb = 24.4764 % ) 300000 ( 300000 -> 100 % )
integration time:  ( 1m 13s elapsed / 21h 26m 58s left )   
2.2056e+15 pb +- ( 2.2056e+15 pb = 100 % ) 310000 ( 310000 -> 100 % )
integration time:  ( 1m 16s elapsed / 2d 8h 52m 51s left )   
2.13208e+15 pb +- ( 2.13208e+15 pb = 100 % ) 320000 ( 320000 -> 100 % )
integration time:  ( 1m 18s elapsed / 2d 16h 3m 13s left )   
.
.
.
5.44822e+17 pb +- ( 1.27306e+17 pb = 23.3665 % ) 139480000 ( 139480249 -> 100 % )
integration time:  ( 8h 10m 11s elapsed / 184d 14h 57m 5s left )
ID: 37590 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 38479 - Posted: 28 Mar 2019, 20:28:07 UTC

===> [runRivet] Thu Mar 28 19:32:28 CET 2019 [boinc ee zhad 43.6 - - sherpa 1.3.0 default 2000 32]
..
..
..
3.40571e+14 pb +- ( 3.40303e+14 pb = 99.9215 % ) 25180000 ( 25180001 -> 100 % )
integration time:  ( 1h 32m 26s elapsed / 605d 14h 43m 32s left )   
3.40298e+14 pb +- ( 3.40031e+14 pb = 99.9215 % ) 25200000 ( 25200001 -> 100 % )
integration time:  ( 1h 32m 30s elapsed / 606d 1h 45m 27s left )   
3.40026e+14 pb +- ( 3.39759e+14 pb = 99.9215 % ) 25220000 ( 25220001 -> 100 % )
integration time:  ( 1h 32m 34s elapsed / 606d 12h 34m 39s left )   
3.39754e+14 pb +- ( 3.39487e+14 pb = 99.9215 % ) 25240000 ( 25240001 -> 100 % )
integration time:  ( 1h 32m 38s elapsed / 606d 23h 57m 12s left )
ID: 38479 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2090
Credit: 158,958,761
RAC: 124,814
Message 38485 - Posted: 29 Mar 2019, 10:06:12 UTC - in response to Message 38479.  

Crystal,
FYI: 238 hours so far. runRivet.log is 2.3 MByte.
===> [runRivet] Tue Mar 19 11:36:56 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.4.3 default 1000 32]
ID: 38485 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 38488 - Posted: 29 Mar 2019, 17:27:40 UTC - in response to Message 38485.  

An hour ago I started a similar one:
===> [runRivet] Fri Mar 29 16:23:55 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.4.5 default 1000 36]
When I only see

Updating display...
Display update finished (0 histograms, 0 events).


on and on, I'll abort the task after a while (a day?)
ID: 38488 · Report as offensive     Reply Quote
Profile Ray Murray
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 38489 - Posted: 29 Mar 2019, 17:53:39 UTC
Last modified: 29 Mar 2019, 19:03:45 UTC

With newer versions of Sherpa available (2.2.5 is the highest I've seen), is there any benefit in continuing to release 1.4s and even 1.2s when it is clear, especially combined with an ee beam, that there is a problem somewhere? Solving such problems is beyond me, but it's frustrating that looping and monster log files have been reported many times, yet we still get the same combinations coming out. Would those jobs fare better with the newer Sherpa, or is there something about the combination that means they just don't play well together and therefore shouldn't be so combined?

[Edit]
Reading further up the thread, I see that Peter answered this in May '18, but with the logs being lost when such jobs fail, time out or are reset, there is no info returned to be analysed to discover WHY they fail and therefore prompt a solution. When I have caught one, and when I can be bothered, I have reported the job number and parameters line in the hope that those would be useful, but I suspect there just isn't the staff available to act on those reports and investigate further.
The incorporation of Bronco's looper-catcher script, while not fixing the underlying issues, might mitigate some of the wasted cycles, increasing the throughput of useful work returned.
Perhaps something similar could be written to intercept those that exceed a set "time left" limit, eliminating those that claim to want to run for years, again reducing wastage.
For the huge-logfile ones, could they be set to terminate the JOB on reaching the limit rather than killing the TASK?
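A "time left" limit like the one suggested above could probably be handled the same way Bronco's script handles its other checks: scan the log for the most recent "... left" estimate and trigger a graceful shutdown once it passes an absolute cap. A hypothetical sketch, assuming a one-week cap (both the cap and the parsing are illustrative, not an existing feature):

import re

LEFT_RE = re.compile(r"/\s*((?:\d+d\s*)?(?:\d+h\s*)?(?:\d+m\s*)?(?:\d+s)?)\s*left")

def seconds(text):
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(v) * units[u] for v, u in re.findall(r"(\d+)([dhms])", text))

def exceeds_time_left_cap(log_lines, cap_s=7 * 86400):
    # True if the most recent 'left' estimate in the log claims more than cap_s seconds,
    # e.g. the '632d 1h 25m left' lines quoted earlier in this thread.
    for line in reversed(log_lines):
        m = LEFT_RE.search(line)
        if m and m.group(1).strip():
            return seconds(m.group(1)) > cap_s
    return False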
ID: 38489 · Report as offensive     Reply Quote
Jim1348

Send message
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38490 - Posted: 29 Mar 2019, 19:00:51 UTC - in response to Message 38489.  

Does Native Theory avoid this problem? I have seen some long ones, and have not been quite sure whether to abort them or not.
(How do I check?)
ID: 38490 · Report as offensive     Reply Quote
Profile Ray Murray
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 38491 - Posted: 29 Mar 2019, 19:18:04 UTC - in response to Message 38490.  
Last modified: 29 Mar 2019, 19:39:35 UTC

I can't run Native myself, it being Linux whereas all my hosts are Windows, but I think Crystal Pellet has encountered a few.
The potential problem there is that whereas in an ordinary VM the Task would time out after 18-20 hours, in Native the timeout seems to be >2 days, which is fine if all is well and it just happens to be a long job, but not so fine if it is a looper doing nothing useful.

I used to run Climate models that would take several months to complete but it was always obvious that the model was progressing and returning data in intermediate uploads. I would have no objection to running Tasks of similar duration here if I could be sure that it wasn't wasted effort.

[Aside]
When we ran the "Christmas Challenges" outside of Boinc, with no timeouts, did we only get Pythia jobs as I don't recall any issues of loopers (or indeed issues of any kind).
ID: 38491 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2402
Credit: 225,585,384
RAC: 121,146
Message 38492 - Posted: 29 Mar 2019, 19:29:45 UTC

Unlike Theory (vbox), Theory native has no hard 18h limit.
What remains is the due date set by the BOINC server.
This can't be set to infinite, since it also has to catch other issues like non-responding hosts.
The task will continue on the host but will be treated as invalid when the client reports it.

Ray Murray wrote:
For the huge-logfile ones, could they be set to terminate the JOB on reaching the limit rather than killing the TASK?

+1
It would require a watchdog inside the VM that uploads the logs (maybe not the extra large ones) and initiates a graceful shutdown.


CP posted some numbers in the Theory native thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979&postid=38451
They show that most of the work should be able to finish within the standard time limits.

Nonetheless, those failing tasks are annoying.


The best approach would of course be to analyse the reason for long runners and huge logs.
ID: 38492 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38493 - Posted: 30 Mar 2019, 1:53:23 UTC - in response to Message 38490.  

I have seen some long ones, and have not been quite sure whether to abort them or not.
(How do I check?)

If yours is a standard BOINC for Linux install, the task's running log is at /var/lib/boinc-client/<slot>/cernvm/shared/runRivet.log. You can open runRivet.log in a text editor if you want to view the entire log, or you can use Linux's tail command to view the last x lines of the log once. Combine tail with the watch command to view the tail end repeatedly every x seconds.
Interpreting what you see in the log is not easy, but if you follow the clues posted by Crystal Pellet and others in this thread you'll eventually get the hang of it. After the novelty wears off you'll long for a script or app to do the work for you, as it is tedious and monotonous, and if you don't stay on task 24/7/365, loopers and long runners will slip by you.
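If juggling tail and watch by hand gets tiresome, a few lines of Python can do much the same job for every slot at once. A rough sketch, using the path pattern from the paragraph above ('*' stands in for the <slot> directory; adjust it if your BOINC data directory is laid out differently, and the 20-line/60-second defaults are just placeholders):

import glob
import time
from collections import deque

LOG_GLOB = "/var/lib/boinc-client/*/cernvm/shared/runRivet.log"

def tail(path, n=20):
    # Return the last n lines of a file, like `tail -n`.
    with open(path, errors="replace") as f:
        return list(deque(f, maxlen=n))

def watch_logs(interval=60, n=20):
    # Print the tail of every Theory running log, like `watch tail`.
    while True:
        for path in sorted(glob.glob(LOG_GLOB)):
            print("\n===", path, "===")
            print("".join(tail(path, n)), end="")
        time.sleep(interval)

if __name__ == "__main__":
    watch_logs()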

I've been developing my watchdog script for over a year now in order to automate both the data gathering and the analysis. Trust me when I tell you there is no cut-and-dried, sure-fire way to determine if a job is viable. I have found that sometimes a job that shows all the indications of being either a looper or a long runner progressing so slowly it cannot possibly complete in the time remaining will suddenly stop looping and proceed, or else speed up and complete. It's a judgement call based on more or less arbitrary criteria, as much an art as a science.

The nice thing about Theory VBox is that you (or a watchdog script) can shut down a looper or long runner gracefully so it still earns credits even if it does no useful work. Native Theory and native ATLAS tasks cannot be shut down gracefully. They can only be aborted (which means no credits), so the temptation is to use more forgiving criteria and maybe let them run a little longer (in for a penny, in for a pound, as they say), but the question becomes "how long is the average volunteer prepared to let it run when it might run for several days and yield 0 credits?" So there is the temptation to go NIMBY on it and just abort all sherpa jobs immediately. I don't like that approach, but the option is there in my watchdog script anyway.

Another approach is to abort sherpas on the basis of the number of events the job is configured to process (the target events number). I adjusted my script to do that but as Crystal pointed out there is no correlation between the target events number and the job's likelihood of failure. So I think I'll be removing that part of the code.

I think the sherpa version number is a better (more reliable, though again not 100% reliable) criterion for judging the chance of success, so I'm in the process of adding that mechanism to my script. It's grown from a relatively easy-to-modify text-mode app into a GUI app so users don't have to struggle with config files, command line options, etc. The problem is GUIs are much harder to program and debug. Recently I've taken to the notion that it should work across a LAN... a single GUI client that communicates with servers running on individual hosts, similar to BOINC's monitor and client. BIG job for me.
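If version filtering is the way to go, everything needed is already in the job's header line (the "===> [runRivet] ..." lines quoted throughout this thread). Below is a hypothetical sketch; the field positions, counted from the end of the bracketed list, are inferred from the examples posted here, and the "old versions" policy is just an illustration, not my script's actual rule.

import re

HEADER_RE = re.compile(r"\[runRivet\].*\[(.*)\]")

def job_info(header_line):
    # Extract (generator, version, target_events) from a runRivet header line,
    # e.g. '[boinc pp jets 8000 350 - sherpa 1.4.3 default 18000 38]'.
    m = HEADER_RE.search(header_line)
    if not m:
        return None
    generator, version, tune, events, seed = m.group(1).split()[-5:]
    return generator, version, int(events)

def looks_risky(header_line, old_versions=("1.2", "1.3", "1.4")):
    # Hypothetical policy: flag old sherpa versions as likely loopers/long runners.
    info = job_info(header_line)
    if info is None:
        return False
    generator, version, _ = info
    return generator == "sherpa" and version.startswith(old_versions)

For example, job_info() applied to the header Crystal posted in message 38527 would return ('sherpa', '1.4.3', 18000).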
ID: 38493 · Report as offensive     Reply Quote
Jim1348

Send message
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38494 - Posted: 30 Mar 2019, 7:46:13 UTC - in response to Message 38493.  

Thanks. That all looks quite interesting, especially since I monitor my machines over a LAN. But I am also getting past (way past) any interest in babysitting my projects, and am looking to simplify my operations. Fortunately, Native Theory has behaved well the last couple of weeks, and maybe I will not need extreme measures. But if I do, it will probably be deselecting Theory.

But I am impressed with your work, and keep it up. There are people who will benefit.
ID: 38494 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38496 - Posted: 30 Mar 2019, 13:42:50 UTC - in response to Message 38492.  

Unlike Theory (vbox), Theory native has no hard 18h limit.
What remains is the due date set by the BOINC server.
This can't be set to infinite, since it also has to catch other issues like non-responding hosts.
The task will continue on the host but will be treated as invalid when the client reports it.

Ray Murray wrote:
For the huge-logfile ones, could they be set to terminate the JOB on reaching the limit rather than killing the TASK?

+1
It would require a watchdog inside the VM that uploads the logs (maybe not the extra large ones) and initiates a graceful shutdown.


CP posted some numbers in the Theory native thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979&postid=38451
They show that most of the work should be able to finish within the standard time limits.

Nonetheless, those failing tasks are annoying.


The best approach would of course be to analyse the reason for long runners and huge logs.

In a post from almost a year ago, https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4028&postid=35142#35142, Peter Skands suggests that a number of bugs in older versions of sherpa have been fixed, but the older versions have not been deprecated and are still being used occasionally, for reasons that make sense to me. Hopefully when native Theory transitions from this test phase to production, those older versions will be dropped. A watchdog running inside the container/VM could easily manage problems born of the newer versions. A watchdog like mine runs outside the container/VM on the volunteer's account, so it's limited in what it can do.
ID: 38496 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,307
RAC: 1,952
Message 38527 - Posted: 4 Apr 2019, 17:16:03 UTC
Last modified: 4 Apr 2019, 17:29:59 UTC

I have to extend the 18-hour VM run time to make this one a success.

===> [runRivet] Thu Apr 4 15:34:40 CEST 2019 [boinc pp jets 8000 350 - sherpa 1.4.3 default 18000 38]

                run 2279                 events attempts success failure lost
pp jets 8000 350 - sherpa 1.4.3 default  129000    19       8       1	  10

Busy at:
Event 400 ( 41m 22s elapsed / 1d 6h 20m 43s left ) -> ETA: Sat Apr 06 01:06
Event 500 ( 51m 43s elapsed / 1d 6h 10m 38s left ) -> ETA: Sat Apr 06 01:06

The time left since event 500 is not increasing.

Ooops: Event 600 ( 1h 3m 42s elapsed / 1d 6h 47m 27s left ) -> ETA: Sat Apr 06 01:55
ID: 38527 · Report as offensive     Reply Quote
