Thread '197 (0xc5) EXIT_TIME_LIMIT'

Author	Message
Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 939 Credit: 781,711,177 RAC: 78,084	Message 26449 - Posted: 14 May 2014, 22:52:39 UTC I got a couple of WU's with this error: http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36571379 http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36568253 They're both on the same computer. Don't know if you can work anything out from the logs? ID: 26449 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26458 - Posted: 16 May 2014, 10:22:36 UTC - in response to Message 26449. Thanks Toby; I am looking at this as well. Eric. (P.S. I am now subscribed myself to the boinc-dev etc mailing lists. Great and thanks.) ID: 26458 · Reply Quote

Richard Haselgrove Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 26459 - Posted: 16 May 2014, 12:02:26 UTC - in response to Message 26458.
I think there are probably three elements of your recent testing which are conspiring to confuse BOINC's scheduling.
1) Deployment of new application versions
2) Artificially shortened test tasks
3) Non-deterministic runtimes
Under the Runtime Estimation process associated with Credit New, each new app_version starts with a blank set of estimates. If the early results for a particular host all have a short runtime (either because the workunit has been shortened without a corresponding reduction in rsc_fpops_est, or because the simulation hits the wall), the server accepts - without sanity checking, so far as I can tell - that the host is extraordinarily fast: this is visible as the 'APR' (Average Processing Rate) on the Application Details page for the host.
Once an initial baseline has been established (after 10 'completed' tasks), the APR is used to estimate the runtime of all future tasks. If the host subsequently runs perfectly 'normal' tasks - correct rsc_fpops_est, and no early exit - the runtime is likely to exceed the rsc_fpops_bound at normal processing speeds.
The sledgehammer kludge is to increase rsc_fpops_bound for all workunits to 100x or even 1000x rsc_fpops_est, instead of the default 10x. This, of course, negates the purpose of rsc_fpops_bound, which is to catch and abort looping applications. But that's probably the lesser problem here.
The more complicated solution is to remember to adjust rsc_fpops_est for each different class of WUs - most critically, shortened test WUs (and doing urgent, repeated, tests is of course exactly when you least want to be fiddling with that. C'est la vie).
If you have any way of 'seeding' new application versions with tasks that don't crash out early, that would also help - but that rather defeats the purpose of the project.
ID: 26459 · Reply Quote
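The mechanism Richard describes can be put into numbers. This is only a minimal sketch, assuming the client's time limit is rsc_fpops_bound divided by the speed the server believes the host has; the function name and all figures are hypothetical, chosen to make the arithmetic easy to follow.

```shell
#!/bin/sh
# Seconds a task may run before it is aborted, ASSUMING the limit is
# rsc_fpops_bound divided by the believed speed of the host (in FLOPS).
time_limit_seconds() {
    bound=$1            # rsc_fpops_bound of the workunit
    assumed_flops=$2    # believed speed in FLOPS (APR is shown in GFLOPS)
    echo $(( bound / assumed_flops ))
}

# Hypothetical workunit: estimate of 1.8e13 FLOPs, bound at the default 10x.
fpops_est=18000000000000
fpops_bound=$(( fpops_est * 10 ))

# Hypothetical host: really about 3 GFLOPS, but a run of early short
# tasks has inflated its APR to about 100 GFLOPS.
real_flops=3000000000
inflated_flops=100000000000

echo "actual runtime:      $(( fpops_est / real_flops )) s"
echo "limit with sane APR: $(time_limit_seconds "$fpops_bound" "$real_flops") s"
echo "limit with bad APR:  $(time_limit_seconds "$fpops_bound" "$inflated_flops") s"
```

With these made-up figures a normal task needs 6000 s, but the inflated APR shrinks the limit from 60000 s to 1800 s, so the task is cut off long before it can finish.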

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 939 Credit: 781,711,177 RAC: 78,084	Message 26460 - Posted: 16 May 2014, 12:27:54 UTC Last modified: 16 May 2014, 12:29:41 UTC Four of my computers are configured to be identical (as much as I can). One of those had this error and the others didn't, which is sort of strange. The only hardware difference is that the PC in question has a 3770T vs a 3770S. Happy to help with debugging if I can. Great info Richard! ID: 26460 · Reply Quote

Richard Haselgrove Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 26461 - Posted: 16 May 2014, 13:47:01 UTC - in response to Message 26460.
Your 3770T is host 9934731. The application details for that host - newest at bottom - show the effect I was talking about:
    SixTrack 451.07 windows_intelx86 (pni)
    Average processing rate 98.413580511401
That's a single core running with SSE3 optimisation at near enough 100 GigaFlops. Or so BOINC thinks. Since this is more than 30 times the Whetstone benchmark of 3037.96 million ops/sec, the rsc_fpops_bound is extremely likely to kick in.
Unfortunately, since this is clearly (ha!) the most efficient application version for your host, the server will preferentially send tasks tagged for this app_version - and they will continue to fail.
I don't think there's anything a volunteer can do, client_side, to escape from this catch-22: the quickest way out is to declare yourself a cheater by fiddling with rpc_seqno, and thus get a new HostID assigned. And hope that you get some moderate runtime tasks allocated in the early stages, so you seed APR with some sane values.
If, by the luck of the draw, you happen to be assigned some short, but not excessively short, tasks, APR will be adjusted downwards on each successful completion, and you might eventually dig yourself out of the trap - but that's not guaranteed.
The highest APR I can find for any of your 3770S hosts is 76.58 for host 9961528. You must have just scraped under the wire with that one - a more serendipitous mix of WUs.
ID: 26461 · Reply Quote
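Richard's "more than 30 times" can be checked directly from the two figures he quotes. The sketch below assumes, as a rough guide only, that the host's real speed is close to its Whetstone benchmark, so a task overruns once APR divided by Whetstone exceeds the bound/estimate factor; the function names are hypothetical.

```shell
#!/bin/sh
# Ratio of the server's APR (GFLOPS) to the Whetstone benchmark (MFLOPS).
apr_ratio() {
    awk -v apr="$1" -v whet="$2" 'BEGIN { printf "%.2f\n", apr * 1000 / whet }'
}

# Succeeds (exit 0) when the ratio exceeds the bound/estimate factor.
# ASSUMPTION: the real speed of the host is close to the Whetstone figure.
overruns_limit() {
    awk -v apr="$1" -v whet="$2" -v factor="$3" \
        'BEGIN { exit !(apr * 1000 / whet > factor) }'
}

apr=98.413580511401     # GFLOPS, from the Application Details page
whetstone=3037.96       # million ops/sec, from the host benchmark

echo "APR is $(apr_ratio "$apr" "$whetstone") times the benchmark"
if overruns_limit "$apr" "$whetstone" 10; then
    echo "default 10x bound: task is likely to hit EXIT_TIME_LIMIT"
fi
if ! overruns_limit "$apr" "$whetstone" 100; then
    echo "100x bound: task would fit"
fi
```

The ratio comes out at roughly 32, which is above the default factor of 10 but below the 100x suggested earlier in the thread as the sledgehammer fix.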

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26462 - Posted: 16 May 2014, 15:26:49 UTC - in response to Message 26459. Very helpful; thanks Richard. I should be able to do something about this. Eric. ID: 26462 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 939 Credit: 781,711,177 RAC: 78,084	Message 26463 - Posted: 16 May 2014, 23:00:31 UTC Richard, you know BOINC very very well! Seems like adjusting the rsc_fpops_est for each app version isn't too bad an option, assuming you know what to set it to? ID: 26463 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26464 - Posted: 17 May 2014, 1:12:30 UTC - in response to Message 26463. Very very well indeed. Sadly, I do adjust the fpops estimate for each case, but based on the maximum number of turns. When chaos sets in the particles are lost, and at higher amplitudes we may not even complete any turns, as I am seeing right now. However I can certainly avoid the very very short tests. (Remember what we are looking for is the boundary where chaos sets in.) Still I shall look at this as I said. Eric. ID: 26464 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26465 - Posted: 17 May 2014, 9:53:08 UTC - in response to Message 26462.
and a MYSQL expert as well? :-) My code is:
    # Getting the fpops estimate and multiplying it with 10 to get the bound.
    # Multiply by 4 for the moment but watch for complaints!
    # Try six for pentium 4
    fourtimes=$6
    fourtimes=`expr $fourtimes \* 6`
    fpopsEstimate=$fourtimes
    fpopsBound=`expr $fpopsEstimate \* 10`
where $6 is $sixdeskfpopse
    sixdeskfpopse=`expr $sixdeskturns \* $sixdeskparts`
    sixdeskfpopse=`expr $sixdeskfpopse / 2`
    export sixdeskfpopse=$sixdeskfpopse"000000"
The max work done is the max turns by number of particles. As analysed by Richard, this seems to be OK for "production", where we typically do 100,000 or 1,000,000 turns even if particles are lost early at high amplitudes.
So first I shall make longer tests; I didn't want to waste CPU when just checking the chain of submit, execute and return results. I was using 1000 turns. I can go to 10,000 maybe......I could also choose a smaller number if this is a sixtracktest, but what? I cannot predict stability or chaos; that is what we are trying to determine.
This also explains why I saw most failures on the pni/sse3 executables. Ideally I would reset this value when all particles are lost, but I guess there is no interface to BOINC for that. See what I can do. Eric.
ID: 26465 · Reply Quote
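To see what Eric's formula produces, here is a small sketch of the same arithmetic using shell `$(( ))` instead of `expr`: turns times particles, halved, scaled by 10^6, times 6 for the estimate, and times 10 again for the bound. The function names and the particle count of 60 are assumptions for illustration only; the thread does not say how many particles a workunit carries.

```shell
#!/bin/sh
# Estimate as in Eric's script: turns * particles / 2, with "000000"
# appended (i.e. times 10^6), then multiplied by 6.
fpops_estimate() {
    turns=$1
    particles=$2
    echo $(( turns * particles / 2 * 1000000 * 6 ))
}

# Bound as in Eric's script: 10 times the estimate.
fpops_bound() {
    echo $(( $(fpops_estimate "$1" "$2") * 10 ))
}

particles=60    # HYPOTHETICAL value, not taken from the thread
for turns in 1000 10000 100000 1000000; do
    echo "$turns turns: est=$(fpops_estimate "$turns" "$particles")" \
         "bound=$(fpops_bound "$turns" "$particles")"
done
```

The estimate scales linearly with the maximum number of turns, so a 1000-turn test is estimated at one hundredth of a 100,000-turn production job. What it cannot capture is a run that loses all its particles early, because the formula only knows the maximum.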

Richard Haselgrove Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 26467 - Posted: 17 May 2014, 12:47:37 UTC - in response to Message 26463.
"Richard, you know BOINC very very well!"
Thanks both. I simply spend too long watching BOINC at work - I need to get out more!
"Seems like adjusting the rsc_fpops_est for each app version isn't too bad an option assuming you know what to set it to?"
It's not the app_version which determines rsc_fpops_est, but the varying number of collider orbits - the "max work done", max turns multiplied by number of particles, as described by Eric. Provided rsc_fpops_est is proportional to that, we're in business.
I mentioned the 'shortened test work' problem because of recent experience at another project - a developer sent out 1-minute test jobs, while leaving rsc_fpops_est at their standard value appropriate for 10-hour jobs. Ooops, chaos (in the non-mathematical sense of the word).
It would be helpful if the early tasks sent out after a version change could, as far as possible, lie some way south of the chaos boundary: not only each individual host record, but the project as a whole, maintains averages for speed and runtime, and these can and will be poisoned if there are too many early exits in the initial stages.
After that, BOINC does in fact provide a feature which we can use. I discovered this morning that it's completely undocumented, but we did actually touch on it here a couple of years ago: message 24418.
If you can arrange for the validator to set the 'runtime_outlier' flag for tasks where a significant proportion of the simulated particles fail to complete the expected number of turns, BOINC won't use those values to update the averages. That should help to keep the server estimates stable.
The source code link I provided last time no longer works, but you can read David's explanation in http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=e49f9459080b488f85fbcf8cdad6db9672416cf8
ID: 26467 · Reply Quote
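The decision Richard suggests for the validator could look something like the sketch below. It only covers the test "did a significant proportion of particles fail to complete the expected turns"; how the flag is then set inside a real validator is not shown. The function name and the 50% threshold are hypothetical, since the thread does not say what counts as significant.

```shell
#!/bin/sh
# Succeeds (exit 0) when a result should be treated as a runtime outlier:
# more than max_lost_percent of the particles were lost before completing
# the expected number of turns. Integer arithmetic only.
is_runtime_outlier() {
    lost=$1                 # particles lost before the expected turns
    total=$2                # particles tracked in the task
    max_lost_percent=$3     # HYPOTHETICAL threshold
    [ $(( lost * 100 )) -gt $(( total * max_lost_percent )) ]
}

# Example with made-up counts: 40 of 60 particles lost, 50% threshold.
if is_runtime_outlier 40 60 50; then
    echo "exclude this runtime from the averages"
else
    echo "use this runtime to update the averages"
fi
```

A result where only a few particles are lost would still feed the averages, which is what keeps APR close to the host's real speed.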

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26470 - Posted: 17 May 2014, 13:37:48 UTC - in response to Message 26467. Great stuff; for the tests I KNOW the results and the correct number of turns to be performed. I am checking that I get the same results with the test runs. We are in production so I won't touch anything right now but my next test runs will use one of your proposed solutions. Now for the "Create Process" issue. I agree we should get out more but I shall stay in to watch the Cup Final. A problem identified is a problem solved! Many many thanks. Eric. ID: 26470 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 939 Credit: 781,711,177 RAC: 78,084	Message 26472 - Posted: 17 May 2014, 17:45:54 UTC Great job debugging problems :) If you know it can be 1000 turns max then you can adjust accordingly. The outlier result is a great feature for projects like this too, as there is significant amplitude in the expected results. This could help with Eric's quest for mathematical consistency by highlighting interesting results. ID: 26472 · Reply Quote

Sid Send message Joined: 20 Jul 07 Posts: 43 Credit: 367,186 RAC: 0	Message 26494 - Posted: 19 May 2014, 12:34:18 UTC - in response to Message 26449.
"I got a couple of WU's with this error: http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36571379 http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36568253"
I have you beat: 15 wu's fail after running roughly 7 hours. Apparently all victims of the SSE3 optimisation issue running on 2 i7's. . . . needless to say, I'm not a happy camper.
LHC: The Essential Guide Part 2
ID: 26494 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26496 - Posted: 19 May 2014, 14:07:00 UTC - in response to Message 26494. Understood; I hope I have fixed it; we shall see. I'll figure out some compensation! A T-shirt? (but I can't do that for everybody) Eric. ID: 26496 · Reply Quote

Sid Send message Joined: 20 Jul 07 Posts: 43 Credit: 367,186 RAC: 0	Message 26498 - Posted: 19 May 2014, 15:43:59 UTC - in response to Message 26496.
"Understood; I hope I have fixed it; we shall see. I'll figure out some compensation! A T-shirt? (but I can't do that for everybody) Eric."
Awarding the credit would do. . . .
LHC: The Essential Guide Part 2
ID: 26498 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26501 - Posted: 19 May 2014, 19:19:25 UTC - in response to Message 26498. I'll try........Eric. ID: 26501 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 26505 - Posted: 20 May 2014, 7:41:39 UTC One task on my main SuSE Linux host with sse2 was aborted by the server for this reason. The other, with pni, is awaiting validation after a 37-hour run. So I stopped LHC work on my main host and started it on my Ubuntu Virtual Machine, where Test4Theory@home is running with good results. Wait and see. Tullio ID: 26505 · Reply Quote