Message boards : Number crunching : 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,592
RAC: 232,533
Message 26449 - Posted: 14 May 2014, 22:52:39 UTC

I got a couple of WUs with this error:

http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36571379
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36568253

They're both on the same computer.

I don't know if you can work anything out from the logs.
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26458 - Posted: 16 May 2014, 10:22:36 UTC - in response to Message 26449.  

Thanks Toby; I am looking at this as well. Eric.
(P.S. I have now subscribed myself to the boinc-dev etc.
mailing lists. Great, and thanks.)
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 26459 - Posted: 16 May 2014, 12:02:26 UTC - in response to Message 26458.  

I think there are probably three elements of your recent testing which are conspiring to confuse BOINC's scheduling.

1) Deployment of new application versions
2) Artificially shortened test tasks
3) Non-deterministic runtimes

Under the Runtime Estimation process associated with Credit New, each new app_version starts with a blank set of estimates.

If the early results for a particular host all have a short runtime (either because the workunit has been shortened without a corresponding reduction in rsc_fpops_est, or because the simulation hits the wall), the server accepts - without sanity checking, so far as I can tell - that the host is extraordinarily fast: this is visible as the 'APR' (Average Processing Rate) on the Application Details page for the host.
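
Roughly speaking (a sketch only, not the server's actual code, and the figures below are invented), the rate credited to the host for each completed task behaves like rsc_fpops_est divided by the elapsed time, so a task that dies early looks as if it ran on a supercomputer:

# Sketch only - the real APR is a weighted average over recent validated
# results, but the basic arithmetic looks like this. Figures are invented.
rsc_fpops_est=6000000000000    # estimate sized for ~30 minutes of real work
elapsed=60                     # task exits after 60 s: all particles lost
echo $((rsc_fpops_est / elapsed))   # ~100,000,000,000 flops/s, i.e. ~100 GFLOPS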

Once an initial baseline has been established (after 10 'completed' tasks), the APR is used to estimate the runtime of all future tasks.

If the host subsequently runs perfectly 'normal' tasks - correct rsc_fpops_est, and no early exit - the runtime is likely to exceed the rsc_fpops_bound at normal processing speeds.

The sledgehammer kludge is to increase rsc_fpops_bound for all workunits to 100x or even 1000x rsc_fpops_est, instead of the default 10x. This, of course, negates the purpose of rsc_fpops_bound, which is to catch and abort looping applications. But that's probably the lesser problem here.
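
For what it's worth, the client-side check behaves roughly like this (again a sketch with invented numbers, not the actual BOINC client code): the bound is converted into an elapsed-time limit using the flops rate the scheduler believes the app_version achieves on the host.

rsc_fpops_est=6000000000000               # invented task estimate
rsc_fpops_bound=$((rsc_fpops_est * 10))   # default bound: 10x the estimate
apr_flops=100000000000                    # host believed to run at ~100 GFLOPS
time_limit=$((rsc_fpops_bound / apr_flops))
echo "task aborted after ~${time_limit}s"   # ~600 s with these figures
# At a true ~3 GFLOPS the same task really needs ~2000 s, so it hits the
# limit and exits with 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED.

Raising the bound to 100x simply pushes that limit far enough out that a sane task can finish even against a wildly optimistic APR.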

The more complicated solution is to remember to adjust rsc_fpops_est for each different class of WUs - most critically, shortened test WUs (and urgent, repeated testing is of course exactly when you least want to be fiddling with that; c'est la vie).

If you have any way of 'seeding' new application versions with tasks that don't crash out early, that would also help - but that rather defeats the purpose of the project.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,592
RAC: 232,533
Message 26460 - Posted: 16 May 2014, 12:27:54 UTC
Last modified: 16 May 2014, 12:29:41 UTC

Four of my computers are configured to be identical (as much as I can manage). One of those had this error and the others didn't, which is sort of strange.

The only hardware difference is that the PC in question has a 3770T rather than a 3770S.

Happy to help with debugging if I can.

Great info Richard!
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 26461 - Posted: 16 May 2014, 13:47:01 UTC - in response to Message 26460.  

Your 3770T is host 9934731.

The application details for that host - newest at bottom - show the effect I was talking about:

SixTrack 451.07 windows_intelx86 (pni)
Average processing rate 98.413580511401

That's a single core running with SSE3 optimisation at near enough 100 GigaFlops. Or so BOINC thinks. Since this is more than 30 times the Whetstone benchmark of 3037.96 million ops/sec, the rsc_fpops_bound is extremely likely to kick in.
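
To put rough numbers on it (back-of-envelope only, not the client's exact calculation): with the default bound of 10x rsc_fpops_est, the allowed elapsed time scales as 10/APR, while the real runtime scales as 1/Whetstone, so the overshoot factor is independent of the task size:

echo "scale=1; 98413580511 / (10 * 3037960000)" | bc   # ~3.2
# a normal-length task needs roughly 3x longer than the limit the
# inflated APR allows, so it is aborted with error 197.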

Unfortunately, since this is clearly (ha!) the most efficient application version for your host, the server will preferentially send tasks tagged for this app_version - and they will continue to fail. I don't think there's anything a volunteer can do, client-side, to escape from this catch-22: the quickest way out is to declare yourself a cheater by fiddling with rpc_seqno, and thus get a new HostID assigned. And hope that you get some moderate-runtime tasks allocated in the early stages, so you seed APR with some sane values.

If, by the luck of the draw, you happen to be assigned some short, but not excessively short, tasks, APR will be adjusted downwards on each successful completion, and you might eventually dig yourself out of the trap - but that's not guaranteed.

The highest APR I can find for any of your 3770S hosts is 76.58 for host 9961528. You must have just scraped under the wire with that one - a more serendipitous mix of WUs.
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26462 - Posted: 16 May 2014, 15:26:49 UTC - in response to Message 26459.  

Very helpful; thanks Richard.
I should be able to do something about this. Eric.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,592
RAC: 232,533
Message 26463 - Posted: 16 May 2014, 23:00:31 UTC

Richard, you know BOINC very very well!

Seems like adjusting the rsc_fpops_est for each app version isn't too bad an option, assuming you know what to set it to?

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26464 - Posted: 17 May 2014, 1:12:30 UTC - in response to Message 26463.  

Very very well indeed. Sadly, I do adjust the fpops estimate for each
case, but based on the maximum number of turns. When chaos sets in,
the particles are lost, and at higher amplitudes we may not even
complete any turns, as I am seeing right now! However, I can certainly
avoid the very, very short tests. (Remember, what we are looking for
is the boundary where chaos sets in.) Still, I shall look at this as I said.
Eric.
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26465 - Posted: 17 May 2014, 9:53:08 UTC - in response to Message 26462.  

and a MYSQL expert as well? :-)

My code is:
# Getting the fpops estimate and multiplying it with 10 to get the bound.
# Multiply by 4 for the moment but watch for complaints!
# Try six for pentium 4
fourtimes=$6
fourtimes=`expr $fourtimes \* 6`
fpopsEstimate=$fourtimes
fpopsBound=`expr $fpopsEstimate \* 10`

where $6 is $sixdeskfpopse

sixdeskfpopse=`expr $sixdeskturns \* $sixdeskparts`
sixdeskfpopse=`expr $sixdeskfpopse / 2`
export sixdeskfpopse=$sixdeskfpopse"000000"

The max work done is the max turns multiplied by the number of particles.

As analysed by Richard, this seems to be OK for "production", where we typically do
100,000 or 1,000,000 turns, even if particles are lost early at high amplitudes.

So first I shall make longer tests; I didn't want to waste CPU when just checking
the chain of submit, execute and return results. I was using 1000 turns.
I can go to 10,000 maybe... I could also choose a smaller number if this
is a sixtracktest, but what? I cannot predict stability or chaos; that is what we
are trying to determine.
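
A minimal sketch of one way to tie the estimate to the test length, reusing the sixdesk variable names from the snippet above (the sixdesktest flag and the turn counts are invented, and the relaxed bound follows Richard's 100x suggestion):

# Sketch only; "sixdesktest" is an invented flag, not an existing variable.
if [ "$sixdesktest" = "true" ]; then
    sixdeskturns=10000    # longer test jobs than the 1000 turns used so far
fi
# keep the estimate proportional to the work actually requested
sixdeskfpopse=`expr $sixdeskturns \* $sixdeskparts`
sixdeskfpopse=`expr $sixdeskfpopse / 2`
export sixdeskfpopse=$sixdeskfpopse"000000"
# and, in the other snippet, relax the bound while testing so early particle
# loss on a host with an inflated APR does not trip the time limit:
# fpopsBound=`expr $fpopsEstimate \* 100`   instead of \* 10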

This also explains why I saw most failures on the pni/sse3 executables.

Ideally I would reset this value when all particles are lost but I guess there
is no interface to BOINC for that.

See what I can do. Eric.


Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 26467 - Posted: 17 May 2014, 12:47:37 UTC - in response to Message 26463.  

> Richard, you know BOINC very very well!

Thanks both. I simply spend too long watching BOINC at work - I need to get out more!

> Seems like adjusting the rsc_fpops_est for each app version isn't too bad an option, assuming you know what to set it to?

It's not the app_version which determines rsc_fpops_est, but the varying number of collider orbits - the "max work done", max turns multiplied by the number of particles, as described by Eric. Provided rsc_fpops_est is proportional to that, we're in business.

I mentioned the 'shortened test work' problem because of recent experience at another project - a developer sent out 1-minute test jobs while leaving rsc_fpops_est at its standard value, appropriate for 10-hour jobs. Oops: chaos (in the non-mathematical sense of the word).

It would be helpful if the early tasks sent out after a version change could, as far as possible, lie some way south of the chaos boundary: not only each individual host record, but the project as a whole, maintains averages for speed and runtime, and these can and will be poisoned if there are too many early exits in the initial stages.

After that, BOINC does in fact provide a feature which we can use. I discovered this morning that it's completely undocumented, but we did actually touch on it here a couple of years ago: message 24418

If you can arrange for the validator to set the 'runtime_outlier' flag for tasks where a significant proportion of the simulated particles fail to complete the expected number of turns, BOINC won't use those values to update the averages. That should help to keep the server estimates stable.

The source code link I provided last time no longer works, but you can read David's explanation in http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=e49f9459080b488f85fbcf8cdad6db9672416cf8
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26470 - Posted: 17 May 2014, 13:37:48 UTC - in response to Message 26467.  

Great stuff; for the tests I KNOW the results and the correct
number of turns to be performed. I am checking that I get
the same results with the test runs. We are in production so I
won't touch anything right now but my next test runs will
use one of your proposed solutions.

Now for the "Create Process" issue.

I agree we should get out more but I shall stay in to watch the
Cup Final.

A problem identified is a problem solved!

Many many thanks. Eric.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,756,592
RAC: 232,533
Message 26472 - Posted: 17 May 2014, 17:45:54 UTC

Great job debugging problems :)

If you know it can be 1000 turns max then you can adjust accordingly.

The runtime_outlier flag is a great feature for projects like this too, as there is significant variation in the expected results; it could help with Eric's quest for mathematical consistency by highlighting interesting results.
Sid

Joined: 20 Jul 07
Posts: 43
Credit: 367,186
RAC: 0
Message 26494 - Posted: 19 May 2014, 12:34:18 UTC - in response to Message 26449.  

> I got a couple of WUs with this error:
> http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36571379
> http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=36568253

I have you beat: 15 WUs failed after running roughly 7 hours, apparently all victims of the SSE3 optimisation issue, running on two i7s.

. . . needless to say, I'm not a happy camper.

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26496 - Posted: 19 May 2014, 14:07:00 UTC - in response to Message 26494.  

Understood; I hope I have fixed it; we shall see.
I'll figure out some compensation! A T-shirt?
(but I can't do that for everybody) Eric.
Sid

Joined: 20 Jul 07
Posts: 43
Credit: 367,186
RAC: 0
Message 26498 - Posted: 19 May 2014, 15:43:59 UTC - in response to Message 26496.  

> Understood; I hope I have fixed it; we shall see.
> I'll figure out some compensation! A T-shirt?
> (but I can't do that for everybody) Eric.


Awarding the credit would do. . . .


Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26501 - Posted: 19 May 2014, 19:19:25 UTC - in response to Message 26498.  

I'll try........Eric.
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 26505 - Posted: 20 May 2014, 7:41:39 UTC

One task on my main SuSE Linux host, with sse2, was aborted by the server for this reason. The other, with pni, is awaiting validation after a 37-hour run. So I stopped LHC work on my main host and started it on my Ubuntu virtual machine, where Test4Theory@home is running with good results. Wait and see.
Tullio
