21) Message boards : Number crunching : Host messing up tons of results (Message 27361)
Posted 10 Apr 2015 by Richard Haselgrove
Post:
Eric
So users do crunch x86 apps on their 64-bit PCs, right? Couldn't that be a possible incompatibility point?

I run SixTrack and other 32-bit BOINC project applications on 64-bit Windows 7 with no problem at all. If anyone is having problems with their SysWOW64 environment, it's specific to their computer, not general or widespread (Windows 10 - as yet unreleased - is another question entirely, of course).

32-bit Linux apps on a 64-bit system are more of a problem, because many projects rely on 32-bit compatibility libraries and not every user installs them. But that results in an immediate application crash, not the sort of pseudo-valid results that Eric is wrestling with.
22) Message boards : Number crunching : Host messing up tons of results (Message 27345)
Posted 8 Apr 2015 by Richard Haselgrove
Post:
The instruction manual for notices is

https://boinc.berkeley.edu/trac/wiki/ProjectNotices

It refers first of all to "your gui_urls.xml file": this would place (optional) project web page buttons on the left of our BOINC Manager screens, below the 'Command' buttons. You're not showing any optional buttons at the moment, so this project may not even have a gui_urls.xml file yet (or it may be empty). That would be step one: https://boinc.berkeley.edu/trac/wiki/GuiUrls.

After that, Notices appear to be linked to the procedure for putting news items and matching comment threads onto the front page - you have that authority already, so it should be relatively straightforward to find the 'export' button they refer to.
23) Message boards : Number crunching : Host messing up tons of results (Message 27341)
Posted 8 Apr 2015 by Richard Haselgrove
Post:
'Cancelled by server' can be either an automatic action or a manual one. Automatically: when a quorum partner returns their result, valid but late - if the replacement task hasn't been issued, or has been issued but not yet started by the recipient, it gets cancelled with no manual intervention. Or, as I suspect in this case, the server operators can cancel a whole batch of WUs because they were configured wrongly or otherwise no longer needed. I believe it's generally easier to cancel all WUs in a batch than to cancel all tasks sent to an individual user or host, but there may be some additional tools made available recently which could help. I'll try to look into those before Eric gets back from his vacation.

'Abandoned' is a different matter entirely. It's supposed to happen when a computer is detached from the project, and then re-attached - but there are indications that there is a deeply-buried bug somewhere in the server code which occasionally throws a whole batch of tasks away while the computer still thinks it's attached and is processing them. But that's one where we need help from a server administrator, rather than a project scientist like Eric.
24) Message boards : Number crunching : Host messing up tons of results (Message 27338)
Posted 8 Apr 2015 by Richard Haselgrove
Post:
My understanding is that since the host is blacklisted, ALL assigned tasks have to be redeployed to other hosts and recrunched.
As I see my 3 of 3 tasks related to 9996388 now fresh in the inconclusive list, I assume all these hundreds of new tasks in 9996388's inconclusive list still rely on central.

No, blacklisting simply prevents new tasks being allocated - as per the mechanism "set max_results_day field to -1". There is supposedly an automatic quota system which - eventually - drags results per day down to 1. But the automatic system allows it to float back up again if tasks validate, and enough of 9996388's tasks validate to allow the work to keep flowing. With the special -1 setting, the validation of existing tasks won't allow quota to be increased automatically.
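The quota behaviour described above can be sketched roughly like this. This is an illustrative model only, not the real BOINC server code: the function name, the doubling/halving policy, and the cap are my assumptions, but it captures the key point that the special -1 setting never recovers automatically.

```python
MAX_QUOTA = 80  # assumed per-host cap on results per day (illustrative)

def adjust_quota(quota, task_valid):
    """Toy model of the automatic daily-quota system described above.

    A blacklisted host (quota == -1) stays blacklisted no matter how many
    of its outstanding tasks validate; otherwise the quota floats up on
    validations and decays toward 1 on errors.
    """
    if quota == -1:
        return -1                          # operator blacklist: never auto-recover
    if task_valid:
        return min(quota * 2, MAX_QUOTA)   # floats back up on validation
    return max(quota // 2, 1)              # drags down toward 1 on errors
```

The exact up/down increments in the real scheduler differ, but the asymmetry is the point: normal hosts can climb back, a -1 host cannot.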
25) Message boards : Number crunching : Host messing up tons of results (Message 27336)
Posted 8 Apr 2015 by Richard Haselgrove
Post:
Don't worry about 9996388 - it hasn't received any new tasks since 7 Apr 2015, 11:46:43 UTC (yesterday) - previously it was grabbing new ones every minute.

If more inconclusive results are showing now than before, they will be previously 'validation pending' tasks which have now been tested against a wingmate - and found wanting.

Simply calling up that 14,000 task list took the server ages - let's hope that when the excessive task lists are finally purged from the system, MySQL will be able to run at normal speed again.

I agree that email and Notices are not mutually exclusive ways of getting the message out - by all means use both.
26) Message boards : Number crunching : Host messing up tons of results (Message 27332)
Posted 8 Apr 2015 by Richard Haselgrove
Post:
"BOINC Manager Notices
It's unfair to users to just silently block their hosts. We need to explain, and make some proposals for how to fix it." AGREED.

I suspect that a directly targeted email to the host's owner would have more success than a Notice. I think we have to assume, almost by definition, that the owners of these machines don't pay much, if any, attention to BOINC - they may not even be running a version of BOINC which is capable of displaying notices, or the rogue results may be happening on a machine they don't regularly visit.
27) Message boards : Number crunching : Host messing up tons of results (Message 27313)
Posted 7 Apr 2015 by Richard Haselgrove
Post:
Great; except I don't have the tool nor the permission.
I'll get it done soonest though. Thanks a million. Eric.

At least it gives you a better idea of the message to pass to Cerberus!
28) Message boards : Number crunching : Host messing up tons of results (Message 27309)
Posted 7 Apr 2015 by Richard Haselgrove
Post:
There is no way I know of; I'm waiting for my colleagues to fix
the null result problem, adjust outliers, and perhaps
ban the host. My bad WUs should be out of the way soon
and we shall be back to "normal". Eric.

"To blacklist a host, set its max_results_day field to -1."
29) Message boards : Number crunching : Host messing up tons of results (Message 27306)
Posted 7 Apr 2015 by Richard Haselgrove
Post:
Eric
Is there any way to ban a particular host, not the user? In that case, would all assigned tasks be discarded indefinitely, whether or not they were calculated on the user's side?
That way the other hosts would just carry on, unaware of the broken host, and save us a fortune in time and energy)

Ah!

Eric, you're using the wrong tool!

If you look at host 9996388, the owner is shown as "(banished: ID 147506)". That's designed to block spammers and other nuisances from these message boards - it doesn't affect his computer processing.

Instead, you should be using Blacklisting hosts to stop the workflow to that host - and then lift the banishment, so that he can come here and talk to us about it!
30) Message boards : Number crunching : Available work? (Message 27292)
Posted 5 Apr 2015 by Richard Haselgrove
Post:
Yes, it's a complicated juggling act - made more so for users, because we only see the simplified "max # of error/total/success tasks", and I wasn't aware of some of the nuances of definition of some of those entries until today.

It looks as if https://boinc.berkeley.edu/trac/wiki/JobIn#scheduling is the place to be. I think I come to the conclusion - but I could well be wrong - that:

max_success_results (your resultsWithoutConcensus) should always be above min_quorum (your redundancy), but perhaps only by one - that would allow one inconclusive result, but not go on to allow an increased risk of two bad hosts validating each other. So these two values should march in step - if one changes, they both should be changed.

max_error_results (your errors) could perhaps come down a bit, depending on your observed sporadic error rate across the population of hosts. This value would be the one which kills a 'bad WU', in most cases, so it should be larger than the sporadic error rate by enough to avoid too many false triggers, but not by enough to allow bad WUs to clog the database.

If max_error_results and max_success_results are set properly, one or other of them will always come into play before max_total_results (your numIssues), so maybe Einstein's value of 20 isn't so far wrong after all. But I think max_total_results should be set >= max_error_results + max_success_results, else it will kick in prematurely at, say, one success plus four errors, or two inconclusives plus three errors.
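A minimal sketch of how those three limits interact. The parameter values are hypothetical examples following the reasoning above (max_success_results = min_quorum + 1, max_total_results >= max_error_results + max_success_results); the function name and status strings are mine, not BOINC's:

```python
def check_limits(n_error, n_success, n_total,
                 max_error_results=5,     # above the sporadic error rate
                 max_success_results=3,   # min_quorum (2) + 1, as suggested
                 max_total_results=8):    # >= max_error + max_success
    """Toy model of the workunit termination limits discussed above."""
    if n_error >= max_error_results:
        return "too many errors"          # kills a 'bad WU' in most cases
    if n_success >= max_success_results:
        return "too many success results" # no consensus reached
    if n_total >= max_total_results:
        return "too many total results"   # the backstop limit
    return "in progress"
```

With max_total_results at 8, one success plus four errors (5 total) no longer triggers the backstop prematurely, which is exactly the failure mode the last paragraph warns about.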
31) Message boards : Number crunching : Available work? (Message 27290)
Posted 5 Apr 2015 by Richard Haselgrove
Post:
Just an example...
Einstein@home uses minimum quorum 2, maximum tasks - 20...
Believe it or not, even with this extremely high reserve, occasionally several WUs still end up with the status "too many errors"... ;-)

That approach works for Einstein, because they are performing long searches (many months) through large but consistent datasets. The preparation of individual workunits from those datasets can be automated, and very rarely results in significant numbers of 'impossible' tasks. When they do happen, it's usually because the staff are preparing a new search or application, and are paying close attention so they can catch problems with a small test batch quickly.

Here, I get the impression that there's more direct human input into the preparation of smaller production batches. It's still rare, but it's more likely that one whole batch will go wrong. With a very high maximum replication number, and occasional very long queues, any bad batch would recirculate through the system for a long time until the very last WU met its maker.

We certainly need a 'maximum tasks' limit high enough to allow for some errors to happen before quorum is reached, but as in life - moderation in all things.
32) Message boards : Number crunching : Available work? (Message 27274)
Posted 4 Apr 2015 by Richard Haselgrove
Post:
P.S. I seem to remember a comment about available work that I can't find
(03:00 here!). Work tends to come in batches, often from different customers
at the same time. Since we have first-in-first-out, the different customers
then slow each other down in real time. I think we should rather run three
different studies consecutively: the 1st user waits three days, say, the 2nd 6 days,
and the 3rd 9 days. Is this better than 3 customers waiting 9 days? A classic
scheduling issue. I am thinking of introducing an additional layer of
control and monitoring between the customer and BOINC. This would allow the
customer to enquire about study progress, CERN-specific scheduling, and
better management of CERN local resources like disk space.
FIRST solve the annoying result differences!!!

I remember making a remark like that last time we had Lots of Work!

It was a fairly trivial throw-away comment at the time, but a few days later we had quite a big problem with Error reported by file upload server: Server is out of disk space. I can't help thinking the two events were related, and precisely for that reason - if a tie-breaker needs to be issued and goes to the end of a very long 'all researchers' combined queue, then all the result files for the workunit have to hang around for that much longer and clog up all the pipes.
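Eric's scheduling question can be illustrated with a toy calculation: three equal batches run one after another finish sooner on average than the same batches interleaved through one FIFO queue, where everything finishes together at the end. The 3-day batch size is the hypothetical figure from his example.

```python
def completion_times_consecutive(durations):
    """Finish time of each batch when batches run one after another."""
    times, elapsed = [], 0
    for d in durations:
        elapsed += d
        times.append(elapsed)
    return times

batches = [3, 3, 3]  # three customers, 3 days of work each (Eric's example)

consecutive = completion_times_consecutive(batches)  # 1st waits 3 days, 2nd 6, 3rd 9
interleaved = [sum(batches)] * len(batches)          # FIFO mixing: all wait 9 days

avg_consecutive = sum(consecutive) / len(consecutive)  # 6 days on average
avg_interleaved = sum(interleaved) / len(interleaved)  # 9 days on average
```

The average wait drops from 9 days to 6, and no customer is worse off than before - the classic argument for running short jobs to completion rather than interleaving everything.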
33) Message boards : Number crunching : Host messing up tons of results (Message 27234)
Posted 28 Mar 2015 by Richard Haselgrove
Post:
It's more likely to be delayed, rather than wasted, if your machine normally produces valid results. 'Inconclusive' is not the end of the story.
34) Message boards : Number crunching : Retrospect: June, July, August 2013 WU? (Message 27211)
Posted 8 Mar 2015 by Richard Haselgrove
Post:
Well my birthday is 19th May; 9 years retired but still a volunteer.
I'll see what I can do as, as you say, it doesn't enhance our
reputation. Eric.

If you spot any clues as to why the transitioner missed these tasks, could you pass them up the line, please? This isn't the only project with orphan ancient tasks clogging up the database.
35) Message boards : Number crunching : Available work? (Message 27206)
Posted 7 Mar 2015 by Richard Haselgrove
Post:
Thank you.

Cosmology@home has a priority of -1.00. I had to set LHC@home to 400 so that its priority reaches -0.92 (-0.49 at 200)! Why is that? And why are the priorities set to negative values?

Zero is a nice, fixed, high priority - it's defined as the maximum priority. Any negative priority is below that maximum. Projects closest to zero have highest priority, so will run/fetch work (there are actually two separate priority values for every project) first.

Priority is dynamic, and not the same thing as Resource Share.
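To put the point above in concrete terms: with zero as the fixed maximum, "highest priority" simply means "closest to zero", so picking the next project is just taking the maximum of the (negative) values. The numbers here are the ones quoted in the question; the dictionary and selection are an illustrative sketch, not the client's actual code.

```python
# Dynamic priorities as quoted above: both negative, i.e. below the maximum of 0.
projects = {
    "Cosmology@Home": -1.00,
    "LHC@home": -0.92,   # after raising its resource share to 400
}

# The project closest to zero has the highest priority, so it runs/fetches first.
next_project = max(projects, key=projects.get)
```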
36) Message boards : News : Server Intervention 10-Feb-2014 (Message 27165)
Posted 15 Feb 2015 by Richard Haselgrove
Post:
My priority now is to try and reduce inconclusives
by improving the banning of hosts with too many wrong
results. Eric.

Here's one for your list:

Host 9996388

What surprises me is how many have validated.
37) Message boards : Number crunching : Wrong applications sent to my computer? (Message 27163)
Posted 14 Feb 2015 by Richard Haselgrove
Post:
It depends how many of the 'wrong' application tasks you have completed. The BOINC server doesn't actually take any notice of the calculated values until you have 'completed' 11 tasks (strictly, "more than 10"). 'Completed', in this context, means that you have returned them with a 'success' outcome, and that they have been validated by your wingmate. Naturally, the short-running tasks tend to validate first, which gives an unfair boost to APR in situations like this: as longer-running tasks are validated from pending, APR will tend to fall.
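That "more than 10" rule can be sketched as follows. This is an illustrative model of the behaviour described above, not the server's actual code: the function name is mine, and before the threshold the server falls back on its benchmark-derived estimate rather than the measured APR.

```python
def effective_speed(n_validated, measured_apr_gflops, benchmark_gflops):
    """Return the speed estimate the scheduler would use (toy model).

    The measured Average Processing Rate only counts once more than 10
    tasks have been returned successfully AND validated by a wingmate;
    until then, a benchmark-based estimate stands in for it.
    """
    if n_validated > 10:
        return measured_apr_gflops
    return benchmark_gflops
```

This is also why the early bias matters: the short-running tasks validate first, so the first 11 "completed" tasks skew fast, and the APR only drifts back down as the longer tasks clear their pending state.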

If you go fishing for a new HostID number, remember that you will be abandoning the current APR calculated for x64_PNI, and all the others, too. The server will try your patience with trial runs of each of the application versions for the 'new' host too, before eventually settling down on the one it thinks your computer prefers.

I wouldn't bother - the workflow at this project is too variable and unpredictable. If there is a significant difference in speed between the application versions, the server will work it out in the end: if the difference is insignificant - well, then it doesn't matter after all.
38) Message boards : Number crunching : Wrong applications sent to my computer? (Message 27161)
Posted 14 Feb 2015 by Richard Haselgrove
Post:
No. For the purpose of this discussion, forget that the figures have any meaning at the individual instruction level, or inside the real silicon of a real CPU chip. These are just BOINC estimates and averages used for scheduling, nothing more.

Taking some figures from my i5 laptop, which has a measured floating point benchmark speed of 2857.28 million ops/sec (now that has some claim to be a real value):

I had seven tasks in progress just now, including two which had completed but not yet been reported.

Every task had an estimated 'size' of <rsc_fpops_est> 180000000000000 (1.8 x 10^14 floating point operations)

One of them finished in 4468.23 seconds. BOINC calculates a speed of 40.284 GFLOPS

Another finished in 168.23 seconds. BOINC calculates a speed of 1.069 TFLOPS.

Another is still running after 2 hours 35 minutes, and hasn't reached 20% yet. I reckon it's on target for 13.4 hours, or 48184 seconds. That would be a 'BOINC speed' of 3.735 GFLOPS.

BOINC will average all of those speeds (and many more runs besides) to come up with an APR for these tasks being run by this application version (the x86 SSE2, as it happens - current working estimate 7.14 GFLOPS). That will be used as a comparison with the other applications, measured empirically in the same way, purely to decide which application to send next time.
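The arithmetic behind those per-task "BOINC speed" figures is simply the declared job size divided by the runtime. Using the numbers from this post:

```python
# Declared job size from the post: <rsc_fpops_est> = 1.8e14 floating point ops.
RSC_FPOPS_EST = 1.8e14

def boinc_speed_gflops(runtime_seconds):
    """BOINC's implied per-task speed: estimated ops / elapsed seconds, in GFLOPS."""
    return RSC_FPOPS_EST / runtime_seconds / 1e9

normal = boinc_speed_gflops(4468.23)   # ~40.3 GFLOPS
short = boinc_speed_gflops(168.23)     # ~1070 GFLOPS, i.e. ~1.07 TFLOPS
long_run = boinc_speed_gflops(48184)   # ~3.7 GFLOPS
```

The spread - roughly 3.7 GFLOPS to over a TFLOPS for the same "size" of job - shows how sensitive the average is to tasks whose real work differs from the fixed <rsc_fpops_est>.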
39) Message boards : Number crunching : Wrong applications sent to my computer? (Message 27159)
Posted 14 Feb 2015 by Richard Haselgrove
Post:
I'm getting tasks mostly allocated to the sse2 application too, but I think the scheduler - within its own limitations - is operating as it was designed to.

The BOINC server keeps track of the apparent efficiency of each application, on each computer. You can see the values on the Application details link for each host - that link is my i7.

The efficiency is expressed as the 'Average processing rate' (APR) in GFLOPS. At the time of writing, my i7 is showing:

32-bit apps
SSE2: 11.72 GFLOPS
PNI: 10.12 GFLOPS

64-bit apps
SSE2: 13.51 GFLOPS
PNI: 10.51 GFLOPS

(make sure you look at the current 451.07 version, at the bottom of the list, when checking your own values)

So, on the information available, the scheduler is correct in picking SSE2 for that host (and a few others I've checked).

So, why does the SSE2 app appear to be faster, when we all know it isn't in reality? It's all to do with the averaging - the server doesn't measure the speed of each application directly, but works it out from the size of the jobs being sent out, and the time they take to run.

For this particular project, there are two problems with this approach.

1) 'The size of the job'. This is declared by the scientist submitting the job, and known as <rsc_fpops_est> - it's also used to calculate the estimated runtime shown in the task list in BOINC Manager. I haven't been keeping detailed records, but I have a suspicion that not every job has had the appropriate <rsc_fpops_est> recently: if the estimate is too low, and the job runs longer than BOINC expects, then the speed appears low and the application less efficient.

2) 'The time they take to run'. As we know, LHC is looking for collider design parameters which result in stable orbits - and they are particularly interested in finding and eliminating instabilities which result in particles colliding with the tunnel wall or magnets. Better that virtual particles hit virtual walls in our computers, than in the real thing.

Maybe there was a batch of long-running tasks which drove down the APRs for SSE3/PNI, followed by a batch of tunnel-hitters just after we'd switched to SSE2? I can't be certain, but it's possible.

Within the limits of the current BOINC runtime-estimation tools, I'm not sure what the project can do about this. One thing would be to stress on the scientists the importance of checking and adjusting the <rsc_fpops_est> for the jobs they're submitting. Another possibility is marking the tunnel-hitters as 'Runtime outliers' (via the validator), so that a task which finishes early doesn't get taken to mean a super-fast processor or application.
40) Message boards : News : Server Intervention 10-Feb-2014 (Message 27143)
Posted 11 Feb 2015 by Richard Haselgrove
Post:
Yes, everything seems to be back in place from here, too. Thanks.




©2024 CERN