Message boards : Number crunching : Available work?
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27271 - Posted: 4 Apr 2015, 0:59:55 UTC - in response to Message 27269.  

Trying out 3 right results sounds like a good plan to test. However, don't you still need a total of 5, like we used to have? Right now (if I understand it correctly) having a total of 3 means that if 1 fails, it will not send out a new one and no one will get credit? Or am I just reading it wrong?

So it would be 3/5/3 (3 wrong, 5 total, 3 right). Then the least resources would be lost, and it would still have a chance to validate even if 1 of the 3 is wrong.


Almost; if I am right, if the three don't match, an additional WU is put at
the end of the queue :-( and should be assigned to a "reliable" host.
So unless I get three empty/null results, I should get a good result, or by
a miracle I may get the same bad result from three hosts, and that would be
a really good clue. Eric.
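
(A minimal sketch of the flow described above, assuming a quorum of three and treating results as opaque strings; the names are illustrative only and this is not BOINC's actual validator code:)

from collections import Counter

def check_quorum(results, quorum=3):
    """Toy model of the behaviour described above: if `quorum`
    results agree, the workunit validates; if the returned results
    do not agree, one extra task is appended to the end of the
    queue, ideally for a "reliable" host."""
    value, count = Counter(results).most_common(1)[0]
    if count >= quorum:
        return ("validated", value)      # canonical result found
    return ("issue_tie_breaker", None)   # extra copy for a reliable host

print(check_quorum(["a", "a", "a"]))  # ('validated', 'a')
print(check_quorum(["a", "b", ""]))   # ('issue_tie_breaker', None), e.g. one empty result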
P.S. I seem to remember a comment about available work that I can't find
(03:00 here!). Work tends to come in batches, often from different customers
at the same time. Since we have first-in first-out, the different customers
then slow each other down in real time. I think we should rather run three
different studies consecutively: the 1st user waits three days, say, the 2nd 6 days,
and the 3rd 9 days. Is this better than 3 customers waiting 9 days? A classic
scheduling issue. I am thinking of introducing an additional layer of
control and monitoring between the customer and BOINC. This would allow a
customer to enquire about study progress, CERN-specific scheduling, and
better management of CERN local resources like disk space.
FIRST solve the annoying result differences!!!
ID: 27271
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27272 - Posted: 4 Apr 2015, 2:18:20 UTC - in response to Message 27271.  
Last modified: 4 Apr 2015, 2:20:19 UTC

Interestingly, I have seen a significant drop in "inconclusive" results over the last few days - only 2-4 per day, not scores of them as before.
ID: 27272
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27273 - Posted: 4 Apr 2015, 5:11:48 UTC - in response to Message 27272.  

Remember, I suspended a big culprit, but it takes time to become
effective. Expect a real fix soonest. Eric.
ID: 27273
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27274 - Posted: 4 Apr 2015, 6:45:02 UTC - in response to Message 27271.  

P.S. I seem to remember a comment about available work that I can't find
(03:00 here!). Work tends to come in batches, often from different customers
at the same time. Since we have first-in first-out, the different customers
then slow each other down in real time. I think we should rather run three
different studies consecutively: the 1st user waits three days, say, the 2nd 6 days,
and the 3rd 9 days. Is this better than 3 customers waiting 9 days? A classic
scheduling issue. I am thinking of introducing an additional layer of
control and monitoring between the customer and BOINC. This would allow a
customer to enquire about study progress, CERN-specific scheduling, and
better management of CERN local resources like disk space.
FIRST solve the annoying result differences!!!

I remember making a remark like that last time we had Lots of Work!

It was a fairly trivial throw-away comment at the time, but a few days later we had quite a big problem with "Error reported by file upload server: Server is out of disk space". I can't help thinking the two events were related, and precisely for that reason - if a tie-breaker needs to be issued and goes to the end of a very long 'all researchers' combined queue, then all the result files for the workunit have to hang around for that much longer and clog up all the pipes.
ID: 27274
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27276 - Posted: 4 Apr 2015, 16:20:12 UTC - in response to Message 27274.  

Right; in the meantime I have just found a bug with my gfortran
version, either in gcc or the gcc/gfortran interface. I am not pursuing
it because production is with ifort, and all works fine with older
versions of gcc/gfortran. I am really praying that my current Easter
tests will give insight into the current ifort production invalid results.
"Only" a very small percentage of WUs, but really annoying. Eric.

(One day I hope to get SixTrack into SpecFP tests again or at least
make it freely available as a compiler/hardware test program.)
ID: 27276
Grubix

Joined: 3 Jul 08
Posts: 20
Credit: 8,281,604
RAC: 0
Message 27277 - Posted: 4 Apr 2015, 19:10:57 UTC - in response to Message 27269.  

... having a total of 3 means that if 1 fails, it will not send out a new one and no one will get credit?

I think you mean this one: 30565542

Mine is one of the two computers with 17,000 seconds. The faulty computer 10353061 has broken nearly all the tasks. That computer gets an "EXIT_TIME_LIMIT_EXCEEDED" error after exactly 252.06 or 2520.60 seconds on every task. Unusual.

Bye, Grubix.


PS: I'm not so interested in credits, but in science. :-)
ID: 27277
Uffe F

Joined: 9 Jan 08
Posts: 66
Credit: 727,923
RAC: 0
Message 27278 - Posted: 4 Apr 2015, 19:35:25 UTC - in response to Message 27277.  

Yep, that was what I was thinking about. Same here - in it for the science, not the credits. I just fear that if a task gets rejected, it's lost work even though 2 computers came up with the correct answer.
ID: 27278
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27279 - Posted: 4 Apr 2015, 20:00:07 UTC - in response to Message 27278.  

Well, thanks for your patience and support. I reckon I messed up
again. The error limit of 3 is clearly wrong; it should be much bigger,
say 10. I just tried to change it and I am not allowed to.
I don't know why. I changed the file just yesterday; however,
ls -l shows it hasn't changed since last October.
Need to try again later. Eric.
ID: 27279
metalius
Joined: 3 Oct 06
Posts: 101
Credit: 8,985,206
RAC: 43
Message 27289 - Posted: 5 Apr 2015, 9:27:16 UTC

Just an example...
Einstein@home uses a minimum quorum of 2 and a maximum of 20 tasks...
Believe it or not, even with this extremely high reserve, several WUs rarely but occasionally end up with the status "too many errors"... ;-)
ID: 27289
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27290 - Posted: 5 Apr 2015, 10:10:11 UTC - in response to Message 27289.  

Just an example...
Einstein@home uses a minimum quorum of 2 and a maximum of 20 tasks...
Believe it or not, even with this extremely high reserve, several WUs rarely but occasionally end up with the status "too many errors"... ;-)

That approach works for Einstein, because they are performing long searches (many months) through large but consistent datasets. The preparation of individual workunits from those datasets can be automated, and very rarely results in significant numbers of 'impossible' tasks. When they do happen, it's usually because the staff are preparing a new search or application, and are paying close attention so they can catch problems with a small test batch quickly.

Here, I get the impression that there's more direct human input into the preparation of smaller production batches. It's still rare, but it's more likely that one whole batch will go wrong. With a very high maximum replication number, and occasional very long queues, any bad batch would recirculate through the system for a long time until the very last WU met its maker.

We certainly need a 'maximum tasks' limit high enough to allow for some errors to happen before quorum is reached, but as in life - moderation in all things.
ID: 27290
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27291 - Posted: 5 Apr 2015, 10:27:10 UTC - in response to Message 27290.  

Thanks Richard; for production we use the settings shown below.
I really goofed up with my test runs. Eric.

# The number of redundant calculations. Set to two or more to achieve redundancy.
redundancy=2

# The number of copies of the workunit to issue to clients. Must be at least
# the number of redundant calculations; set it higher if a loss of results is
# expected or if the result should be obtained quickly.
copies=2

# The number of errors from clients before the workunit is declared to have
# an error.
errors=5

# The total number of clients to issue the workunit to before it is declared
# to have an error.
numIssues=5

# The total number of results returned without a consensus being found before
# the workunit is declared to have an error.
resultsWithoutConcensus=3
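
(For reference, these keys appear to correspond to BOINC's standard job parameters, as discussed in the next post; the copies -> target_nresults pairing is an assumption based on BOINC convention rather than anything stated in this thread:)

# Probable mapping from the submission keys above to BOINC's job
# parameters (https://boinc.berkeley.edu/trac/wiki/JobIn); the
# copies -> target_nresults entry is an assumption.
PARAM_MAP = {
    "redundancy":              "min_quorum",
    "copies":                  "target_nresults",
    "errors":                  "max_error_results",
    "numIssues":               "max_total_results",
    "resultsWithoutConcensus": "max_success_results",
}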
ID: 27291
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27292 - Posted: 5 Apr 2015, 13:31:19 UTC - in response to Message 27291.  

Yes, it's a complicated juggling act - made more so for users, because we only see the simplified "max # of error/total/success tasks", and I wasn't aware of the nuances in the definitions of some of those entries until today.

It looks as if https://boinc.berkeley.edu/trac/wiki/JobIn#scheduling is the place to be. I think I come to the conclusion - but I could well be wrong - that:

max_success_results (your resultsWithoutConcensus) should always be above min_quorum (your redundancy), but perhaps only by one - that would allow one inconclusive result, but not go on to allow an increased risk of two bad hosts validating each other. So these two values should march in step - if one changes, they both should be changed.

max_error_results (your errors) could perhaps come down a bit, depending on your observed sporadic error rate across the population of hosts. This value would be the one which kills a 'bad WU', in most cases, so it should be larger than the sporadic error rate by enough to avoid too many false triggers, but not by enough to allow bad WUs to clog the database.

If max_error_results and max_success_results are set properly, one or other of them will always come into play before max_total_results (your numIssues), so maybe Einstein's value of 20 isn't so far wrong after all. But I think max_total_results should be set >= max_error + max_success, else it will kick in prematurely at, say, one success plus four errors, or two inconclusive plus three errors.
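
(A small sanity check of the two constraints proposed above, using the BOINC parameter names from the JobIn page; the helper itself is hypothetical, not part of BOINC:)

def check_limits(min_quorum, max_error_results,
                 max_success_results, max_total_results):
    """Flag settings that break the rules of thumb above."""
    problems = []
    # Allow exactly one inconclusive result beyond the quorum,
    # without raising the risk of two bad hosts validating each other.
    if max_success_results != min_quorum + 1:
        problems.append("max_success_results should be min_quorum + 1")
    # Neither limit should kick in prematurely (e.g. at one
    # success plus four errors).
    if max_total_results < max_error_results + max_success_results:
        problems.append("max_total_results should be >= "
                        "max_error_results + max_success_results")
    return problems

# Eric's production values from the previous post:
print(check_limits(min_quorum=2, max_error_results=5,
                   max_success_results=3, max_total_results=5))
# -> flags the second rule, since 5 < 5 + 3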
ID: 27292
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27293 - Posted: 5 Apr 2015, 20:17:50 UTC

I have of course stopped submitting WUs. My cases are intermixed
with regular production, and while I could identify my WUs I don't
know how to delete them other than one at a time (or by range) :-(.
On the other hand, as planned, I am getting valuable information
back about validated-but-wrong results, as well as more info about the
empty result files. Tomorrow I will try and identify my WUs and
perhaps find a finite number of ranges to delete.....

With the help of my colleague Alex we are very close to identifying
the gcc problem. Eric.
ID: 27293
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27294 - Posted: 6 Apr 2015, 2:02:05 UTC

Well, first beams circulated today!

We currently have some 300,000 WUs queued, of which at most about 53,000
are mine. (Obviously my CERN customers are not all on vacation, or at
least they took their laptops!)

I have already got back about 30,000 results, and I have in addition about
1,087 wrong but validated results. Many of these will be the "null"
file problem results, but I have already identified some real new
issues. Should have fun analysing all this. Eric.
ID: 27294
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27560 - Posted: 9 Jul 2015, 4:56:48 UTC

New tasks?
Shed some light on us, we are starving!
ID: 27560
Tom95134

Joined: 4 May 07
Posts: 250
Credit: 826,541
RAC: 0
Message 27562 - Posted: 9 Jul 2015, 21:50:16 UTC - in response to Message 27560.  

New tasks?
Shed some light on us, we are starving


How can you be "starving" when you participate in 21 projects? I'm doing work for 4 and I always have enough work to keep busy. Granted, it's not always LHC tasks, but I just keep crunching away.

LHC tasks tend to be feast or famine. It's the nature of the work they are doing for the LHC.
ID: 27562
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27563 - Posted: 10 Jul 2015, 15:36:06 UTC - in response to Message 27562.  
Last modified: 10 Jul 2015, 15:36:47 UTC

New tasks?
Shed some light on us, we are starving!


How can you be "starving" when you participate in 21 projects? I'm doing work for 4 and I always have enough work to keep busy. Granted, it's not always LHC tasks, but I just keep crunching away.

LHC tasks tend to be feast or famine. It's the nature of the work they are doing for the LHC.

All my BOINC projects are ranked, and LHC is one of the top ones.
ID: 27563
[AF>France>Bourgogne]Patouchon

Joined: 13 Sep 05
Posts: 4
Credit: 1,559,173
RAC: 0
Message 27564 - Posted: 11 Jul 2015, 6:36:20 UTC

I haven't been able to get any work since around the 26th of June, and I can see that the number of WUs is very small these days. We want work, we want work!

Greetings to everybody
ID: 27564
Thund3rb1rd

Joined: 24 Jul 05
Posts: 17
Credit: 2,342,022
RAC: 271
Message 27565 - Posted: 11 Jul 2015, 7:10:52 UTC

Geez, folks. Get off Eric's back. It isn't like he sits up all night with a hammer and chisel making tiny rocks out of big ones. I remember, not too many years ago before the LHC came online, when there were no work units at all for many months at a stretch. WUs will come when they come. Surely you all are working on other projects to pass the time.
ID: 27565
Tom95134

Joined: 4 May 07
Posts: 250
Credit: 826,541
RAC: 0
Message 27566 - Posted: 11 Jul 2015, 21:51:48 UTC - in response to Message 27565.  

HEAR! HEAR!

Geez, folks. Get off Eric's back. It isn't like he sits up all night with a hammer and chisel making tiny rocks out of big ones. I remember, not too many years ago before the LHC came online, when there were no work units at all for many months at a stretch. WUs will come when they come. Surely you all are working on other projects to pass the time.
ID: 27566