Available work?
Author | Message |
---|---|
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Trying out 3 rights sounds like a good plan to test. However, don't you still need to have a total of 5, like we used to have? Right now (if I understand it correctly) having a total of 3 means that if 1 fails, it will not send out new and no one will get credit? Or is it just me reading it wrong? Almost, if I am right: if the three don't match, an additional WU is put at the end of the queue :-( and should be assigned to a "reliable" host. So unless I get three empty/null results, I should get a good result, or by a miracle I may get the same bad result from three hosts, and that would be a really good clue. Eric. P.S. I seem to remember a comment about available work that I can't find (03:00 here!). Work tends to come in batches, often from different customers at the same time. Since we have first in, first out, the different customers then slow each other down in real time. I think we should rather run three different studies consecutively: the 1st user waits three days, say, the 2nd 6 days, and the 3rd 9 days. Is this better than 3 customers all waiting 9 days? A classic scheduling issue. I am thinking of introducing an additional layer of control and monitoring between the customer and BOINC. This would allow customers to enquire about study progress, CERN-specific scheduling, and better management of CERN local resources like disk space. FIRST solve the annoying result differences!!! |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Interestingly, I have seen a significant drop in "inconclusive" results over the last few days: only 2-4 per day, not scores as before. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Remember I suspended a big culprit, but it takes time to become effective. Expect a real fix soonest. Eric. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
P.S. I seem to remember a comment about available work that I can't find I remember making a remark like that last time we had Lots of Work! It was a fairly trivial throw-away comment at the time, but a few days later we had quite a big problem with Error reported by file upload server: Server is out of disk space. I can't help thinking the two events were related, and precisely for that reason - if a tie-breaker needs to be issued and goes to the end of a very long 'all researchers' combined queue, then all the result files for the workunit have to hang around for that much longer and clog up all the pipes. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Right; in the meantime I have just found a bug with my gfortran version, either in gcc or in the gcc/gfortran interface. I am not pursuing it because production is with ifort and all works fine with older versions of gcc/gfortran. I am really praying that my current Easter tests will give insight into the current ifort production invalid results. "Only" a very small percentage of WUs, but really annoying. Eric. (One day I hope to get SixTrack into SpecFP tests again or at least make it freely available as a compiler/hardware test program.) |
Joined: 3 Jul 08 Posts: 20 Credit: 8,281,604 RAC: 0 |
... having a total of 3 means that if 1 fails, it will not send out new and no one will get credit? I think you mean this: 30565542. I am one of the two computers with 17000 seconds. The faulty computer 10353061 broke nearly all tasks. This computer has an "EXIT_TIME_LIMIT_EXCEEDED" error after exactly 252.06 or 2520.60 seconds on all tasks. Unusual. Bye, Grubix. PS: I'm not so interested in credits, but in science. :-) |
Joined: 9 Jan 08 Posts: 66 Credit: 727,923 RAC: 0 |
Jep. That was what I was thinking about. Same here, in it for the Science and not the credits. I just fear that if a task gets rejected it's lost work even though 2 computers came with the correct answer. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well, thanks for your patience and support. I reckon I messed up again. The error limit of 3 is clearly wrong. It should be much bigger, like 10, say. I just tried to change it and I am not allowed to. I don't know why. I changed the file just yesterday. However, ls -l shows it hasn't changed since last October. I need to try again later. Eric. |
Joined: 3 Oct 06 Posts: 101 Credit: 8,994,586 RAC: 0 |
Just an example... Einstein@home uses a minimum quorum of 2 and a maximum of 20 tasks... Believe it or not, even with this extremely high reserve, rarely but sometimes several WUs end up with the status "too many errors"... ;-) |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Just an example... That approach works for Einstein, because they are performing long searches (many months) through large but consistent datasets. The preparation of individual workunits from those datasets can be automated, and very rarely results in significant numbers of 'impossible' tasks. When they do happen, it's usually because the staff are preparing a new search or application, and are paying close attention so they can catch problems with a small test batch quickly. Here, I get the impression that there's more direct human input into the preparation of smaller production batches. It's still rare, but it's more likely that one whole batch will go wrong. With a very high maximum replication number, and occasional very long queues, any bad batch would recirculate through the system for a long time until the very last WU met its maker. We certainly need a 'maximum tasks' limit high enough to allow for some errors to happen before quorum is reached, but as in life - moderation in all things. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks Richard; for production we use the settings shown below. I really goofed up with my test runs. Eric. # The number of redundant calculations. Set to two or more to achieve redundancy. redundancy=2 # The number of copies of the workunit to issue to clients. Must be at least the # number of redundant calculations or higher if a loss of results is expected # or if the result should be obtained fast. copies=2 # The number of errors from clients before the workunit is declared to have # an error. errors=5 # The total number of clients to issue the workunit to before it is declared # to have an error. numIssues=5 # The total number of returned results without a consensus being found before the # workunit is declared to have an error. resultsWithoutConcensus=3 |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Yes, it's a complicated juggling act, made more so for users because we only see the simplified "max # of error/total/success tasks", and I wasn't aware of some of the nuances of definition of some of those entries until today. It looks as if https://boinc.berkeley.edu/trac/wiki/JobIn#scheduling is the place to be. I think I have come to the conclusion, but I could well be wrong, that: max_success_results (your resultsWithoutConcensus) should always be above min_quorum (your redundancy), but perhaps only by one. That would allow one inconclusive result, but not go on to allow an increased risk of two bad hosts validating each other. So these two values should march in step: if one changes, they both should be changed. max_error_results (your errors) could perhaps come down a bit, depending on your observed sporadic error rate across the population of hosts. This value would be the one which kills a 'bad WU' in most cases, so it should be larger than the sporadic error rate by enough to avoid too many false triggers, but not by enough to allow bad WUs to clog the database. If max_error_results and max_success_results are set properly, one or other of them will always come into play before max_total_results (your numIssues), so maybe Einstein's value of 20 isn't so far wrong after all. But I think max_total_results should be set >= max_error_results + max_success_results, else it will kick in prematurely at, say, one success plus four errors, or two inconclusive plus three errors. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
I have of course stopped submitting WUs. My cases are intermixed with regular production, and while I could identify my WUs I don't know how to delete them other than one at a time (or by range) :-(. On the other hand, as planned, I am getting valuable information back about Validated but wrong results, as well as more info about the empty result files. Tomorrow I will try to identify my WUs and perhaps find a finite number of ranges to delete... With the help of my colleague Alex we are very close to identifying the gcc problem. Eric. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well, first beams circulated today! We currently have some 300,000 WUs queued, of which about 53,000 at most are mine. (Obviously my CERN customers are not all on vacation, or at least they took their laptops!) I have already got back about 30,000 results, and in addition I have about 1087 wrong but Validated results. Many of these will be the "null" file problem results, but I have already identified some real new issues. Should have fun analysing all this. Eric. |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
New tasks? Shed some light on us, we are starving |
Joined: 4 May 07 Posts: 250 Credit: 826,541 RAC: 0 |
New tasks? How can you be "starving" when you participate in 21 projects? I'm doing work for 4 and I always have enough work to keep busy. Granted, it's not always LHC tasks, but I just keep crunching away. LHC tasks tend to be feast or famine. It's the nature of the work they are doing for the LHC. |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
New tasks? All BOINC projects are ranked for me, and LHC is one of the top ones. |
Joined: 13 Sep 05 Posts: 4 Credit: 1,559,173 RAC: 0 |
I haven't been able to get any work since about 26 June, but I can see that the number of WUs is very low these days. We want work, we want work! Greetings to everybody. |
Joined: 24 Jul 05 Posts: 17 Credit: 2,404,844 RAC: 2 |
Geez, folks. Get off Eric's back. It isn't like he sits up all night with a hammer and chisel making tiny rocks out of big ones. I remember not too many years ago before the LHC came on-line when there were no work units at all for many months at a stretch. WU will come when they come. Surely you all are working on other projects to pass the time. |
Joined: 4 May 07 Posts: 250 Credit: 826,541 RAC: 0 |
HEAR! HEAR! Geez, folks. Get off Eric's back. It isn't like he sits up all night with a hammer and chisel making tiny rocks out of big ones. I remember not too many years ago before the LHC came on-line when there were no work units at all for many months at a stretch. WU will come when they come. Surely you all are working on other projects to pass the time. |
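
Eric's scheduling question earlier in the thread (three studies run one after another versus mixed in a single first-in, first-out queue) can be checked with a little arithmetic. The sketch below is only an illustration: it assumes three studies that each need three days of the whole volunteer pool and are all submitted together, and it models the mixed queue as equal sharing of capacity between unfinished studies. The function names are made up for this example and are not part of BOINC or the SixTrack tooling.

```python
def completion_days_consecutive(study_days):
    """Run studies one after another; return the day on which each one finishes."""
    finished = []
    elapsed = 0
    for days in study_days:
        elapsed += days
        finished.append(elapsed)
    return finished


def completion_days_interleaved(study_days):
    """Share capacity equally between unfinished studies (idealised mixed queue)."""
    remaining = [float(d) for d in study_days]
    finished = [0.0] * len(study_days)
    elapsed = 0.0
    active = [i for i, d in enumerate(remaining) if d > 0]
    while active:
        # The smallest remaining study ends next; every active study
        # progresses at 1/len(active) of full speed until then.
        step = min(remaining[i] for i in active)
        elapsed += step * len(active)
        for i in active:
            remaining[i] -= step
            if remaining[i] <= 0:
                finished[i] = elapsed
        active = [i for i in active if remaining[i] > 0]
    return finished


def mean(values):
    return sum(values) / len(values)


# Assumed workload: three studies, three days of full capacity each.
studies = [3, 3, 3]
consecutive = completion_days_consecutive(studies)
interleaved = completion_days_interleaved(studies)
print("consecutive:", consecutive, "mean wait:", mean(consecutive))
print("interleaved:", interleaved, "mean wait:", mean(interleaved))
```

Under these assumptions no customer finishes later with consecutive runs, and the average wait drops from 9 days to 6.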
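
Richard's rules of thumb for the workunit limits can be written as a small check. This is a rough sketch, not project code: the function name is hypothetical, the values are the production settings Eric posted, and the mapping of the project's names to BOINC's (redundancy to min_quorum, errors to max_error_results, numIssues to max_total_results, resultsWithoutConcensus to max_success_results) is the one suggested in Richard's post.

```python
def check_workunit_limits(cfg):
    """Return warnings for settings that break the rules of thumb from the thread."""
    warnings = []
    min_quorum = cfg["redundancy"]
    max_error = cfg["errors"]
    max_total = cfg["numIssues"]
    max_success = cfg["resultsWithoutConcensus"]

    # From the config comments: copies must be at least the redundancy.
    if cfg["copies"] < min_quorum:
        warnings.append("copies should be at least redundancy")

    # Richard: max_success_results should always be above min_quorum.
    if max_success <= min_quorum:
        warnings.append("resultsWithoutConcensus should be above redundancy")

    # Richard: max_total_results >= max_error_results + max_success_results,
    # otherwise the total limit kicks in before the other two can.
    if max_total < max_error + max_success:
        warnings.append(
            f"numIssues is {max_total}, below errors + "
            f"resultsWithoutConcensus = {max_error + max_success}"
        )
    return warnings


# Production values as posted in the thread.
production = {
    "redundancy": 2,
    "copies": 2,
    "errors": 5,
    "numIssues": 5,
    "resultsWithoutConcensus": 3,
}

for warning in check_workunit_limits(production):
    print(warning)
```

With the posted production values only the last rule is flagged: numIssues is 5, while errors plus resultsWithoutConcensus is 8, which matches Richard's concern that the total limit could trigger at one success plus four errors.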
©2024 CERN