Message boards : Number crunching : Available work?
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27271 - Posted: 4 Apr 2015, 0:59:55 UTC - in response to Message 27269.  

Trying out 3 right results sounds like a good plan to test. However, don't you still need a total of 5, like we used to have? Right now (if I understand it correctly) having a total of 3 means that if 1 fails, it will not send out a new one and no one will get credit? Or am I just reading it wrong?

So it would be 3/5/3 (3 wrong, 5 total, 3 right). Then the least resources would be lost, and it would still have a chance to validate even if 1 of the 3 is wrong.


Almost; if I am right, if the three don't match, an additional WU is put at
the end of the queue :-( and should be assigned to a "reliable" host.
So unless I get three empty/null results, I should get a good result, or by
a miracle I may get the same bad result from three hosts, and that would be
a really good clue. Eric.
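
(A minimal sketch of the flow described above, assuming a quorum of three and treating results as opaque strings; the names are illustrative only and this is not BOINC's actual validator code:)

from collections import Counter

def check_quorum(results, quorum=3):
    """Toy model of the behaviour described above: if `quorum`
    results agree, the workunit validates; if the returned results
    do not agree, one extra task is appended to the end of the
    queue, ideally for a "reliable" host."""
    value, count = Counter(results).most_common(1)[0]
    if count >= quorum:
        return ("validated", value)      # canonical result found
    return ("issue_tie_breaker", None)   # extra copy for a reliable host

print(check_quorum(["a", "a", "a"]))  # ('validated', 'a')
print(check_quorum(["a", "b", ""]))   # ('issue_tie_breaker', None), e.g. one empty result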
P.S. I seem to remember a comment about available work that I can't find
(03:00 here!). Work tends to come in batches, often from different customers
at the same time. Since we have first-in first-out, the different customers
then slow each other down in real time. I think we should rather run three
different studies consecutively: the 1st user waits three days, say, the 2nd 6 days,
and the 3rd 9 days. Is this better than 3 customers waiting 9 days? A classic
scheduling issue. I am thinking of introducing an additional layer of
control and monitoring between the customer and BOINC. This would allow a
customer to enquire about study progress, CERN-specific scheduling, and
better management of CERN local resources like disk space.
FIRST solve the annoying result differences!!!
ID: 27271
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27272 - Posted: 4 Apr 2015, 2:18:20 UTC - in response to Message 27271.  
Last modified: 4 Apr 2015, 2:20:19 UTC

Interestingly, I have seen a significant drop in "inconclusive" results over the last few days - only 2-4 per day, not scores of them as before.
ID: 27272
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27273 - Posted: 4 Apr 2015, 5:11:48 UTC - in response to Message 27272.  

Remember, I suspended a big culprit, but it takes time to become
effective. Expect a real fix soonest. Eric.
ID: 27273
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27274 - Posted: 4 Apr 2015, 6:45:02 UTC - in response to Message 27271.  

P.S. I seem to remember a comment about available work that I can't find
(03:00 here!). Work tends to come in batches, often from different customers
at the same time. Since we have first-in first-out, the different customers
then slow each other down in real time. I think we should rather run three
different studies consecutively: the 1st user waits three days, say, the 2nd 6 days,
and the 3rd 9 days. Is this better than 3 customers waiting 9 days? A classic
scheduling issue. I am thinking of introducing an additional layer of
control and monitoring between the customer and BOINC. This would allow a
customer to enquire about study progress, CERN-specific scheduling, and
better management of CERN local resources like disk space.
FIRST solve the annoying result differences!!!

I remember making a remark like that last time we had Lots of Work!

It was a fairly trivial throw-away comment at the time, but a few days later we had quite a big problem with "Error reported by file upload server: Server is out of disk space". I can't help thinking the two events were related, and precisely for that reason - if a tie-breaker needs to be issued and goes to the end of a very long 'all researchers' combined queue, then all the result files for the workunit have to hang around for that much longer and clog up all the pipes.
ID: 27274
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27276 - Posted: 4 Apr 2015, 16:20:12 UTC - in response to Message 27274.  

Right; in the meantime I have just found a bug with my gfortran
version, either in gcc or the gcc/gfortran interface. I am not pursuing
it because production is with ifort, and all works fine with older
versions of gcc/gfortran. I am really praying that my current Easter
tests will give insight into the current ifort production invalid results.
"Only" a very small percentage of WUs, but really annoying. Eric.

(One day I hope to get SixTrack into SpecFP tests again or at least
make it freely available as a compiler/hardware test program.)
ID: 27276
Grubix

Joined: 3 Jul 08
Posts: 20
Credit: 8,281,604
RAC: 0
Message 27277 - Posted: 4 Apr 2015, 19:10:57 UTC - in response to Message 27269.  

... having a total of 3 means that if 1 fails, it will not send out a new one and no one will get credit?

I think you mean this one: 30565542

Mine is one of the two computers with 17,000 seconds. The faulty computer 10353061 has broken nearly all the tasks. That computer gets an "EXIT_TIME_LIMIT_EXCEEDED" error after exactly 252.06 or 2520.60 seconds on every task. Unusual.

Bye, Grubix.


PS: I'm not so interested in credits, but in science. :-)
ID: 27277
Uffe F

Joined: 9 Jan 08
Posts: 66
Credit: 727,923
RAC: 0
Message 27278 - Posted: 4 Apr 2015, 19:35:25 UTC - in response to Message 27277.  

Yep, that was what I was thinking about. Same here - in it for the science, not the credits. I just fear that if a task gets rejected, it's lost work even though 2 computers came up with the correct answer.
ID: 27278
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27279 - Posted: 4 Apr 2015, 20:00:07 UTC - in response to Message 27278.  

Well, thanks for your patience and support. I reckon I messed up
again. The error limit of 3 is clearly wrong; it should be much bigger,
say 10. I just tried to change it and I am not allowed to.
I don't know why. I changed the file just yesterday; however,
ls -l shows it hasn't changed since last October.
Need to try again later. Eric.
ID: 27279
metalius
Joined: 3 Oct 06
Posts: 101
Credit: 8,985,206
RAC: 43
Message 27289 - Posted: 5 Apr 2015, 9:27:16 UTC

Just an example...
Einstein@home uses a minimum quorum of 2 and a maximum of 20 tasks...
Believe it or not, even with this extremely high reserve, several WUs rarely but occasionally end up with the status "too many errors"... ;-)
ID: 27289
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27290 - Posted: 5 Apr 2015, 10:10:11 UTC - in response to Message 27289.  

Just an example...
Einstein@home uses a minimum quorum of 2 and a maximum of 20 tasks...
Believe it or not, even with this extremely high reserve, several WUs rarely but occasionally end up with the status "too many errors"... ;-)

That approach works for Einstein, because they are performing long searches (many months) through large but consistent datasets. The preparation of individual workunits from those datasets can be automated, and very rarely results in significant numbers of 'impossible' tasks. When they do happen, it's usually because the staff are preparing a new search or application, and are paying close attention so they can catch problems with a small test batch quickly.

Here, I get the impression that there's more direct human input into the preparation of smaller production batches. It's still rare, but it's more likely that one whole batch will go wrong. With a very high maximum replication number, and occasional very long queues, any bad batch would recirculate through the system for a long time until the very last WU met its maker.

We certainly need a 'maximum tasks' limit high enough to allow for some errors to happen before quorum is reached, but as in life - moderation in all things.
ID: 27290
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27291 - Posted: 5 Apr 2015, 10:27:10 UTC - in response to Message 27290.  

Thanks Richard; for production we use the settings shown below.
I really goofed up with my test runs. Eric.

# The number of redundant calculations. Set to two or more to achieve redundancy.
redundancy=2

# The number of copies of the workunit to issue to clients. Must be at least
# the number of redundant calculations; set it higher if a loss of results is
# expected or if the result should be obtained quickly.
copies=2

# The number of errors from clients before the workunit is declared to have
# an error.
errors=5

# The total number of clients to issue the workunit to before it is declared
# to have an error.
numIssues=5

# The total number of results returned without a consensus being found before
# the workunit is declared to have an error.
resultsWithoutConcensus=3
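
(For reference, these keys appear to correspond to BOINC's standard job parameters, as discussed in the next post; the copies -> target_nresults pairing is an assumption based on BOINC convention rather than anything stated in this thread:)

# Probable mapping from the submission keys above to BOINC's job
# parameters (https://boinc.berkeley.edu/trac/wiki/JobIn); the
# copies -> target_nresults entry is an assumption.
PARAM_MAP = {
    "redundancy":              "min_quorum",
    "copies":                  "target_nresults",
    "errors":                  "max_error_results",
    "numIssues":               "max_total_results",
    "resultsWithoutConcensus": "max_success_results",
}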
ID: 27291
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27292 - Posted: 5 Apr 2015, 13:31:19 UTC - in response to Message 27291.  

Yes, it's a complicated juggling act - made more so for users, because we only see the simplified "max # of error/total/success tasks", and I wasn't aware of the nuances in the definitions of some of those entries until today.

It looks as if https://boinc.berkeley.edu/trac/wiki/JobIn#scheduling is the place to be. I think I come to the conclusion - but I could well be wrong - that:

max_success_results (your resultsWithoutConcensus) should always be above min_quorum (your redundancy), but perhaps only by one - that would allow one inconclusive result, but not go on to allow an increased risk of two bad hosts validating each other. So these two values should march in step - if one changes, they both should be changed.

max_error_results (your errors) could perhaps come down a bit, depending on your observed sporadic error rate across the population of hosts. This value would be the one which kills a 'bad WU', in most cases, so it should be larger than the sporadic error rate by enough to avoid too many false triggers, but not by enough to allow bad WUs to clog the database.

If max_error_results and max_success_results are set properly, one or other of them will always come into play before max_total_results (your numIssues), so maybe Einstein's value of 20 isn't so far wrong after all. But I think max_total_results should be set >= max_error + max_success, else it will kick in prematurely at, say, one success plus four errors, or two inconclusive plus three errors.
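
(A small sanity check of the two constraints proposed above, using the BOINC parameter names from the JobIn page; the helper itself is hypothetical, not part of BOINC:)

def check_limits(min_quorum, max_error_results,
                 max_success_results, max_total_results):
    """Flag settings that break the rules of thumb above."""
    problems = []
    # Allow exactly one inconclusive result beyond the quorum,
    # without raising the risk of two bad hosts validating each other.
    if max_success_results != min_quorum + 1:
        problems.append("max_success_results should be min_quorum + 1")
    # Neither limit should kick in prematurely (e.g. at one
    # success plus four errors).
    if max_total_results < max_error_results + max_success_results:
        problems.append("max_total_results should be >= "
                        "max_error_results + max_success_results")
    return problems

# Eric's production values from the previous post:
print(check_limits(min_quorum=2, max_error_results=5,
                   max_success_results=3, max_total_results=5))
# -> flags the second rule, since 5 < 5 + 3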
ID: 27292
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27293 - Posted: 5 Apr 2015, 20:17:50 UTC

I have of course stopped submitting WUs. My cases are intermixed
with regular production, and while I could identify my WUs I don't
know how to delete them other than one at a time (or by range) :-(.
On the other hand, as planned, I am getting valuable information
back about validated-but-wrong results, as well as more info about the
empty result files. Tomorrow I will try and identify my WUs and
perhaps find a finite number of ranges to delete.....

With the help of my colleague Alex we are very close to identifying
the gcc problem. Eric.
ID: 27293
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27294 - Posted: 6 Apr 2015, 2:02:05 UTC

Well, first beams circulated today!

We currently have some 300,000 WUs queued, of which at most about 53,000
are mine. (Obviously my CERN customers are not all on vacation, or at
least they took their laptops!)

I have already got back about 30,000 results, and I have in addition about
1,087 wrong but validated results. Many of these will be the "null"
file problem results, but I have already identified some real new
issues. Should have fun analysing all this. Eric.
ID: 27294
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27560 - Posted: 9 Jul 2015, 4:56:48 UTC

New tasks?
Shed some light on us, we are starving!
ID: 27560
Tom95134

Joined: 4 May 07
Posts: 250
Credit: 826,541
RAC: 0
Message 27562 - Posted: 9 Jul 2015, 21:50:16 UTC - in response to Message 27560.  

New tasks?
Shed some light on us, we are starving


How can you be "starving" when you participate in 21 projects? I'm doing work for 4 and I always have enough work to keep busy. Granted, it's not always LHC tasks, but I just keep crunching away.

LHC tasks tend to be feast or famine. It's the nature of the work they are doing for the LHC.
ID: 27562
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27563 - Posted: 10 Jul 2015, 15:36:06 UTC - in response to Message 27562.  
Last modified: 10 Jul 2015, 15:36:47 UTC

New tasks?
Shed some light on us, we are starving!


How can you be "starving" when you participate in 21 projects? I'm doing work for 4 and I always have enough work to keep busy. Granted, it's not always LHC tasks, but I just keep crunching away.

LHC tasks tend to be feast or famine. It's the nature of the work they are doing for the LHC.

All my BOINC projects are ranked, and LHC is one of the top ones.
ID: 27563
[AF>France>Bourgogne]Patouchon

Joined: 13 Sep 05
Posts: 4
Credit: 1,559,173
RAC: 0
Message 27564 - Posted: 11 Jul 2015, 6:36:20 UTC

I haven't been able to get any work since around the 26th of June, and I can see that the number of WUs is very small these days. We want work, we want work!

Greetings to everybody
ID: 27564
Thund3rb1rd

Joined: 24 Jul 05
Posts: 17
Credit: 2,342,022
RAC: 271
Message 27565 - Posted: 11 Jul 2015, 7:10:52 UTC

Geez, folks. Get off Eric's back. It isn't like he sits up all night with a hammer and chisel making tiny rocks out of big ones. I remember, not too many years ago before the LHC came online, when there were no work units at all for many months at a stretch. WUs will come when they come. Surely you all are working on other projects to pass the time.
ID: 27565
Tom95134

Joined: 4 May 07
Posts: 250
Credit: 826,541
RAC: 0
Message 27566 - Posted: 11 Jul 2015, 21:51:48 UTC - in response to Message 27565.  

HEAR! HEAR!

Geez, folks. Get off Eric's back. It isn't like he sits up all night with a hammer and chisel making tiny rocks out of big ones. I remember, not too many years ago before the LHC came online, when there were no work units at all for many months at a stretch. WUs will come when they come. Surely you all are working on other projects to pass the time.
ID: 27566