41) Message boards : Number crunching : Can't Access Work Units (Message 17342)
Posted 13 Jul 2007 by Profile Gary Roberts
Post:
...I don't think any of those will get out to the public...


Of course they did - I got about 30 of them spread over just two boxes. This is an example of the full quorum for one of them. As you can see, it was over in a flash. Yes there were some longer running ones but the whole 30 of mine were crunched and returned in less than a day. I didn't even know about it until I saw the reference to the small number "in progress" on the front page :).

42) Message boards : Number crunching : Machine with a 40 Day Cache that doesn't Timeout????? (Message 16270)
Posted 7 Feb 2007 by Profile Gary Roberts
Post:

I am only expressing a view, of course, and so is Gary. Neither of us is going to be offended if N & A take a different view on the priorities.


Exactly!!

I've made my points here (and perhaps Gary will refute them all) in the hope that my thoughts will be helpful to the new admins, not to tell them their job.

R~~


Nope.... No refutation from me. Just general agreement.

I work on the principle that most people interpret "low priority" to mean "never" or perhaps a month this side of eternity in the overall scheme of things. On the other hand "high priority" means that "we'll get to it when we can" and that will probably be sometime before the project has completely finished :).

Another way that you could look at it is that "high priority" means that we have actually recorded it on the "ToDo List" - somewhere :). You can figure out for yourself where "low priority" stuff gets put :).

Joking aside, I'm just happy to have the problem somewhere "on the Radar".

43) Message boards : Number crunching : Machine with a 40 Day Cache that doesn't Timeout????? (Message 16246)
Posted 6 Feb 2007 by Profile Gary Roberts
Post:
It is also worth saying that, in my opinion, this was almost certainly not a deliberate exploit.


I don't really care if it was deliberate or not and I'm certainly not making any accusations on that score. My interest is really more in identifying deficiencies in the work distribution process that need to be rectified.

For one thing, only one of this user's boxes seems to have been affected...


Actually, that's not correct if you look closely. I've found examples of post-deadline work being granted credit on each of the five most recently communicating boxes in the list. This one is particularly interesting. It received new work on 31-Dec, 03-Jan, 04-Jan and 08-Jan. All of the January work, with the exception of two WUs, was completed and returned between 07-Jan and 11-Jan. Two 08-Jan WUs were returned on 15-Jan. The interesting bit is that all the 31-Dec WUs (34 in total) were returned between 11-Jan and 14-Jan, i.e., after most of the January work and way after the deadline!! Surely the machine should have gone into EDF mode and completed all the older work first?? On 07-Jan, when the deadline had passed for the 31-Dec work, how was the box able to get further new work on 08-Jan with all the old expired work still hanging around??

Maybe there is a rational explanation but as I understand it BOINC just shouldn't allow this to happen. It should have refused to download more work on 08-Jan.
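The rule being described can be sketched in a few lines. This is a toy illustration only, not real BOINC scheduler code, using the dates from the example above and a deliberately simplified "refuse new work while expired work is on hand" check:

```python
from datetime import date

# Toy illustration of the check the scheduler apparently skipped:
# a host still holding expired (past-deadline) work asks for more.
deadline_31dec_batch = date(2007, 1, 7)   # deadline for the 31-Dec work
work_request_day = date(2007, 1, 8)       # day the box asked for more

holding_expired_work = work_request_day > deadline_31dec_batch
grant_new_work = not holding_expired_work
print(grant_new_work)  # False -- the 08-Jan request should have been refused
```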

The run length of work on this project is only loosely connected ....


I'm fully aware of all this and whilst you do get variations of the sort you describe, I've never seen those variations to be significant enough to create a 40 day cache!! :). In fact, the normal sprinkling of short lasting results which don't run the full distance means that work is often completed in a significantly shorter time than the cache size would suggest.

Hopefully, these sorts of oddities will be eliminated when the admins get a more recent version of the BOINC server software up and running and properly debugged.

44) Message boards : Number crunching : Machine with a 40 Day Cache that doesn't Timeout????? (Message 16244)
Posted 6 Feb 2007 by Profile Gary Roberts
Post:

It seems to me that they have switched off the test for a result being past the deadline.


I think this is precisely the case. I've examined some work that one of my machines received on 10-Jan which was work resent as a result of a quorum not being completed for some 03-Jan work. Three of the machines in the original 03-Jan quorum did not return the work by the deadline so three followup results were sent out on 10-Jan. Those three extra results were returned almost immediately and then on 12-Jan one of the original three defaulters returned its result and was awarded credit. I've seen this sort of thing several times now and I haven't been able to find a single example of late returned work being denied credit.

This needs to be given a high priority for fixing as it's otherwise going to be exploited mercilessly and will make the orderly distribution of new work a total shambles.

If I am right, then chalk up another low priority task for our new admins.


I reckon you are right but make that high priority rather than low.

(you may remember me as Gravywavy on Einstein)


Of course I know exactly who you are :). I lurk on these boards occasionally and will sometimes post if I get sufficiently stirred up about something :). I certainly read everything you post if I happen to be around at the time because I know it will be worth reading :).


45) Message boards : Number crunching : That was fast (Message 16242)
Posted 6 Feb 2007 by Profile Gary Roberts
Post:
well, those WU's got sent out quick.... impressive....


At first glance, yes, it appears impressive, but when you look into it, it's not that surprising. Compared with past runs, the WUs seem to be of very short duration so that a machine asking for a day's work gets a whole bunch.

I've noticed several of my machines that got a sizeable number each but have now finished them all.

46) Message boards : Number crunching : That was fast (Message 16237)
Posted 6 Feb 2007 by Profile Gary Roberts
Post:
Weird.

BOINC says "no work from project" when I update, but frontpage says ~40,000 WU available. What gives?


Not weird at all!! :). Just read it more carefully!! :)

The front page says that there are approximately 40,000 results in progress. This means they are out there on user machines being done. Currently there are no more to send.

47) Message boards : Number crunching : Machine with a 40 Day Cache that doesn't Timeout????? (Message 16236)
Posted 6 Feb 2007 by Profile Gary Roberts
Post:
If you take a look at the results list for this machine, you will see something startling. It's a dual processor Opteron 240 and is part of a group of machines owned by this user.

There are a total of 1091 results listed for the machine, of which around 500 were issued around 1:00 to 2:00 UTC on 29-Dec-2006. According to the results list, each result was taking close to 14,000 secs (around 3.8 hours) to complete. Taking into account the 2 processors, 500 results represents about 40 days of work per processor (250 x 3.8 / 24 = 39.58 days).
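The arithmetic above is easy to verify, using the figures as read from the results list (~500 results at ~3.8 hours each, shared across 2 processors):

```python
# Quick check of the cache-size arithmetic: days of queued work per CPU.
results = 500
hours_each = 3.8
cpus = 2

days_per_cpu = (results / cpus) * hours_each / 24
print(round(days_per_cpu, 2))  # 39.58 -- about 40 days of work per CPU
```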

That in itself represents quite a feat. How do you convince a server to send you 40 days work where the deadline is usually around 6-7 days? I guess the answer to that is to convince the server that you can do the work 10 times faster than you actually can and then get the server to send you 4 days of work in one big hit before a result gets crunched and BOINC finds out you were lying :).

Normally this sort of behaviour is quite futile as the server will cut off all the outstanding work after the deadline expires and all you will succeed in achieving is 33 days of wasted work that might have to be sent out again, if a quorum hasn't been formed. That's correct isn't it?? So how come this machine was still returning work and being granted credit a full 4 weeks after the work was issued??

Or am I missing something here???

48) Message boards : Number crunching : Could I ask for an update on the frontpage? (Message 16184)
Posted 24 Jan 2007 by Profile Gary Roberts
Post:
Better to wait and have everything running correctly, than rush in and have problems for months to come.


What has any of that got to do with putting a status update on the front page???

After all they did promise to try to keep us updated as to progress...

49) Message boards : Number crunching : How about MAC ? (Message 16136)
Posted 15 Jan 2007 by Profile Gary Roberts
Post:
Do you plan to make application for OS X ?


Go to the front page of the project and follow the link to the "Questions and Problems" lists. Have a read of the very top thread in the "Wish" list.

50) Message boards : Number crunching : Past Due Date (Message 16091)
Posted 10 Jan 2007 by Profile Gary Roberts
Post:
.... Am I missing something here ???


Yes, unfortunately I believe you are :).

The granting of full, partial or zero credit has nothing to do with returning results late in the cycle. Even if you are within the deadline by just 1 second, valid results will still get full credit and that's the way it should be.

If your results do not validate, you get zero credit. If your results are close but not within the quite strict tolerances required by LHC, you are likely to get half credit. I think this is a very fair system because "close" results may be caused by hardware or software differences in the computer/OS/system libraries being used and are therefore outside the direct control of the user. In these cases the users should get some reward for their efforts.
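The credit outcomes described above reduce to a simple three-way rule. This toy restatement is for illustration only; the function name and inputs are invented, and the real validation runs server-side with LHC's own tolerances:

```python
# Toy restatement of the credit policy: full, half, or zero credit.
def credit_fraction(validated, within_loose_tolerance):
    if validated:
        return 1.0   # full credit: result inside the strict tolerances
    if within_loose_tolerance:
        return 0.5   # half credit: "close" result (hardware/OS differences)
    return 0.0       # zero credit otherwise

print(credit_fraction(True, False))   # 1.0
print(credit_fraction(False, True))   # 0.5
print(credit_fraction(False, False))  # 0.0
```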

On the more general question of the return of the 4th and 5th results when the quorum of three has already been formed, this should not be treated any differently and there should not be any stigma or penalty for being last to return. In quite a few cases, 4th and 5th results actually get used and do help shorten the overall time taken compared with what it would have been if only three had been issued initially. I'm sure that this is something that the project staff would be well aware of.

A useful option would be for the core client to ask the server if a quorum exists before starting a new result. That way, an unneeded result could be aborted before computation started and a fresh replacement downloaded (if available). However, I imagine this would put a lot of extra strain on the servers and therefore might not be feasible. It would be very useful however if a machine has had to be switched off for a day or two and the cached work units were therefore a bit stale. Aborting stale unneeded results and moving on to something more useful would seem to be a good idea.
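The idea above could be sketched as a client-side filter over cached results. Everything here is invented for illustration (BOINC has no such quorum-query RPC); the in-memory "server" just stands in for whatever the real protocol would return:

```python
# Hypothetical sketch: skip (abort) cached results whose workunit
# already has a full quorum on the server, keeping only useful work.
class FakeServer:
    def __init__(self, quorum_done):
        self.quorum_done = set(quorum_done)

    def quorum_reached(self, wu_id):
        return wu_id in self.quorum_done

def results_worth_crunching(cached_wu_ids, server):
    # Keep only results whose quorum is still incomplete.
    return [wu for wu in cached_wu_ids if not server.quorum_reached(wu)]

server = FakeServer(quorum_done={"wu_1", "wu_3"})
print(results_worth_crunching(["wu_1", "wu_2", "wu_3"], server))  # ['wu_2']
```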
51) Message boards : Number crunching : SOME greedy users (Message 16062)
Posted 6 Jan 2007 by Profile Gary Roberts
Post:

But what's the solution, having constant work, or not having greedy users?


Actually, you only need a solution if there is a problem and I don't think there really is a problem. Certainly not a problem that needs to be solved urgently as far as the project staff is concerned.

By now everybody should know that work is patchy and unreliable. Any person concerned about this should make it a priority to research the strategies that could be used to improve the chance of getting appropriate work when work is available. My opinion is that there really isn't this large bunch of greedy users with huge deep caches that are immediately exhausting the work as soon as it appears. If there were, why is it that the work released around 30/31 December took well over a full day to be exhausted?
52) Message boards : Number crunching : SOME greedy users (Message 16061)
Posted 6 Jan 2007 by Profile Gary Roberts
Post:

Ohh so he says you're wrong (true or not), and you get pedantic about grammar, trying to find the most little details where he is wrong? How nice...


I'm sorry but I don't understand what you are on about. He didn't say I was wrong - in fact he has never responded to me at any point that I'm aware of. He made a statement saying that FalconFly was missing the point and I responded to him to question the notion of who was actually missing the point.

I must say I enjoyed your treatise on grammar pedantry. There is nothing about grammar involved here. The grammar is actually correct, although many people do have problems properly handling the meaning when double negatives are used :). It's just that his words have the totally opposite meaning to what I think he was trying to say.

So you decide to get involved and stir it all up with a fallacious claim about pedantry... How nice...
53) Message boards : Number crunching : SOME greedy users (Message 16058)
Posted 6 Jan 2007 by Profile Gary Roberts
Post:

Falcon you totally miss the point.

The complaint is NOT the lack of an endless stream of work. The complaint is not being unable to acquire work because a number of users have deep caches - which locks work away for days that others could be doing.


Actually, I'm sorry to say that I think it is really you who is the one missing the point. The point really is the patchiness of the work because there wouldn't be any complaints about greedy users with deep caches if there was a continuous supply of work.

Also, I think you meant "...not being able..." rather than "...not being unable..." :). It would be mind blowing to see users complaining if they indeed were able (ie not being unable...) to get work :).
54) Message boards : Number crunching : SOME greedy users (Message 16057)
Posted 6 Jan 2007 by Profile Gary Roberts
Post:
The last time LHC had units to work on, about 5 or 6 days ago, one of my three computers got about 20 to 30 units (I didn't count them). The other two didn't get any. Just SETI and LHC are the only working programs on these machines and they share 100% computer time. I have a five day cache.


Getting 30 WUs with a five day cache is pretty much what you would expect to happen. I'm very surprised that your other two machines didn't get any as the work was available for quite a long time. Those machines must have been on a very long backoff (more than 24 hours) to have missed out completely if Seti was your only other project. Alternatively, your Seti cache may have been overfull for some reason.

The units I have still working on the LHC program have a completion time of about 5 hours +/-. Since LHC has a completion return date sooner than SETI, LHC are being completed first, two units at a time. I think the machines are running like they should,


Yes, if you have a dual core (?HT) machine doing two units at a time every 5 hours you will have no problem completing the work within the deadline. However, for LHC with its relatively short deadline, a 5 day cache is too big. larry1186 has given you very good advice about structuring your resource shares to favour LHC and you can do this by changing the resource share value in your project specific preferences (not general preferences) on each relevant website. With both LHC and Seti having (different) issues with work supply, it is quite likely that both could be out of work simultaneously. To guard against this it would be useful (just like larry1186 has done) to have at least one more project in the list. Einstein@Home (EAH) would be a good choice if you are into physical sciences. You could set up a resource share LHC/Seti/EAH of 900/50/50 and a cache size of say 2.5 days max, which should work very well. Seti and EAH would share your machine equally and back each other up for those periods where there was no LHC work.
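Resource shares are relative weights, not percentages, so the suggested 900/50/50 split works out as follows (a quick sketch of the arithmetic, nothing more):

```python
# The suggested 900/50/50 split, expressed as fractions of total CPU time.
shares = {"LHC": 900, "Seti": 50, "EAH": 50}
total = sum(shares.values())
fractions = {name: value / total for name, value in shares.items()}
print(fractions)  # {'LHC': 0.9, 'Seti': 0.05, 'EAH': 0.05}
```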

but with all this talk of a few people being greedy in the download of work units I was concerned that perhaps my machines were taking too much and was wondering just what is too many units?


The talk is just that -- talk. A capable machine getting 30 work units is not being greedy. You will finish and return the work well within the deadline so there is no problem. Too many WUs is the situation where you cannot complete the work within the deadline (e.g. if your machine is not running 24/7, or you have too much work from other projects as well). BOINC will try very hard to not let this happen in the first place. You can help BOINC by reducing your cache size as suggested.
55) Message boards : Number crunching : Totally Disgusting (Message 16056)
Posted 6 Jan 2007 by Profile Gary Roberts
Post:

One of those unofficial "calibrating" clients let you set how many CPUs to claim to have, so you run 4 units at a time on a single-core...


If you are referring to the Boinc Studio client, which does indeed allow you to easily fake the number of CPUs your box has, you are incorrect in your claim about running four units at once. The purpose of claiming four CPUs was to work around the very frustrating daily result limit that Einstein@Home had at the time. Many people with recent machines (and not particularly fast ones either) were exhausting their daily allowance of work in less than 12 hours. By faking the number of CPUs to four, people were able to get four times the daily limit from the server if they needed to. It was a lifesaver for many people with moderately fast boxes.

However, the work was processed one unit at a time per actual CPU and NOT one unit per faked CPU. Whilst it is possible to run multiple science app instances per CPU (and you don't need Boinc Studio to do this), you would be silly to do so as each instance effectively consumes 100% of the cpu when it is running. The context switching overhead created by cycling between the instances would consume enough CPU cycles to give a reduction in total output rather than an increase.
56) Message boards : Number crunching : SOME greedy users (Message 16040)
Posted 5 Jan 2007 by Profile Gary Roberts
Post:
I do know that when I receive LHC units my SETI units take second place due to completion time as LHC units usually have a short completion return time.

You might have a DCF problem. Stop BOINC, edit your client_state.xml to have <duration_correction_factor>1.0</duration_correction_factor> for LHC instead of whatever number it has now. I'd explain better but I got to go now...


The above is possibly quite bad advice if you don't do some basic checks first.

If your actual result completion time for the normal full-running work unit is approximately the same as the predicted time shown before crunching starts, there is absolutely no need to change the DCF. In any case, the BOINC client will adjust the value as required, over time, even if it is a long way off. By editing client_state.xml by hand, you are taking the risk that a simple typo could cause you to trash your whole cache of work units.
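What the DCF does can be shown in two lines. This is a simplified sketch with made-up values; the real BOINC client adjusts the factor gradually as results complete:

```python
# Simplified sketch: the duration correction factor (DCF) scales the
# server's runtime estimate to match the host's observed performance.
server_estimate_hours = 5.0
dcf = 2.0  # the <duration_correction_factor> value in client_state.xml

predicted_hours = server_estimate_hours * dcf
print(predicted_hours)  # 10.0 -- a DCF near 1.0 means estimates already match
```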

Please realise that you will always get a burst of work units when an out-of-work project suddenly gets work again. The critical factor in this case is the 'connect to network' setting in your general preferences. Seti was mentioned as a coexisting project so I'm guessing that a cache size of several days was in force to guard against times when Seti is having problems.

As an example of what can happen, let us assume just two projects, Seti and LHC, with a 50/50 resource share and a cache size of 5 days. This might be considered fine for Seti and there might be times when you could need this large a cache to prevent running out of work if LHC was in its 'dry' period and you had no other projects to fall back on. If LHC suddenly gets work, BOINC will download effectively 10 days work, ie, a full 5 day cache for LHC on top of a 5 day Seti cache. The problem would be magnified if the LHC share was less than 50/50. If the ratio were 80/20 in favour of Seti, BOINC would still download 5 days of work for LHC which could theoretically take 25 days to complete because of the 20% resource share. BOINC has mechanisms to protect against this by deciding that the computer is overcommitted and by using EDF mode to clear the 'at-risk' work.
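The 80/20 figure above is just the cache size divided by the project's share of CPU time:

```python
# Arithmetic behind the 80/20 example: a 5-day cache for a project
# that only gets 20% of CPU time takes ~25 days to clear.
cache_days = 5.0
lhc_share = 0.20

days_to_clear = cache_days / lhc_share
print(days_to_clear)  # 25.0
```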

From the description given in Yank's post, I'm guessing that the solution is to lower the 'connect to network' setting a bit if there is too much work being downloaded when LHC suddenly has new work. However Yank needs to give a lot more information to be sure about this.
57) Message boards : Number crunching : Past Due Date (Message 16019)
Posted 4 Jan 2007 by Profile Gary Roberts
Post:
... Probably will be a day or two late. Should I abort or continue?


Your best plan is to check each work unit on the website to see if a quorum has already been formed. If it has, there is absolutely no benefit to the project if you continue to process and return your result. If you managed to squeeze it in before the deadline then you would get credit but the result would not be used by the project anyway. If it's after the deadline - no credit as well as no use to the project.

The other thing is to look at crunch times of other people to see if those results are going to run the full expected time or if they might finish early. You never know, you might have a batch of all short crunch times left :).

EDIT: I just had a quick look at your results on the website. Your P4 machine has 8 outstanding results, all of which have already had the quorum formed and the results validated. In each case, the awarded credits are close to 30 so there are no short run times in that lot. The project already has the information it needs from those workunits so, if you are not interested in credit, you should abort them all forthwith. However, don't take my word for it - go check for yourself and learn how the system works.

58) Message boards : Number crunching : 30 hours of work availability - did u get some? (Message 15975)
Posted 3 Jan 2007 by Profile Gary Roberts
Post:

Protons, of course!!
LHC stands for Large Hadron Collider, a proton is a hadron.


When I went to school in the 50s and 60s, we were taught that the basic indivisible building blocks of matter were protons, neutrons and electrons. Now there is this mind blowing array of exotic theoretical particles, apparently all experimentally observed with the single exception of the Higgs boson. It will be very interesting to see what happens if the LHC doesn't finally unmask this little sucker :).
59) Message boards : Number crunching : I found an Anomaly (Message 15974)
Posted 2 Jan 2007 by Profile Gary Roberts
Post:
I've been here over two years and didn't know that and have never seen it in the results before the one I pointed out.


Hey Steve,
Maybe you have forgotten about this thread, which caused quite a stir at the time :). Warning - it's quite a long thread with a bit of aggro at times but good for a laugh at how passionate some people become about credit.

If you just want the definitive answer about half credits, check out this post in that thread by Markku Degerholm who was one of the admins at the time, I think. My impression was that he entered the fray in an attempt to hose things down. He was pretty successful as the thread (which had been running for over a month) died soon after he made a couple of posts.

As far as not seeing half credits in the results, I guess it depends on how much you stress your computers. I like to overclock moderately and I've found that LHC WUs are the most sensitive indicator of the first tiny instability if you push it a little bit too far. I've had machines that are rock solid on other projects which give an occasional half credit result on LHC. If I back off the overclock just a fraction, the problem usually disappears. I think it's a very useful feature for judging machine stability.
60) Message boards : Number crunching : 2 WUs won't upload (locked by file_upload_handler PID=20350) (Message 14079)
Posted 20 Jun 2006 by Profile Gary Roberts
Post:
If you stop and restart BOINC, this problem will go away. It appears to be something to do with BOINC still trying to use old DNS values cached somewhere.




©2024 CERN