21) Message boards : Number crunching : macintosh (Message 19316)
Posted 28 Mar 2008 by Brian Silvers
Post:
When will there be units to crunch with the Mac platform??


Discussion of this is in this thread.
22) Message boards : Number crunching : Initial Replication (Message 19284)
Posted 20 Mar 2008 by Brian Silvers
Post:

But if IR = Q it would never have been sent to 2 of the 5 machines. Thus 2 of the machines have crunched the wu for no good reason.


That is true if, and only if, the first 3 were all "strongly similar" (a validation match). This harks back to the discussion earlier in this same thread about failure rates on the initial replication. Just because 3 are sent out does not mean that 3 "good" results will come back on the first pass. At the time the decision was made, the server-side abort capability didn't exist as it does today. Based on that and the construction goals, this was a valid decision, IMO. Your position is along the lines of "well, if the first 3 are not a match, surely it can't take that long to send out more until you get a matching set". Their position was, "Why don't we try to increase the odds of a matching set on the first pass, given our construction deadlines?"
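
To put rough numbers on the project's side of that argument, here's a quick back-of-the-envelope sketch (my own toy model, not the project's math), assuming each returned result is independently "good" with some probability p, where "good" stands in for "part of a strongly similar set". Real failures aren't independent like that, so treat the outputs as illustrative only:

    from math import comb

    def p_quorum_first_pass(ir, quorum, p_good):
        """Probability that at least `quorum` of the `ir` initially
        issued results come back "good", assuming each result is
        independently good with probability p_good (an assumption)."""
        return sum(comb(ir, k) * p_good**k * (1 - p_good)**(ir - k)
                   for k in range(quorum, ir + 1))

    # Illustrative only: with an 80% per-result success rate,
    # IR=3 yields a 3-way quorum on the first pass ~51% of the time,
    # while IR=5 does so ~94% of the time.
    print(p_quorum_first_pass(3, 3, 0.8))  # ~0.512
    print(p_quorum_first_pass(5, 3, 0.8))  # ~0.942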

As someone else mentioned (Keck?), when the LHC is up and running, it is likely that a different app other than SixTrack will be used.

What I guess I'm getting at is that so long as the server code is old, this is akin to beating a dead and at least partially decomposed horse. Adding insult to injury, the science application is being used for construction and fine-tuning purposes whose results need to be reported sooner rather than later.

On the one hand, I hear the reasoning that you could go with shorter deadlines. However, a shorter deadline not only puts pressure on the clients, it also mucks around with the scheduling mechanism in BOINC. People get really bent out of shape about science application X "hogging" their system, so it is conceivable that a shorter deadline would simply shift your "pain" to someone else, and we'd have a new and different vocal group complaining about how the deadlines are too short and unfair.

If any of the original replication machines had failed, then send it out again. A lot of projects do that, often with a shorter than usual deadline. Some, SIMAP for example, send failed units out again to "known good hosts" that have a proven track record of returning good work promptly (I receive a good number of those myself).


Not a bad idea, but it needs a rework of the server-side code to be supported here, and that code is first in need of an upgrade, so the server-side upgrade is the choke point again. In lieu of all these "what if" alternate methods you and others propose, what is in place is the "best" implementation for what they need to get done, IMO, and in the project's opinion as well.


You and I are unlikely ever to agree on this point. You have behaved in a generally civil manner, however, and I thank you for the debate.


I'm not saying you're wrong, I'm just saying that given the circumstances, what we have is not really that bad. You and others seem to be looking at the issue through a microscope instead of backing up and looking at all the other factors surrounding the issue.

In summary, is the additional replication "wasteful"? Sometimes, perhaps, but not always. Could it be better? Of course it could, but given the circumstances, it is probably the best solution to meet the project's needs with the minimum amount of impact to those of you who feel as you do (as I pointed out, they could've had IR=10 or even higher)...
23) Message boards : Number crunching : Initial Replication (Message 19281)
Posted 19 Mar 2008 by Brian Silvers
Post:
And now a fourth machine has crunched the unit. 31,000+ wasted crunching.


I'm not trying to be overly argumentative here, but there is a flaw in the blanket assumption that it was "wasted"...

That WU reached quorum at 19 Mar 2008 2:30:18 UTC. The fourth host reported at 19 Mar 2008 16:03:19 UTC, and its record shows a scheduler contact at 19 Mar 2008 13:45:36 UTC, which was after quorum on the WU in question. We cannot state unequivocally that the fourth system didn't contact the scheduler between its last task download (17 Mar 2008 21:06:46 UTC) and that contact. Still, the potential exists that the host had no way of knowing that the result it held was no longer needed, and that no scheduler contact could have informed it: a contact made before quorum / validation would have found everything still "ok", and by the next contact, after validation, the host could already have begun work on the task. IOW, this situation is one that could "fall through the cracks" no matter what is done with the server-side aborts.
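
If it helps, here's a little sketch of the timing logic I'm describing. The contact and quorum times are the real ones quoted above; the task start time is hypothetical, since we can't see it:

    from datetime import datetime

    def abort_could_have_helped(contact_times, quorum_time, task_start,
                                abort_running=False):
        """Could any scheduler contact have stopped this host's work?
        A "polite" abort (abort_running=False) only cancels a task that
        hasn't started; an unconditional abort cancels it regardless."""
        for t in sorted(contact_times):
            if t < quorum_time:
                continue              # quorum not yet reached; nothing to cancel
            if abort_running or t < task_start:
                return True           # this contact could have carried the abort
        return False

    contacts = [datetime(2008, 3, 17, 21, 6, 46),   # last task download
                datetime(2008, 3, 19, 13, 45, 36)]  # next known contact
    quorum = datetime(2008, 3, 19, 2, 30, 18)
    start = datetime(2008, 3, 18, 8, 0)             # hypothetical start time
    print(abort_could_have_helped(contacts, quorum, start))  # False: it falls through the cracks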

Now, this could perhaps strengthen the argument for setting MQ=IR and assigning extra replications as needed, but the project has explicitly stated that they discussed IR=5 at the time and felt it best suited their needs.

Is it possible that, if one of the 3 that were in at 19 Mar 2008 2:30:18 UTC had not validated, an extra replication issued at that point could've been validated in the same amount of time or even less? Sure, it's possible. Is it likely? Who knows...? It is possible, though, that the fourth task could've gone to a lower-powered box with an even smaller resource allocation, thus lengthening the time needed to get another result in and see if it validates. Then there's the possibility of yet another reissue if the fourth result fails to validate and/or never gets turned in.

Essentially, the project did perform some sort of "due diligence" when setting IR=5, as they could've gone to IR=10. The "extra two" replications are a bit of "insurance". I see both sides to this; that's why I'm saying I just can't get all worked up into a lather like some people do. It's just not that big of a deal to me. More than that, I can see the possibility that it indeed could take longer to get the WU validated if you do IR=MQ, whereas apparently some of you think, "gee, it can't possibly be that much of an additional wait"...
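
For those who think the extra wait would be negligible, here's a toy Monte Carlo comparing IR=5 against IR=3 with reissue-on-failure. The turnaround distribution and success rate are entirely made up, so this only illustrates the shape of the trade-off, not the real numbers:

    import random

    def time_to_quorum(ir, quorum, p_good, draw_turnaround, reissue_wait):
        """Toy model: issue `ir` copies up front; each comes back after a
        random turnaround and is "good" with probability p_good. If the
        initial batch doesn't yield a quorum, reissue one copy at a time."""
        results = sorted((draw_turnaround(), random.random() < p_good)
                         for _ in range(ir))
        elapsed, good = 0.0, 0
        for finish, ok in results:
            elapsed = finish
            good += ok
            if good >= quorum:
                return elapsed            # quorum made within the initial batch
        while good < quorum:              # IR=Q path: a full extra turnaround per reissue
            elapsed += reissue_wait + draw_turnaround()
            good += random.random() < p_good
        return elapsed

    random.seed(1)
    draw = lambda: random.uniform(0.5, 6.0)   # turnaround in days (made up)
    for ir in (3, 5):
        runs = [time_to_quorum(ir, 3, 0.8, draw, 0.1) for _ in range(10000)]
        print(f"IR={ir}: mean days to quorum ~ {sum(runs) / len(runs):.2f}")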

All in all, I see this as being upset about a relatively minor issue. I'm not saying that it isn't an issue, just that it doesn't "require" telling people who, like me, don't see it as that big of a deal that they are idiots or whatnot. No, you didn't do that, but it was done in this thread and it is still there for all to see...

Anyway, I'm exiting this discussion again.
24) Message boards : Number crunching : Initial Replication (Message 19276)
Posted 19 Mar 2008 by Brian Silvers
Post:
it would seem that lobbying Alex and Neasan will not have a direct impact as all they're going to be able to do is mention it to their boss.

... and thus it has been with QM running the show, and so it was before, when it was run from Geneva. As I said, I have been around here for a long time, possibly a lot longer than you, and with a project like CERN, the politics and empires take precedence over actually doing anything. Hence my one liner...


OK, so if that is the case, then what? As I see it, the only options are to continue on, armed with the knowledge, or to quit. If you continue on, then you could manually abort tasks if it is a great enough concern to you. Alternatively, you could just blanket-abort all short-duration tasks and opt to work only on the longer-duration tasks, since you have a faster system and the effects of a lower resource allocation would not be as evident with those... meaning your faster system could help offset the lower allocation by completing tasks quicker, while people with slower systems, even with higher allocations or smaller caches, would get bogged down more... taking longer to complete and report...

In any case, I hope to not see a return to the invective that went on in the past in this thread. I'm not accusing you of that, it's just this thread was very much forgotten by most folks until the OP dredged it up again... It is an issue, but I personally can't get all worked up about it like some apparently can...
25) Message boards : Number crunching : Initial Replication (Message 19273)
Posted 19 Mar 2008 by Brian Silvers
Post:

Since when has lobbying achieved anything around here?


Based on the post Richard linked to, it would seem that lobbying Alex and Neasan will not have a direct impact as all they're going to be able to do is mention it to their boss. Looking at it from their perspective, eventually you have to decide whether or not you wish to annoy someone again if they keep telling you "no".

So, what some industrious soul might consider doing is trying to figure out who are likely candidates for the roadblocks and then send them a polite email or make a polite phone call to them, if their number is listed on the university web page...

I'd vote with my feet, but that isn't going to send a message to the admins.


As said, based on the post from Neasan, the "admins" (Alex and Neasan) aren't the roadblock, thus the message needs to go beyond them...
26) Message boards : Number crunching : Initial Replication (Message 19271)
Posted 19 Mar 2008 by Brian Silvers
Post:
Extending from the same argument,

You can't do that; at Rosetta, the smallest amount of work a wu can do is a single iteration. You can set your requested run time very low, I suppose, but it would always overshoot it.


So what is the smallest amount of time that you can allocate to make a "complete" WU over there?

You claim that this is not a complete wu, yet here, the wu's run for "n" trips around the LHC, where "n" is variable. What constitutes a "complete" wu here?


I see a distinction between that variability being done on the server vs. allowing all of us to control the variability, although I can see the basis for your question...


So what? In the other wu's I wasted my cycles; in that one, others will. If you look at it now, there are 3 results, all agree, no errors, credit granted. 2 machines out there are now crunching, for a long time, for a total waste of time, cycles and energy.


As I mentioned before (at least twice), and Richard has pointed out again, the mechanism for server-side aborts would handle this, but only "sort of". The caveats are that the host's BOINC version needs to support the server-side request (versions 5.8.16 and older do not). Also, it requires a scheduler contact to work, and unless the project goes with the less user-friendly option, it will not abort a task that has already been started.
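
Roughly, the client-side rule works like the sketch below. I'm assuming here (this is my illustration, not BOINC's actual code or message format) that the scheduler reply carries two lists of result names, one unconditional and one "if not started":

    def handle_abort_requests(tasks, abort, abort_if_not_started):
        """tasks maps result name -> 'not_started' | 'running' | 'done'.
        Returns the result names the client would abort. Clients at
        5.8.16 or older simply ignore both lists."""
        cancelled = []
        for name in abort:                    # unconditional: cancel regardless
            if tasks.get(name) in ('not_started', 'running'):
                cancelled.append(name)
        for name in abort_if_not_started:     # "polite": only if never started
            if tasks.get(name) == 'not_started':
                cancelled.append(name)
        return cancelled

    # A task that has already started survives the polite variant:
    print(handle_abort_requests({'wu_1_0': 'running'}, [], ['wu_1_0']))  # []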

If you want to be a voice for change, IMO you need to pursue getting the server software updated first.
27) Message boards : Number crunching : Initial Replication (Message 19270)
Posted 19 Mar 2008 by Brian Silvers
Post:
The newer BOINC server software has a solution for that - outcome 221, 'Redundant result - cancelled by server'. It would work particularly well at LHC, because most hosts are contacting the server every 15 minutes while there's work around, and in general the late returns are high cache/low resource share, rather than extended crunching times (at least, for the 100,000 turn WUs they are). So send out IR=5, get the fastest three results back, and tell the other two they needn't bother, move on to the next one or to another project. The only thing that's wasted is a bit of download bandwidth.


I mentioned that ages ago. I also mentioned it again this time around: people should push for the server software upgrade. The "battle" simply cannot be won without the software being updated. The only catch is that the version of BOINC on our machines has to support it, and versions like mine (5.8.16) and older do not.

So, humble suggestion to Adrian, Fat Loss (which, speaking of, I need to do), and others is to lobby for the software update FIRST and do so politely...
28) Message boards : Number crunching : Initial Replication (Message 19267)
Posted 18 Mar 2008 by Brian Silvers
Post:
At Rosetta, you can choose how long you want each work unit to run. What it does is download a start point and then run as many iterations of the model as possible, completing and returning the wu at more or less your desired run time. In fact, the run time is "approximate": it is rarely an exact multiple of the iteration time for a particular protein, but it evens out. That said, you can easily choose to run 1-hour wu's if you so wish.


Extending from the same argument, if the LHC tasks you process complete within 6 minutes, then you could run "6 WUs" for Rosetta at 1 minute apiece. You could keep going into sub-minute or even sub-second tasks, assuming that Rosetta allows this.

I admit that I did not, and do not, have a good understanding of Rosetta, as I haven't run the project at all; however, it seems to me that it is not an "apples to apples" comparison, given the user-controlled variability that Rosetta provides, which is not available here. Personally, no matter what Rosetta does, I don't consider that a "complete" WU, even if I do follow the concept that from Rosetta's perspective it is a "complete" entity.
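
For what it's worth, my mental model of the mechanism described above is something like this (assumed behaviour based on the description, not Rosetta's actual code):

    import math

    def rosetta_runtime(desired_minutes, iteration_minutes):
        """Run whole iterations until the desired runtime is reached,
        so the task always overshoots slightly, and can never do less
        than a single iteration."""
        iterations = max(1, math.ceil(desired_minutes / iteration_minutes))
        return iterations * iteration_minutes

    print(rosetta_runtime(60, 7))   # 63 min: a "1 hour" wu overshoots by 3 min
    print(rosetta_runtime(1, 7))    # 7 min: can't go below one iteration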



Point 2,...


That task demonstrates exactly what I was mentioning regarding the server having no way of knowing whether a given host is going to be an "early complete" or a "late complete". For the other tasks you listed, you were 4th or 5th in. This time, you're 2nd.

The deadlines are fairly short to begin with at 6.6ish days (I didn't do the exact math). It is the shortest deadline of the 4 projects I am attached to.

Given that my host takes 5 hours for the million turn results, for me to complete 5 hours in 6.6 days, I need to be running at least 46 minutes a day. At my current resource allocation of 4%, this means (in theory) that I have a maximum of 57.6 minutes per day that LHC gets. If I don't have BOINC running or if I'm doing other tasks, that amount of time goes down further. That being said, I handle resource allocations manually. I only have the percentages in there as a general guideline in case I'm in "unattended mode".
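
For anyone who wants to check my arithmetic:

    task_hours = 5.0        # million-turn result on my host
    deadline_days = 6.6
    resource_share = 0.04   # my 4% allocation

    needed = task_hours * 60 / deadline_days    # ~45.5 min/day to finish in time
    available = resource_share * 24 * 60        # 57.6 min/day LHC actually gets

    print(round(needed, 1), available)          # 45.5 57.6 -> it fits, barely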

IOW, if I left BOINC alone, I could download one task, but take up the full deadline to report it. Is that preferable to you? Perhaps. Is it preferable to the project? Apparently not. Have I "wasted" the time spent? I can see both sides to that, but I think the actual fault is my resource allocation is too low. Of course, one never knows if my reporting of a task at 6.5 days might actually be the task that makes quorum because the other hosts errored out or haven't reported at all.

I said I was getting narked at LHC wasting cycles that could be used for productive purposes. I stand by that.


I said that I thought that griping about this was silly, and I stand by that. It is so far down the scale, it isn't worth getting all bent out of shape over. In my view, if we were to use the hospital "pain scale" of 1-10 for this, I'd call it a 1. You apparently call it a 7 or 8, perhaps a 9. Just as in dealing with physical pain, each of us sees a situation as more or less troubling than another. It's just that I wouldn't pick this one problem as my major battle cry...

IMO, YMMV, etc, etc, etc...
29) Message boards : Number crunching : Initial Replication (Message 19265)
Posted 18 Mar 2008 by Brian Silvers
Post:
Since when has lobbying here got anything done?

Face it, the people here have enough CPU power for their needs, they don't need to do anything to keep crunchers happy.


Perhaps the difference is in the approach? From what I've seen, most attempts at advocating change have been linked to people hurling insults in the process. While I've surely been guilty of this same thing in the past, it's probably not the best approach...


I turn in a job that 4 others have already completed without error; mine is the fifth. I could have run 2 Rosetta wu's in that time,


You could've run 2 Rosetta tasks that completed within 26 minutes? I've never run Rosetta, but I find that unlikely. Maybe with the larger 1M-turn tasks here, but not the 100K-turn ones... Also, most tasks are issued around the same time, and there is no way for the server to know who is going to complete a task first. Sure, you could attempt to base it off of turnaround time, but that isn't a guarantee.

Essentially, the deal is that you have a low resource allocation. So do I. You get a set of tasks in one shot, but the CPU scheduler on your side is timeslicing more, whereas other people may have higher resource allocations and/or smaller caches, so they complete the task, on average, before you do. This is where the server-side aborts would come into play, but that requires a server-side upgrade, and the project apparently wished to wait until that upgrade was performed.

In any case, you have the tools to address the issue on your own. You could download tasks and check your local queue against the submitted results on the server and manually abort those that have met quorum and been validated.

Before you go moaning about how you "shouldn't have to", bear in mind that this advice is given because you claim the situation is irritating you, and this method would be one way to address the concern about "wasting" CPU cycles. Sure, it asks you to be more engaged in the process, and you can rightfully say that you don't feel you should have to do so, but I'm pointing out that you do have a choice.
30) Message boards : Number crunching : Initial Replication (Message 19262)
Posted 18 Mar 2008 by Brian Silvers
Post:

Would LHC miss those users any more than it misses those who refuse to crunch while IR > Q?


You fail to appreciate the fervor of those who demand that their resource allocations be honored even over very short timeframes. They are just as ridiculously indignant as the people bellyaching about IR > Q, if not more so. More than that, I am relatively certain that the people who demand that resource shares be honored greatly outnumber those who are irate about IR > Q. The topic comes up frequently on SETI and Einstein, so much so that I was able to get the ball rolling on suggesting to the Einstein staff that they extend deadlines to relieve the pressure on older / slower / less dedicated hosts.

If one *must* tinker with the deadlines, then I'd suggest trimming no more than 1 day off of it.

Also, as an "unintended side-effect", since resource allocations are probably low for this project, you'll start seeing an increase in reports that work was available but BOINC determined the host couldn't complete it in time, which would again cause bellyaching about "fair distribution", because users were not informed that there was a change *AND* they don't see why they should have to make a change because it is *THEIR COMPUTER*, dammit!

Trust me, the deadlines are fine. People need to just quit their bitching about this and lobby to get the BOINC server-side components upgraded.
31) Message boards : Number crunching : Initial Replication (Message 19258)
Posted 17 Mar 2008 by Brian Silvers
Post:
Fast turnarounds can be achieved by IR > Q, but they can also be achieved (without wasting cycles) by short deadlines. It is a pity that LHC can't (or won't?) use that method instead.


OTOH, short deadlines lead to Earliest Deadline First / High Priority, which irritates the snot out of a large percentage of users who refuse to understand that their resource allocations are honored over the long term, just not on a second-by-second basis in lockstep with the switch interval. Those users end up grumbling about how the project is being "rude" and how they "want to have control of their computer instead of the project having control", etc, etc, etc...
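
The trigger for that mode is, in essence, a deadline-feasibility check. A simplified version (the real BOINC client rules are more involved; this is just the idea):

    def at_risk(remaining_cpu_hours, deadline_hours_away, resource_share):
        """Simplified deadline check: at its normal share, can the host
        finish before the deadline? If not, the client switches the task
        to Earliest Deadline First / high-priority mode."""
        budget = deadline_hours_away * resource_share
        return remaining_cpu_hours > budget

    # Halving a 6.6-day deadline can push a 4%-share host over the line:
    print(at_risk(5.0, 6.6 * 24, 0.04))   # False: ~6.3h of budget covers 5h of work
    print(at_risk(5.0, 3.3 * 24, 0.04))   # True: only ~3.2h of budget remains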

Frankly, the project has had their say in regards to this. They said they would revisit the topic when the BOINC version on the server has been updated. The BOINC version has not yet been updated, thus this topic has not been revisited. If you want to push this issue, then what you should ask for are details about the progress of the BOINC server upgrade.

Beyond that, individuals who find this single subject such a catastrophic issue should probably heed the advice that was given and contribute their processing time to another project. Of all the things to get worked into a lather about, this is not one that I would choose... If you feel other projects are being "shorted", then vote with your clock cycles.
32) Message boards : Number crunching : Maximum daily WU quota per CPU? (Message 19233)
Posted 14 Mar 2008 by Brian Silvers
Post:
i agree with this.
lets wait until the flow is constant.

but it's funny to see that at first everyone was complaining they didn't get any, and now they complain because they can't get enough
:)


Most folks seem to forget that the quota was intentionally reduced to stop the constant noise about "fair distribution". If the quota were increased again, all the fast systems would consume work at a faster pace, eventually leading back to people complaining that things aren't "fair", that they didn't get any work while others got a lot.
33) Message boards : Number crunching : Maximum daily WU quota per CPU? (Message 19229)
Posted 14 Mar 2008 by Brian Silvers
Post:

I triple that notion of doubling the quota for a few days.


...and I, again, vote against it. The supply quickly ran dry today, which isn't surprising considering that I had several 1 million turn tasks end in less than a second, and the shorter tasks I picked up today, which normally run about 30 minutes, are ending in 6 minutes...

The quota should only be raised if there is a steady stream of work for more than just 2 days in a row. IMO, they need to be showing over 100K results "to crunch" sustained for a full week before even considering raising the quota...
34) Message boards : Number crunching : Initial replication and missing workunit (Message 19201)
Posted 11 Mar 2008 by Brian Silvers
Post:
I was not certain enough yesterday to contradict what povaddict said, and I'm still not certain enough today, since I haven't tracked down what is tickling my brain into thinking this, but I thought that there was a capability within the server-side aborts to do the abort unconditionally. The conditional "abort the task if it is not running" was the "user-friendly" option... I could be mistaken about this, though. IIRC, it was said over on the SETI message boards, by either John Mcleod VII, Josef Segur, or Ingleside...

Yes. If the server notices the client has a workunit that was completely aborted by the admin, or a workunit that has already expired (the user didn't return it on time) and even got validated by other results, it will send an "abort now, no matter if it's running or not".

However, *users* can't choose if they want an "abort even if running" instead of "abort if not started" in the common case that a workunit reaches quorum normally.


Understood that... It's just that a project can choose to do unconditional aborts as well as the "polite" version... ;-)
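
So, putting povaddict's description together with that, the server-side decision is roughly as follows (my paraphrase in code form, not the actual server source):

    def choose_abort(cancelled_by_admin, past_deadline_and_validated,
                     quorum_met):
        """Which abort, if any, does the scheduler send for a result?"""
        if cancelled_by_admin or past_deadline_and_validated:
            return 'abort'                  # unconditional: no credit is possible
        if quorum_met:
            return 'abort_if_not_started'   # the "polite", user-friendly option
        return None                         # nothing to cancel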

Also, BOINC versions 5.8.16 and older do not support the server-side aborts...
35) Message boards : Number crunching : Initial replication and missing workunit (Message 19197)
Posted 10 Mar 2008 by Brian Silvers
Post:
The abort mechanism only aborts ready-to-run workunits. Running workunits only get aborted if there is no way for them to get credit (i.e. if they are WAY too late).

Yes, it is a pity there isn't a client-side option to allow the abort of running WUs if Q has been achieved (or for any other reason the server decides). This would allow crunchers who don't care about credits not to waste CPU time. Perhaps BOINC 6 will bring this...


I was not certain enough yesterday to contradict what povaddict said, and I'm still not certain enough today, since I haven't tracked down what is tickling my brain into thinking this, but I thought that there was a capability within the server-side aborts to do the abort unconditionally. The conditional "abort the task if it is not running" was the "user-friendly" option... I could be mistaken about this, though. IIRC, it was said over on the SETI message boards, by either John Mcleod VII, Josef Segur, or Ingleside...
36) Message boards : Number crunching : What's with the tiny work units? (Message 18999)
Posted 9 Feb 2008 by Brian Silvers
Post:

So my guess is that these new wu's are much more technical than those from a year or 2 ago, but proportional in relation to current hardware.

My dual Xeon 3.0 GHz / 800 FSB wants 15 hours vs. a dual core 6700 at 4.5 hours per wu.



As others have mentioned, it is a higher number of turns around the track: precisely 10x the previous tasks. If it took me 33 minutes before, it now takes 330 minutes, within a few minutes... There really isn't any more detail in the application (we'd have to have a new app version to get that), so there ya have it...
37) Message boards : Number crunching : What's with the tiny work units? (Message 18993)
Posted 8 Feb 2008 by Brian Silvers
Post:
I find this thread to be........quite odd.....considering I'm used to seeing tasks that take only 30 minutes on my system and the 4 that I got over the past day are taking 4-5 hours a piece...

I was going to come here and ask "What's with the BIG work units?"

Oh, the irony.....
38) Message boards : Number crunching : New Year's Present (Message 18789)
Posted 2 Jan 2008 by Brian Silvers
Post:
Happy new year all.
Got 2 or 4 wu's on all my boxen....woots.


What's the plural of moose?
39) Message boards : Number crunching : It's raining LHC WU's - I love it ! (Message 18739)
Posted 19 Dec 2007 by Brian Silvers
Post:
Hm, the batches yesterday and before had about a 10% ratio of those 0.x Seconds WorkUnits.

Today, I'm seeing exclusively those, and it also seems those don't go smoothly somehow (the total number of "WorkUnits in progress" has barely been dropping).

One would expect those to run these numbers down within a few hours. Looking at the server status, the number jumps back up all the time, as if these 0 second WorkUnits didn't really get counted correctly when finished (?)

...needless to say, these 0's have piled up in the Pending Credit list quite a bit (almost 400).


I believe they generated more during the day today, which is why you may not have seen the "in progress" go down by much. I'm busy with Cosmology and Einstein, so I didn't bother to try to catch any of the batch...

Anyway, since you're here, why did you have your entire AMD fleet offline for a while? I had looked to do some comparisons over on Einstein for S5R3, and I notice you've removed the AMD X2 4400+ systems you had...
40) Message boards : Number crunching : Maximum daily WU quota per CPU? (Message 18702)
Posted 14 Dec 2007 by Brian Silvers
Post:
Bigger limit now? I have 22 tasks on a dual core machine.


Not sure if you're asking if there is a bigger limit, or if you are asking FOR a bigger limit...

If you were asking if there has been an increase: No, the limit is still 10.

If you are asking for an increase: From what I can see, the ready to send queue drains very quickly. I would not be in favor of an increase in the quota because it will mean that the queue would be drained even faster, thus leading us back into the whinging about "fair distribution".

What needs to be addressed FIRST is getting a steady stream of work going. If this project is not going to be able to do that, then there is no sense in upping the quota, because the work can be done fairly quickly by the current participants at the current quota / distribution level.

IMO, YMMV, etc, etc, etc...

