Message boards : Number crunching : Initial Replication


J Langley
Joined: 31 Dec 05
Posts: 68
Credit: 8,691
RAC: 0
Message 19263 - Posted: 18 Mar 2008, 12:38:27 UTC - in response to Message 19262.  

You fail to appreciate the fervor of those who demand that their resource allocations be honored even when viewed by very short timeframes.

Maybe. Perhaps they would be a little calmer if the standard BOINC client provided an easy way to see the short-term and long-term debt of each attached project?

People need to just quit their bitching about this and lobby to get the BOINC server-side components upgraded.


I agree that bitching about IR > Q (initial replication greater than quorum) is pointless (given the admins' explanations). I don't think there is much point asking for a server upgrade either, though - Alex and Neasan aren't on this project full-time (not their fault, but a strange decision (in my opinion) by those in charge when LHC moved to QM), and have said they will get round to this when they can. (I guess we have waited so long now, we might as well wait for BOINC 6 anyway.)
ID: 19263
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 19264 - Posted: 18 Mar 2008, 12:59:02 UTC
Last modified: 18 Mar 2008, 13:06:54 UTC

Since when has lobbying here got anything done?

Face it, the people here have enough CPU power for their needs, they don't need to do anything to keep crunchers happy.

I have kept a percentage open to LHC all along, but the IR issue is narking me as well. I turn in a job that 4 others have already completed without error, mine is the fifth. I could have run 2 Rosetta wu's in that time, (and got more credit for it, (can of worms alert)).

If there is still an Intel/AMD issue, that is again an HR (homogeneous redundancy) issue that the project could sort if they chose.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 19264
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19265 - Posted: 18 Mar 2008, 18:56:03 UTC - in response to Message 19264.  

Since when has lobbying here got anything done?

Face it, the people here have enough CPU power for their needs, they don't need to do anything to keep crunchers happy.


Perhaps the difference is in the approach? From what I've seen, most attempts at advocating change have been linked to people hurling insults in the process. While I've surely been guilty of this same thing in the past, it's probably not the best approach...


I turn in a job that 4 others have already completed without error, mine is the fifth. I could have run 2 Rosetta wu's in that time,


You could've run 2 Rosetta tasks that completed within 26 minutes? I've never run Rosetta, but I find this unlikely. Maybe with the larger 1M-turn tasks here, but not the 100K-turn ones... Also, most tasks are issued around the same time, but there is no way for the server to know who is going to complete a task first. Sure, you could attempt to base it off of turnaround time, but that isn't a guarantee.

Essentially, the deal is you have a low resource allocation. So do I. You get a set of tasks in one shot, but the CPU scheduler on your side is timeslicing more, whereas the other people may have higher resource allocations and/or lower caches, so they complete the task, on average, before you do. This is where server-side aborts would come into play, but that requires a server-side upgrade, and the project has chosen to wait until one can be performed.
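To put the redundancy point in concrete terms, here is a minimal sketch (hypothetical names; the real BOINC server logic is far more involved):

```python
# Sketch of the IR vs. quorum bookkeeping (hypothetical names; the real
# BOINC server logic is far more involved).
def redundant_replicas(initial_replication, quorum, valid_results):
    """Replicas still crunching after quorum is met - wasted work when
    the server cannot tell those hosts to abort."""
    if valid_results < quorum:
        return 0                      # quorum not met yet: nothing redundant
    return initial_replication - valid_results

# IR=5, quorum=3: once 3 valid results are in, 2 replicas are redundant.
print(redundant_replicas(5, 3, 3))    # prints 2
```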

In any case, you have the tools to address the issue on your own. You could download tasks and check your local queue against the submitted results on the server and manually abort those that have met quorum and been validated.

Before you go moaning about how you "shouldn't have to", bear in mind this advice is given based on the fact that you claim it is irritating you and this method would be one way to address the concern about "wasting" cpu cycles. Sure, it asks you to be more engaged in the process, and you can rightfully say that you don't feel you should have to do so, but I'm pointing out that you do have a choice.
ID: 19265
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 19266 - Posted: 18 Mar 2008, 20:41:42 UTC
Last modified: 18 Mar 2008, 20:47:59 UTC

I see no insults being hurled by me in my post? However...
You could've run 2 Rosetta tasks that completed within 26 minutes? I've never run Rosetta, but I find this unlikely.

You start by calling me a liar.

At Rosetta, you can choose how long you want each work unit to run. What it does is download a start point and then run as many iterations of the model as possible, more or less completing and returning the wu at your desired run time. In fact, the run time is "approximate" because the runtimes are rarely an exact multiple of the iteration time for a particular protein; it evens out. That said, you can easily choose to run 1 hour wu's if you so wish.

Point 2, this wu is one of the ones that arrived yesterday. My machine completed the wu in 17,677 seconds, and is currently the fastest machine that has completed it. A little maths shows that to be 4.9+ hours. I could therefore have completed ~4.9 1-hour Rosetta wu's in the same time. I also note that one of the last wu's you crunched was longer than the one I link to now.

Here is the wu summary in case it disappears.


I do, indeed, have the ability to micromanage the 2 super quads I have here at home, but I have other machines at remote locations that I visit infrequently. LHC is not attached there.

I said I was getting narked at LHC wasting cycles that could be being used for productive purposes. I stand by that. All of the wu's I have had recently have been crunched to completion without error by all of the machines they have been sent to. An IR = Q would have been totally adequate.

I am left to ponder whether to continue having a percentage allocated to LHC at all.

The first line of my last post states the situation, the project doesn't NEED to do anything.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 19266
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19267 - Posted: 18 Mar 2008, 21:22:07 UTC - in response to Message 19266.  

At Rosetta, you can choose how long you want each work unit to run. What it does is download a start point and then run as many iterations of the model as possible, more or less completing and returning the wu at your desired run time. In fact, the run time is "approximate" because the runtimes are rarely an exact multiple of the iteration time for a particular protein; it evens out. That said, you can easily choose to run 1 hour wu's if you so wish.


Extending from the same argument, if the LHC tasks you process complete within 6 minutes, then you could run "6 WUs" for Rosetta at 1 minute apiece. You could keep going into sub-minute or even sub-second tasks, assuming that Rosetta allows this.

I admit that I did/do not have a good understanding of Rosetta, as I haven't run the project at all; however, it seems to me that it is not an "apples to apples" comparison, given the user-controlled variability that Rosetta provides which is not available here. Personally, no matter what Rosetta does, I don't consider that a "complete" WU, even if I do follow the concept that from the perspective of Rosetta it is a "complete" entity.



Point 2,...


That task demonstrates exactly what I was mentioning in regards to the server having no way of knowing if a given host is going to be an "early complete" or a "late complete". For the other tasks you listed, you were 4th or 5th in. This time, you're 2nd.

The deadlines are fairly short to begin with at 6.6ish days (I didn't do the exact math). It is the shortest deadline of the 4 projects I am attached to.

Given that my host takes 5 hours for the million turn results, for me to complete 5 hours in 6.6 days, I need to be running at least 46 minutes a day. At my current resource allocation of 4%, this means (in theory) that I have a maximum of 57.6 minutes per day that LHC gets. If I don't have BOINC running or if I'm doing other tasks, that amount of time goes down further. That being said, I handle resource allocations manually. I only have the percentages in there as a general guideline in case I'm in "unattended mode".
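For what it's worth, the arithmetic above checks out as a quick back-of-envelope calculation:

```python
# Back-of-envelope check of the numbers above: 4% of a 24-hour day,
# versus the average daily crunching a 5-hour task needs in 6.6 days.
resource_share = 0.04                         # 4% allocation
minutes_per_day = resource_share * 24 * 60    # LHC's daily time slice
required_minutes = 5 * 60 / 6.6               # needed to meet the deadline

print(round(minutes_per_day, 1))   # 57.6
print(round(required_minutes, 1))  # 45.5
```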

IOW, if I left BOINC alone, I could download one task, but take up the full deadline to report it. Is that preferable to you? Perhaps. Is it preferable to the project? Apparently not. Have I "wasted" the time spent? I can see both sides to that, but I think the actual fault is my resource allocation is too low. Of course, one never knows if my reporting of a task at 6.5 days might actually be the task that makes quorum because the other hosts errored out or haven't reported at all.

I said I was getting narked at LHC wasting cycles that could be being used for productive purposes. I stand by that.


I said that I thought that griping about this was silly, and I stand by that. It is so far down the scale, it isn't worth getting all bent out of shape over. In my view, if we were to use the hospital "pain scale" of 1-10 for this, I'd call it a 1. You apparently call it a 7 or 8, perhaps a 9. Just as in dealing with physical pain, each of us sees a situation as more or less troubling than another. It's just that I wouldn't pick this one problem as my major battle cry...

IMO, YMMV, etc, etc, etc...
ID: 19267
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 19268 - Posted: 19 Mar 2008, 9:45:51 UTC
Last modified: 19 Mar 2008, 9:51:31 UTC

Extending from the same argument,

You can't do that; at Rosetta, the smallest amount of work a wu can do is a single iteration. You can set your requested run time very low I suppose, but it would always overshoot it. You claim that this is not a complete wu, yet here, the wu's run for "n" trips around the LHC, where "n" is variable. What constitutes a "complete" wu here? The beam might collapse 2 circuits after the number the wu was set to run for. Same at Rosetta: a "complete" wu would be when every possible protein conformation had been checked; that would take forever, which is why we use DC in the first place.
For the other tasks you listed, you were 4th or 5th in. This time, you're 2nd.

So what? In the other wu's I wasted my cycles, in that one, others will. If you look at it now, there are 3 results, all agree, no errors, credit granted. 2 machines out there are now crunching, for a long time, for a total waste of time, cycles and energy.

The other is not at issue. Sure, if I had a big resource share and the work was thick and fast, but neither is the case.

And yes, I do consider it a pretty poor show when a resource contributed by volunteers is wasted.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 19268
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 19269 - Posted: 19 Mar 2008, 10:04:18 UTC - in response to Message 19268.  
Last modified: 19 Mar 2008, 10:05:32 UTC

So what? In the other wu's I wasted my cycles, in that one, others will. If you look at it now, there are 3 results, all agree, no errors, credit granted. 2 machines out there are now crunching, for a long time, for a total waste of time, cycles and energy.

The other is not at issue. Sure, if I had a big resource share and the work was thick and fast, but neither is the case.

And yes, I do consider it a pretty poor show when a resource contributed by volunteers is wasted.

The newer BOINC server software has a solution for that - outcome 221, 'Redundant result - cancelled by server'. It would work particularly well at LHC, because most hosts are contacting the server every 15 minutes while there's work around, and in general the late returns are high cache/low resource share, rather than extended crunching times (at least, for the 100,000 turn WUs they are). So send out IR=5, get the fastest three results back, and tell the other two they needn't bother, move on to the next one or to another project. The only thing that's wasted is a bit of download bandwidth.
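The idea can be sketched roughly like this (hypothetical data structures; the real scheduler works off its database on each contact):

```python
# Sketch of the server-side abort ('redundant result - cancelled by server',
# outcome 221). Data structures are hypothetical; the real scheduler works
# off its database on each ~15-minute contact.
def tasks_to_cancel(validated_wus, host_queue):
    """On a scheduler contact, list the host's unstarted tasks whose
    workunit already met quorum, so the host can drop them."""
    return [t for t in host_queue
            if t["wu"] in validated_wus and not t["started"]]

queue = [{"wu": "wu_a", "started": False},   # safe to cancel
         {"wu": "wu_b", "started": True},    # already crunching: left alone
         {"wu": "wu_c", "started": False}]   # still needed
print(tasks_to_cancel({"wu_a"}, queue))      # only the unstarted wu_a task
```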

Now all we have to do is persuade Alex and Neasan's bosses to let them do what they want to do anyway, and update the server code (see message 19057).
ID: 19269
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19270 - Posted: 19 Mar 2008, 14:40:30 UTC - in response to Message 19269.  

The newer BOINC server software has a solution for that - outcome 221, 'Redundant result - cancelled by server'. It would work particularly well at LHC, because most hosts are contacting the server every 15 minutes while there's work around, and in general the late returns are high cache/low resource share, rather than extended crunching times (at least, for the 100,000 turn WUs they are). So send out IR=5, get the fastest three results back, and tell the other two they needn't bother, move on to the next one or to another project. The only thing that's wasted is a bit of download bandwidth.


I mentioned that ages ago. I also mentioned it again this time around, that people should push for the server software upgrade. The "battle" simply cannot be won without the software being updated. The only catch is that the version of BOINC on our machines has to support it, and versions like mine (5.8.16) and older do not.

So, humble suggestion to Adrian, Fat Loss (which, speaking of, I need to do), and others is to lobby for the software update FIRST and do so politely...
ID: 19270
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19271 - Posted: 19 Mar 2008, 14:51:13 UTC - in response to Message 19268.  

Extending from the same argument,

You can't do that; at Rosetta, the smallest amount of work a wu can do is a single iteration. You can set your requested run time very low I suppose, but it would always overshoot it.


So what is the smallest amount of time that you can allocate to make a "complete" WU over there?

You claim that this is not a complete wu, yet here, the wu's run for "n" trips around the LHC, where "n" is variable. What constitutes a "complete" wu here?


I see a distinction between that variability being done on the server vs. allowing all of us to control the variability, although I can see the basis for your question...


So what? In the other wu's I wasted my cycles, in that one, others will. If you look at it now, there are 3 results, all agree, no errors, credit granted. 2 machines out there are now crunching, for a long time, for a total waste of time, cycles and energy.


As I mentioned before (at least twice) and Richard has pointed out again, the mechanism for server-side aborts would handle this, but only "sort of". The caveats are that the host's BOINC version needs to support the server-side request; versions 5.8.16 and older do not. Also, it requires a scheduler contact to work, and unless they go with the less user-friendly option, it will not abort a task that has already been started.

If you want to be a voice for change, IMO you need to pursue getting the server software updated first.
ID: 19271
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 19272 - Posted: 19 Mar 2008, 17:07:38 UTC
Last modified: 19 Mar 2008, 17:39:32 UTC

Yes, I am aware of the features of the newer servers, and yes, I have perhaps not the absolute latest BOINC core on all machines, but I am within 1 or 2 recommended minor versions - 5.10.28 on this one, 5.10.45 on my Vista system, etc. It brings me back to the first sentence of the first post I added to this thread...

Since when has lobbying achieved anything around here?

... and I've been around here for a while.

I'd vote with my feet, but that isn't going to send a message to the admins. When they have as many CPUs out here as they need, and everyone fighting to get the work when it comes, a few dozen, hundred, maybe thousand leaving the project goes unnoticed.

To answer your Rosetta question, I just hopped onto Rosetta and edited my preferences to see what I could and couldn't set for my run time. The shortest is 1 hour, the longest 1 day [sic] i.e. not 24 hours, (I wonder what happens at daylight savings start/end...).

As I said before, there is no such thing as a "complete" work unit if you mean to "fold protein x". The way the algorithm works is that an amino acid, (AA), sequence is sent with a random, (or maybe a guess based on other information), start configuration.

The configuration of the AA residues, (angles/rotations between individual AA residues), is then altered and the overall free energy of the resulting fold compared to the original; if it is better, another change is made; if worse, it is undone and a different random change is made. This process continues until the lowest energy configuration has been found for that AA sequence, that start point and a random first move. That constitutes a "decoy" in Rosetta terminology. The Rosetta server side then compares the decoys and issues new start points, or takes other actions depending on what is found.

How long it takes depends on the length of the AA sequence, (a few tens to many hundreds of AAs), and the idiosyncrasies of the individual residues, (some residues are versatile, others really difficult to work with). Hence my comment before: it will always calculate at least one decoy.

What it then does is look at how long that decoy took to find and decide, based on a number of criteria, whether it believes it has time to do another within the user-chosen run time; if yes, it starts again and finds another decoy, if not, it says "finished" and uploads.
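The accept/reject search described above can be sketched roughly as follows (a toy one-variable "energy", nothing like the real Rosetta model):

```python
import random

# Toy sketch of the decoy search described above: the 'protein' is a single
# number and the 'free energy' is its distance from the true minimum at 0.
def find_decoy(energy, perturb, start, steps=1000):
    """Make a random change; keep it if the free energy improves,
    otherwise undo it and try a different random change."""
    conf = start
    for _ in range(steps):
        trial = perturb(conf)
        if energy(trial) < energy(conf):
            conf = trial              # better fold: accept the change
        # worse fold: discarded (the change is 'undone' by not assigning)
    return conf

def run_workunit(make_decoy, run_time, decoy_cost):
    """Always compute at least one decoy, then continue only while another
    is estimated to fit within the user-chosen run time."""
    decoys, spent = [make_decoy()], decoy_cost
    while spent + decoy_cost <= run_time:
        decoys.append(make_decoy())
        spent += decoy_cost
    return decoys

random.seed(1)
best = find_decoy(abs, lambda x: x + random.uniform(-1, 1), start=10.0)
```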

In that respect, a single decoy, (and some decoys can take over an hour to calculate), is as much a full wu as anywhere else; it depends on a project's definition of a wu. At Rosetta it is just that a wu may contain more than one decoy, which saves network bandwidth for those that have such constraints.

The choice of the original configuration depends a lot on what is already known about the protein or proteins of similar formula. Of course, if it is a totally unknown protein, the ab-initio fold start point is random - it can't be anything else.

If you need to know anything more about Rosetta just ask, (I'd say PM me, but of course, that would require an upgrade to the server software...).


Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 19272
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19273 - Posted: 19 Mar 2008, 17:39:56 UTC - in response to Message 19272.  
Last modified: 19 Mar 2008, 17:41:24 UTC


Since when has lobbying achieved anything around here?


Based on the post Richard linked to, it would seem that lobbying Alex and Neasan will not have a direct impact as all they're going to be able to do is mention it to their boss. Looking at it from their perspective, eventually you have to decide whether or not you wish to annoy someone again if they keep telling you "no".

So, what some industrious soul might consider doing is trying to figure out who are likely candidates for the roadblocks and then send them a polite email or make a polite phone call to them, if their number is listed on the university web page...

I'd vote with my feet, but that isn't going to send a message to the admins.


As said, based on the post from Neasan, the "admins" (Alex and Neasan) aren't the roadblock, thus the message needs to go to beyond them...
ID: 19273
AstralWalker
Joined: 30 Nov 05
Posts: 14
Credit: 1,746,819
RAC: 0
Message 19274 - Posted: 19 Mar 2008, 17:57:03 UTC - in response to Message 19269.  

The newer BOINC server software has a solution for that - outcome 221, 'Redundant result - cancelled by server'. It would work particularly well at LHC, because most hosts are contacting the server every 15 minutes while there's work around, and in general the late returns are high cache/low resource share, rather than extended crunching times (at least, for the 100,000 turn WUs they are).

The problem with that is once the WU starts on your system, even for a few seconds, it will not be aborted. At least that's how I understand it.

The length you select for Rosetta WUs essentially determines the number of steps to be executed for particular proteins. This isn't much different than the number of turns in an LHC WU I suppose. With Rosetta, 3 100 step WUs is the same as 1 300 step WU so the number of WUs being calculated, or not being calculated, is not really relevant.

Anyway, I've never been particularly concerned about the IR>Q issue. While I understand that IR>Q has a waste component, wasted cycles are just the byproduct of quality control. Who's to say that there aren't other forms of redundancy (i.e. wasted cycles) built into the WUs of this or other projects? I'm not going to worry about it for now.

I am, though, one of those people that gets irritated by short deadline projects. Resource shares do not work well in the long run, and are even worse in the short term, at least for me, as I change my shares fairly often. The team I'm on has a project of the month and there are other competitions with projects I like such as the recent challenge at Prime Grid and the current one at WCG, and I use resource shares to try and get my machine to crunch those projects with higher priority.

But LHC is always #1 with me. :)

My main computer only connects to the internet sporadically and BOINC becomes very uncaring about resource shares as it will seemingly randomly select the first project it contacts with a share of 1 (LHC is 1000) and then proceed to download 50 WUs. Then it will contact Malaria and download another block of WUs with a short deadline so that the other WUs can't finish on time as BOINC still can't figure out how long a WU will take after months and years of running the project. It's nice to say how BOINC is supposed to work but in this area it does not run as advertised.

So as I get off my soapbox my conclusion is that the real problems with wasted cycles have more to do with how BOINC works than the individual projects.
ID: 19274
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 19275 - Posted: 19 Mar 2008, 18:43:24 UTC

it would seem that lobbying Alex and Neasan will not have a direct impact as all they're going to be able to do is mention it to their boss.

... and thus it has been with QM running the show, and was before when it was run from Geneva. As I said, I have been around here for a long time, a lot longer than you possibly, and with a project like CERN, the politics and empires take precedence over actually doing anything. Hence my one liner...

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 19275
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19276 - Posted: 19 Mar 2008, 19:16:57 UTC - in response to Message 19275.  
Last modified: 19 Mar 2008, 19:20:34 UTC

it would seem that lobbying Alex and Neasan will not have a direct impact as all they're going to be able to do is mention it to their boss.

... and thus it has been with QM running the show, and was before when it was run from Geneva. As I said, I have been around here for a long time, a lot longer than you possibly, and with a project like CERN, the politics and empires take precedence over actually doing anything. Hence my one liner...


OK, so if that is the case, then what? As I see it, the only options are to continue on, armed with the knowledge, or to quit. If you continue on, then you could manually abort tasks if it is a great enough concern to you. Alternatively, you could just blanket abort all short-duration tasks and opt to only work on the longer duration tasks, since you have a faster system and the effects of lower resource allocation would not be as evident with those...meaning your faster system could help offset the lower allocation due to being able to complete tasks quicker, while people with higher allocations or lower caches that have slower systems would get bogged down more...taking longer to complete and report...

In any case, I hope to not see a return to the invective that went on in the past in this thread. I'm not accusing you of that, it's just this thread was very much forgotten by most folks until the OP dredged it up again... It is an issue, but I personally can't get all worked up about it like some apparently can...
ID: 19276
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 19277 - Posted: 19 Mar 2008, 20:02:43 UTC
Last modified: 19 Mar 2008, 20:42:34 UTC

And now a fourth machine has crunched the unit. 31,000+ wasted crunching.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 19277
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19281 - Posted: 19 Mar 2008, 23:10:16 UTC - in response to Message 19277.  

And now a fourth machine has crunched the unit. 31,000+ wasted crunching.


I'm not trying to be overly argumentative here, but there is a flaw in your logic of blanket assuming that it was "wasted"...

That WU reached quorum at 19 Mar 2008 2:30:18 UTC. The fourth host reported at 19 Mar 2008 16:03:19 UTC. The host also shows a scheduler contact at 19 Mar 2008 13:45:36 UTC, which was after quorum on the WU in question. We cannot unequivocally state that the fourth system didn't contact the scheduler between then and the last time they downloaded a task (17 Mar 2008 21:06:46 UTC). Still, the potential exists that the host had no way of knowing the result it held was no longer needed, and that a scheduler contact could not have informed it: a contact may have been made prior to quorum/validation, when all was still "ok", with the next contact not coming until after validation, by which point the host had already begun work on the task. IOW, this situation is one that could "fall through the cracks" no matter what is done with the server-side aborts.

Now, this could perhaps strengthen the argument for setting MQ=IR and assigning extra replications as needed, but the project has explicitly stated that they discussed IR=5 at the time and felt it best suited their needs.

Is it possible that if one of the 3 that were in at 19 Mar 2008 2:30:18 UTC did not validate that an extra replication at that point could've been issued and validated in the same amount of time or even less? Sure, it's possible. Is it likely? Who knows...? It is possible though that the fourth task could've gone to a lower powered box with an even smaller resource allocation, thus lengthening the time needed to get another result in and see if it validates. Then there's the possibility of yet another reissue if the fourth result fails to validate and/or it never gets turned in.

Essentially, the project did perform some sort of "due diligence" when setting IR=5, as they could've gone to IR=10. The "extra two" replications are a bit of "insurance". I see both sides to this, that's why I'm saying I just can't get all worked up into a lather like some people do. It's just not that big of a deal to me. More than that, I can see the possibility that it indeed could take longer to get the WU validated if you do IR=MQ, where apparently some of you think that "gee, it can't possibly be that much of an additional wait"...

All in all, I see this as being upset about a relatively minor issue. I'm not saying that it isn't an issue, just that it does not "require" a lot of telling people who, like me, think it's not that big of an issue that they are idiots or whatnot. No, you didn't do that, but it was done in this thread and it is still there for all to see...

Anyway, I'm exiting this discussion again.
ID: 19281
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 19282 - Posted: 20 Mar 2008, 9:01:36 UTC

IOW, this situation is one that could "fall through the cracks" no matter what is done with the server-side aborts.

But if IR = Q it would never have been sent to 2 of the 5 machines. Thus 2 of the machines have crunched the wu for no good reason. If any of the original replication machines had failed, then send it out again. A lot of projects do that, often with a shorter than usual deadline. Some, SIMAP for example, send failed units out again to "known good hosts" that have a proven track record of returning good work promptly, (I receive a good number of those myself).
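That reissue-on-failure policy might be sketched like this (hypothetical structures, not the actual BOINC transitioner):

```python
from dataclasses import dataclass, field

# Sketch of the IR == quorum, reissue-on-failure policy described above
# (hypothetical structures; not the actual BOINC transitioner).
@dataclass
class Workunit:
    quorum: int
    valid: int = 0
    outstanding: int = 0
    reissues: list = field(default_factory=list)

def issue_initial(wu):
    wu.outstanding = wu.quorum          # IR == Q: no extra replicas up front

def on_result(wu, ok):
    wu.outstanding -= 1
    if ok:
        wu.valid += 1
    elif wu.valid + wu.outstanding < wu.quorum:
        # a replica failed: reissue, e.g. with a shorter deadline,
        # preferentially to a proven 'known good' host
        wu.reissues.append({"shorter_deadline": True, "reliable_host": True})
        wu.outstanding += 1

wu = Workunit(quorum=3)
issue_initial(wu)
for ok in (True, False, True, True):    # one of the four results errors out
    on_result(wu, ok)
```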

I do consider it to be important as I consider the squandering of a freely given resource which could have been doing something more productive to be, at best, arrogant, others would probably choose other terminology.

You and I are unlikely ever to agree on this point. You have behaved in a generally civil manner however and I thank you for the debate.



Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 19282
Brian Silvers
Joined: 3 Jan 07
Posts: 124
Credit: 7,065
RAC: 0
Message 19284 - Posted: 20 Mar 2008, 10:00:54 UTC - in response to Message 19282.  
Last modified: 20 Mar 2008, 10:06:12 UTC


But if IR = Q it would never have been sent to 2 of the 5 machines. Thus 2 of the machines have crunched the wu for no good reason.


That is true, if and only if the first 3 were all "strongly similar" (a validation match). This harkens way back to the discussion in this same thread about failure rates on the initial replication. Just because 3 are sent out does not mean that 3 "good" results will come back on the first pass. At the time the decision was made, the server side abort capability didn't exist as it is today. Based on that and the construction goals, this was a valid decision, IMO. Your position is along the lines of "well, if the first 3 are not a match, surely it can't take that long to send out more until you get a matching set". Their position was, "Why don't we try to increase the odds of a matching set on the first pass, given our construction deadlines?"

As someone else mentioned (Keck?), when the LHC is up and running, it is likely that an app other than SixTrack will be used.

What I guess I'm getting at is so long as the server code is old, this is akin to beating a dead and at least partially decomposed horse. Adding insult to injury upon that is the fact that the science application is being used for construction and fine-tuning purposes which have a need to be reported sooner rather than later.

On the one hand, I hear the reasoning that you could go with shorter deadlines. However, not only does a shorter deadline put deadline pressure on the clients, it also mucks around with the scheduling mechanism in BOINC. People get real bent out of shape about science application X "hogging" their system, so it is conceivable that what a shorter deadline will do is simply shift your "pain" to someone else, and we'll have a new and different vocal group complaining about how the deadlines are too short and are unfair.

If any of the original replication machines had failed, then send it out again. A lot of projects do that, often with a shorter than usual deadline. Some, SIMAP for example, send failed units out again to "known good hosts" that have a proven track record of returning good work promptly, (I receive a good number of those myself).


Not a bad idea, but it needs a rework of the server-side code to be supported here - code which is itself first in need of an upgrade, so the server-side upgrade is the choke point again. In lieu of all these "what if" alternate methods you and others propose, what is in place is the "best" implementation for what they need to get done, IMO, and in the project's opinion as well.


You and I are unlikely ever to agree on this point. You have behaved in a generally civil manner however and I thank you for the debate.


I'm not saying you're wrong, I'm just saying that given the circumstances, what we have is not really that bad. You and others seem to be looking at the issue through a microscope instead of backing up and looking at all the other factors surrounding the issue.

In summary, is the additional replication "wasteful"? Sometimes, perhaps, but not always. Could it be better? Of course it could, but given the circumstances, it is probably the best solution to meet the project's needs with the minimum amount of impact to those of you who feel as you do (as I pointed out, they could've had IR=10 or even higher)...
ID: 19284


©2024 CERN