61) Message boards : Number crunching : Stand back....... (Message 18399)
Posted 27 Oct 2007 by Brian Silvers
Post:
I say:
"Open the floodgates! Let's crunch these freaking numbers!!"


The quota needs to be something reasonable. If it is not, then you will have all these gripes and complaints about people hoarding stuff for themselves. Never mind that no project guarantees all participants work at all times.

A reasonable quota to start with would be 8/core (4 core max). If there is sufficient workload behind all this, then doubling that to 16/core would keep most everyone happy, I would think...

One thing that we volunteers need to consider, though, is our resource allocations. I'm currently only at 1% for this project. I'm turning stuff in much quicker than I should be because I'm suspending other work while my 4/day are worked on... I set this project at 1% some time ago because I felt it would be a backup in case SETI and Einstein were both down. If the project is going to have a continual stream of work, then I need to make some sort of decision as to exactly how much time I wish to give to this project...

Brian
62) Message boards : Number crunching : Server not reporting "No Work from project" (Message 18397)
Posted 27 Oct 2007 by Brian Silvers
Post:
It would be nice if BOINC manager would always set the next try for 30 minutes, 18 seconds... but that's a subject for another thread.

Thanks again, Brian. I'm using the "no new tasks" technique and setting my wrist watch for 31 minutes... so simple, yet so smart.


It depends on various conditions (that I don't know/understand) as to what delay time is set for the next comm attempt. IMO, with the current work quota set so low, I don't see the need for the backoff anyway... I would think that the current 30 minute policy probably increased the load on the scheduler.

Edit: OK, I figure perhaps the 30-minute backoff was to try to get reporting done before forcing the 24-hour backoff. Dunno what to say. I'll check to see what happens with the 4 units I pick up over the next 2 hours, but I think the 30-minute rule doesn't get the final 2 results reported until tomorrow unless I hit the update button myself...

Brian
63) Message boards : Number crunching : Server not reporting "No Work from project" (Message 18393)
Posted 27 Oct 2007 by Brian Silvers
Post:
Your issue is one where the server *IS* reporting no work, so it might be beneficial to start a new post instead of using this thread...

FYI, the current minimum time between scheduler contacts is a little over 30 minutes, so you need to simply set that box to no new tasks for 31 minutes and then try again...

Brian

One of my PC's, the VISTA box using 5.10.20, is getting this message stream when left unattended and looking for work:

10/27/2007 1:05:23 PM|lhcathome|Message from server: Not sending work - last request too recent: 65 sec
10/27/2007 1:05:23 PM|lhcathome|Deferring communication for 1 min 0 sec
10/27/2007 1:05:23 PM|lhcathome|Reason: no work from project
10/27/2007 1:06:24 PM|lhcathome|Sending scheduler request: To fetch work
10/27/2007 1:06:24 PM|lhcathome|Requesting 1199 seconds of new work
10/27/2007 1:06:29 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/27/2007 1:06:29 PM|lhcathome|Message from server: Not sending work - last request too recent: 65 sec
10/27/2007 1:06:29 PM|lhcathome|Deferring communication for 1 min 44 sec
10/27/2007 1:06:29 PM|lhcathome|Reason: no work from project
10/27/2007 1:08:14 PM|lhcathome|Sending scheduler request: To fetch work
10/27/2007 1:08:14 PM|lhcathome|Requesting 1199 seconds of new work
10/27/2007 1:08:19 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/27/2007 1:08:19 PM|lhcathome|Message from server: Not sending work - last request too recent: 111 sec
10/27/2007 1:08:19 PM|lhcathome|Deferring communication for 1 min 0 sec
10/27/2007 1:08:19 PM|lhcathome|Reason: no work from project
10/27/2007 1:09:24 PM|lhcathome|Sending scheduler request: To fetch work
10/27/2007 1:09:24 PM|lhcathome|Requesting 1199 seconds of new work
10/27/2007 1:09:29 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/27/2007 1:09:29 PM|lhcathome|Message from server: Not sending work - last request too recent: 71 sec
10/27/2007 1:09:29 PM|lhcathome|Deferring communication for 5 min 16 sec
10/27/2007 1:09:29 PM|lhcathome|Reason: no work from project
10/27/2007 1:14:50 PM|lhcathome|Sending scheduler request: To fetch work
10/27/2007 1:14:50 PM|lhcathome|Requesting 1200 seconds of new work
10/27/2007 1:14:55 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/27/2007 1:14:55 PM|lhcathome|Message from server: Not sending work - last request too recent: 325 sec
10/27/2007 1:14:55 PM|lhcathome|Deferring communication for 19 min 10 sec
10/27/2007 1:14:55 PM|lhcathome|Reason: no work from project
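
For anyone who wants to check the backoff numbers in a log like this rather than eyeball them, here is a minimal Python sketch. It assumes the `date time|project|message` layout and the "N min N sec" wording seen in the pasted lines; the function names are made up for illustration and are not part of BOINC. It pairs each deferral with the next scheduler request and reports the requested delay next to the actual wait, both in seconds.

```python
import re
from datetime import datetime

# Four lines copied from the log above (10/27/2007, VISTA box).
LOG = """\
10/27/2007 1:06:29 PM|lhcathome|Deferring communication for 1 min 44 sec
10/27/2007 1:08:14 PM|lhcathome|Sending scheduler request: To fetch work
10/27/2007 1:09:29 PM|lhcathome|Deferring communication for 5 min 16 sec
10/27/2007 1:14:50 PM|lhcathome|Sending scheduler request: To fetch work
"""

# Assumption: only the "N min N sec" / "N sec" forms seen in these logs are handled.
DEFER = re.compile(r"Deferring communication for (?:(\d+) min )?(\d+) sec")


def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"), project, message


def deferral_seconds(message):
    """Return the deferral length in seconds, or None if this is not a deferral line."""
    match = DEFER.search(message)
    if match is None:
        return None
    return int(match.group(1) or 0) * 60 + int(match.group(2))


def check_backoffs(log_text):
    """Pair each deferral with the next request: (requested_s, actual_wait_s)."""
    pairs = []
    pending = None
    for line in log_text.splitlines():
        if line.count("|") < 2:
            continue  # skip blank or malformed lines
        when, _project, message = parse_line(line)
        secs = deferral_seconds(message)
        if secs is not None:
            pending = (when, secs)
        elif message.startswith("Sending scheduler request") and pending:
            waited = int((when - pending[0]).total_seconds())
            pairs.append((pending[1], waited))
            pending = None
    return pairs


print(check_backoffs(LOG))
```

On these four lines the client waits a second or a few longer than each deferral it logged, so the deferrals themselves are being honored; the short delays are simply much smaller than the 30 minutes the server appears to want.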

I have detached and rebooted with the same result, "no work from project" because of too recent a request.

I am currently focusing on POEM and LHC on my home PCs, and this is the only one doing this. The other 3 PCs are XP (2) and 2k (1).

64) Message boards : Number crunching : Stand back....... (Message 18392)
Posted 27 Oct 2007 by Brian Silvers
Post:
Just passed 100,000. I think the quota may be going up soon :-)


As I said some time ago, if they have work behind this to do, then a good starting point on the quota would be 8. That is more reasonable because it gives approx 4 hours of work for most modern machines, assuming a 30 minute running WU, which most of mine that ran for the full time have been.
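
The arithmetic behind that is simple enough to sketch in Python. The 30-minute runtime is the assumption stated above, and the function names are purely illustrative.

```python
import math


def hours_of_work(quota_per_day, minutes_per_wu=30):
    """Hours of crunching per day that a given daily quota provides."""
    return quota_per_day * minutes_per_wu / 60


def quota_for_hours(target_hours, minutes_per_wu=30):
    """Smallest daily quota that provides at least target_hours of work."""
    return math.ceil(target_hours * 60 / minutes_per_wu)


for quota in (4, 8, 16):
    print(quota, "WU/day ->", hours_of_work(quota), "hours")

print("quota needed for 4 hours:", quota_for_hours(4))
```

With 30-minute workunits, the current quota of 4 is about 2 hours of work per day, 8 gives 4 hours, and 16 gives 8 hours. Shorter-running workunits would need a proportionally higher quota for the same amount of work.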

Clearly, with the quota set at 4 and the current work generation rate, the project is generating work faster than it can be processed. The big "if" is whether the project can sustain this generation rate...

Brian
65) Message boards : Number crunching : The new look bugs (Message 18324)
Posted 20 Oct 2007 by Brian Silvers
Post:
It's impossible to select and copy parts of a page of my choice. When I try to select a line of text, e.g. on the results page, everything from the start of the page up to the current position is selected.
(Windows XP with IE 6)

IE bug, get a decent browser ;-)

Right, IE7 is the best thing for IE6 users. :-)


I can validate that IE7 has the same 1280 res issue. The table doesn't resize properly. Works fine in Firefox. From my brief dabbling with an XHTML/PHP class, I know that incompatibilities abound between the two of them...
66) Message boards : Number crunching : Server not reporting "No Work from project" (Message 18304)
Posted 19 Oct 2007 by Brian Silvers
Post:
Hmmm... This only started as of today... Why would a client bug manifest itself after several months of operation? Do you have a Trac bug number and a way I can look at the bug log?

I don't think it was reported on Trac. Devs probably found it themselves without any user reporting it.



Well, I decided to restart BOINC and see what happens. Next connect in 4 minutes. I'll edit this with the results and further comment...

Well, same thing... This is 5.8.16, so it's not super-old... I wasn't fond of what I was hearing about 5.10.x versions, so I hung back with 5.8.16... This is a minor nuisance...and I'd prefer to be able to see the deferral, so I'll just hang out for a while and see what happens...
67) Message boards : Number crunching : Server not reporting "No Work from project" (Message 18302)
Posted 19 Oct 2007 by Brian Silvers
Post:
The two deferral messages are a client bug. Upgrade to a newer version.


Hmmm... This only started as of today... Why would a client bug manifest itself after several months of operation? Do you have a Trac bug number and a way I can look at the bug log? I used to be able to browse the list without logging in, but I can't seem to find how I did that...
68) Message boards : Number crunching : Server not reporting "No Work from project" (Message 18300)
Posted 19 Oct 2007 by Brian Silvers
Post:
OK... It's getting a bit strange... You'll notice that within the same session there are two deferrals with the no work coming second. The next scheduler connect only has the one entry "requested by project" and goes back to the 30 minute interval instead of the longer backoff...

10/18/2007 9:49:50 PM|lhcathome|Sending scheduler request: To fetch work
10/18/2007 9:49:50 PM|lhcathome|Requesting 2960 seconds of new work
10/18/2007 9:49:55 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/18/2007 9:49:55 PM|lhcathome|Deferring communication for 30 min 18 sec
10/18/2007 9:49:55 PM|lhcathome|Reason: requested by project
10/18/2007 9:49:55 PM|lhcathome|Deferring communication for 41 min 9 sec
10/18/2007 9:49:55 PM|lhcathome|Reason: no work from project
10/18/2007 10:31:06 PM|lhcathome|Sending scheduler request: To fetch work
10/18/2007 10:31:06 PM|lhcathome|Requesting 3001 seconds of new work
10/18/2007 10:31:11 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/18/2007 10:31:11 PM|lhcathome|Deferring communication for 30 min 18 sec
10/18/2007 10:31:11 PM|lhcathome|Reason: requested by project
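
To pick out sessions like the first one without reading every line, something along these lines would do. It is a rough Python sketch that assumes the `date time|project|message` layout shown above; the function names are hypothetical. It groups the "Reason:" lines by timestamp and flags any instant where more than one deferral reason was logged.

```python
from collections import defaultdict

# Deferral-related lines copied from the log above.
LOG = """\
10/18/2007 9:49:55 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/18/2007 9:49:55 PM|lhcathome|Deferring communication for 30 min 18 sec
10/18/2007 9:49:55 PM|lhcathome|Reason: requested by project
10/18/2007 9:49:55 PM|lhcathome|Deferring communication for 41 min 9 sec
10/18/2007 9:49:55 PM|lhcathome|Reason: no work from project
10/18/2007 10:31:11 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/18/2007 10:31:11 PM|lhcathome|Deferring communication for 30 min 18 sec
10/18/2007 10:31:11 PM|lhcathome|Reason: requested by project
"""

PREFIX = "Reason: "


def deferral_reasons(log_text):
    """Map each timestamp to the deferral reasons logged at that instant."""
    reasons = defaultdict(list)
    for line in log_text.splitlines():
        if line.count("|") < 2:
            continue  # skip blank or malformed lines
        stamp, _project, message = line.split("|", 2)
        if message.startswith(PREFIX):
            reasons[stamp].append(message[len(PREFIX):])
    return dict(reasons)


def double_deferrals(log_text):
    """Keep only the timestamps where more than one reason was logged."""
    return {
        stamp: found
        for stamp, found in deferral_reasons(log_text).items()
        if len(found) > 1
    }


print(double_deferrals(LOG))
```

Only the 9:49:55 PM session is flagged. The next request in the log goes out at 10:31:06 PM, about 41 minutes later, which suggests the longer "no work from project" deferral is the one that actually took effect in that session.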
69) Message boards : Number crunching : Server not reporting "No Work from project" (Message 18296)
Posted 18 Oct 2007 by Brian Silvers
Post:
Main page says no work. Scheduler contacts give the following:

10/18/2007 7:17:44 PM|lhcathome|Sending scheduler request: To fetch work
10/18/2007 7:17:44 PM|lhcathome|Requesting 2526 seconds of new work
10/18/2007 7:17:49 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/18/2007 7:17:49 PM|lhcathome|Deferring communication for 30 min 18 sec
10/18/2007 7:17:49 PM|lhcathome|Reason: requested by project
10/18/2007 7:48:09 PM|lhcathome|Sending scheduler request: To fetch work
10/18/2007 7:48:09 PM|lhcathome|Requesting 2550 seconds of new work
10/18/2007 7:48:14 PM|lhcathome|Scheduler RPC succeeded [server version 505]
10/18/2007 7:48:14 PM|lhcathome|Deferring communication for 30 min 18 sec
10/18/2007 7:48:14 PM|lhcathome|Reason: requested by project
70) Message boards : Number crunching : Initial Replication (Message 18295)
Posted 18 Oct 2007 by Brian Silvers
Post:

If you need the results fast, just shorten the deadline, that's what it's for, and set IR at a decent level.


As I keep mentioning, the best way to leave the IR where the project has decided to set it, while keeping you folks who are up in arms over this happy, is to just implement the server-side aborts. No, it won't 100% eliminate "redundant results", but it will cut out a sizeable portion of them...

71) Message boards : Number crunching : Initial Replication (Message 18292)
Posted 18 Oct 2007 by Brian Silvers
Post:
you simply don't like my posts and you don't like them because it means less WUs.


I cut out all the noise and boiled it down to this. The "you" mentioned above is to be taken in general, not specific, as your words were aimed at those who disagree with you...

I think you can tell that I don't totally disagree with you. At least I hope you can tell that... Having said that, in my opinion, you are spamming multiple threads and what you're doing does border on hijacking the thread.

I've laid out some reasons why the additional replication can help speed the process up. Neasan has said that they will revisit the issue with the project scientists. Alex and Neasan are only the administrators of the servers. They have to abide by what the project scientists want.

That said, Neasan, I think the idea of doing the server-side aborts is worthwhile. It still allows the IR to be set to 5, but if a workunit has met quorum and has been validated, it allows you to attempt to cancel the remaining replications and possibly get that result into the science database a little bit faster. I say "attempt to" because if the version of BOINC that the host is using doesn't support the aborts, it won't work. I said "possibly" because if the host doesn't support the abort and/or one of the hosts doesn't contact the scheduler before the deadline, then it will still take the full duration of the longest deadline to be able to send the workunit through the assimilation process...

Brian

72) Message boards : Number crunching : Server thinks I recevied a WU (Message 18290)
Posted 18 Oct 2007 by Brian Silvers
Post:
Also I want to ask one stupid question (don't want to create separate topic):

Tell me (I'm beginner at LHC@home), why Upload/download and Scheduler servers are disabled?

Thank you very much :(


I'm not with the project in any way, but I noticed this yesterday when they were out of work. After a while, the main page said "low on work" and then there were 50000ish workunits available... Perhaps they take those jobs offline while they are generating work? Just a guess...

Brian
73) Message boards : Number crunching : Initial Replication (Message 18262)
Posted 18 Oct 2007 by Brian Silvers
Post:

Good point but having 5 crunchers working on WU A when quorum = 3 means WU B gets delayed (because 2 of those 5 crunchers could be working on WU B rather than A).


You could argue that replicating anything more than the quorum is causing a "delay" in work being done, but you have to keep in mind that the insertion into the science database is the ultimate goal, and it may or may not be delayed by more replication. The fact that LHC units are so short at this point in time and that nobody is holding a large cache makes it difficult to give you a good example of what can happen if you get a reissue.

To give you a better idea of what can happen, take a look at this example from Einstein: numerous reissues.

As you read down that list, the results were issued in order from top to bottom. The first two were generated. The 2nd host bombed out of it, so it got reissued. That was the same day, so not a big negative impact. So, the 3rd host reports, but the 1st host runs out the deadline. This causes another result to get issued, but it would've had 3 weeks to make it back in. This new host fails out the next day. Due to the way Einstein's data packs are handled, the next available host doesn't come along for a week. They too burn up the entire 3 weeks, and so another result has been issued.

Had the initial replication been higher, perhaps set at 3, another host would've picked up the result and run it successfully, thus making the longest time that result might've been in the work queue approximately 3 weeks. Instead, it is now 8+ weeks. Sure, there's no "guarantee" that the extra replication would've helped a bit, but it can help, depending on the circumstances.

Since the LHC units are so short running and since people are not able to maintain large queues right now, the consequence of this has been minimized. Additionally, Einstein isn't a time-sensitive project. They can wait a few extra weeks for the results if need be. This is why they can do the lower replication.

As for task A and task B waiting, the bigger cause of any "wait" right now is the forced low quota...

To make the determination you're making that 5 is "wasteful", you really need to know the exact error rates on the first replication. I don't think someone outside of the project team can know that for a fact...



Earlier in this thread someone mentioned the error rate is 25%. Since that's never been disputed, I've assumed it's true.


That may or may not be a safe assumption. I'd seek clarification (politely) from Alex or Neasan.


I think the best thing to do is to implement the server-side aborts, like what SETI did, but leave the replication at 5.


If Neasan or Alex would agree to that then I would shut up.


Bear in mind that it may need a server upgrade, and folks like me who use BOINC 5.8.16 would not process the server-side requests, because support for them was added in BOINC 5.8.17 (I believe). I know 5.8.16 doesn't support it...

Brian
74) Message boards : Number crunching : Initial Replication (Message 18242)
Posted 17 Oct 2007 by Brian Silvers
Post:

Which pretty much proves they don't need the results back ultra-fast (the argument some people were using to justify IR=5 in spite of the fact IR=5 only slows things down).

<snip>

Well, if your objective is do unnecessary work then you can be happy. The fact is the project could be getting the same results (a quorum of 3) with 40% less work. Nobody should be happy about that.



What does indeed get slowed down is archiving the completed workunits. The workunit as a whole must remain so long as there is one resultID that hasn't been turned back in and that has not passed the deadline for the result.

From what I've been able to read (and experience), this project is much more sensitive to floating point math differences than others. I had a couple of results that were declared invalid just over this past week. In both cases I was either first or second to report a completed result. If the replication had been at 3 and quorum at 3, then there would've been at least one more replication made. That replication would have the same amount of time to be returned as the initial replication, but it causes the workunit as a whole to be waiting longer to be stored in the Master Science Database than perhaps a replication of 5, all with the same deadline, would have.

To make the determination you're making that 5 is "wasteful", you really need to know the exact error rates on the first replication. I don't think someone outside of the project team can know that for a fact...
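
To show why the error rate matters so much here, this is a back-of-the-envelope Python sketch. It is a simplification: it assumes each result independently comes back unusable with a fixed probability, and it uses the 25% figure quoted elsewhere in this thread, which nobody from the project has confirmed. The function name is made up for illustration.

```python
from math import comb


def p_quorum_first_round(replication, quorum, error_rate):
    """Probability that at least `quorum` of the initial results are usable.

    Assumes every result fails independently with probability `error_rate`,
    which is a simplification of how validation really works.
    """
    p_ok = 1.0 - error_rate
    return sum(
        comb(replication, good) * p_ok**good * error_rate**(replication - good)
        for good in range(quorum, replication + 1)
    )


# Unconfirmed 25% error rate, quorum of 3.
print(round(p_quorum_first_round(5, 3, 0.25), 3))  # IR=5 -> about 0.896
print(round(p_quorum_first_round(3, 3, 0.25), 3))  # IR=3 -> about 0.422
```

Under those assumptions, IR=5 reaches quorum without any reissue roughly 90% of the time, while IR=3 does so only about 42% of the time. With a much lower real error rate the gap shrinks, which is exactly why the actual numbers from the project team matter.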

I think the best thing to do is to implement the server-side aborts, like what SETI did, but leave the replication at 5.

Brian
75) Message boards : Number crunching : Initial Replication (Message 18223)
Posted 17 Oct 2007 by Brian Silvers
Post:
IR can be set to 5, like it is now, then when the server software version upgrade is done, LHC can implement the server-side "redundant result" cancellation. SETI did this and it worked out just fine. It's not so much needed now as they've gone down to IR=2, MQ=2...

Dagorath mentioned this some time ago in this thread, but I think the method of delivery of the idea was...less than ideal due to being intertwined with some bickering amongst several different people...

How it works is that if a quorum has been reached, when a client connects to the scheduler, any results that the host has that have already made quorum and validated can be cancelled from the server. This can be done one of three different ways:

1. If the client (host) has not started the result at all, delete the result from the host.
2. If the client has started the result, let it go to completion.
3. Delete the result regardless of whether or not it has been started.
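
The decision described by that list can be sketched in a few lines of Python. This is only an illustration of the logic as laid out in this post, not BOINC's actual server code, and every name in it is hypothetical. Options 1 and 2 are treated as two halves of the same "friendly" policy, and option 3 as a flag that overrides it.

```python
def decide(result_started, workunit_validated, client_supports_abort,
           abort_started=False):
    """Return "abort" or "keep" for one result still sitting on a host.

    abort_started=False models options 1 and 2: cancel only unstarted results
    and let started ones run to completion.
    abort_started=True models option 3: cancel regardless of progress.
    """
    if not workunit_validated:
        return "keep"   # quorum not yet reached and validated, still needed
    if not client_supports_abort:
        return "keep"   # an older client would not act on the request
    if not result_started:
        return "abort"  # option 1
    return "abort" if abort_started else "keep"  # option 3 versus option 2


# An unstarted result whose workunit has already validated:
print(decide(result_started=False, workunit_validated=True,
             client_supports_abort=True))
# A started result under the friendly policy, then under option 3:
print(decide(True, True, True))
print(decide(True, True, True, abort_started=True))
```

The `client_supports_abort` check reflects the caveat that hosts running a client without support for the feature would simply keep crunching the result.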

I want to say the support for that was included in BOINC 5.8.17, but I'm not sure... With the relatively short running results here, I wouldn't be opposed to option 3, although options 1 & 2 are the most "user-friendly"...

This allows the project to keep IR=5, but satisfies the concerns of people who are mentioning the waste of electricity...

FWIW, YMMV, etc, etc, etc...

Brian
76) Message boards : Number crunching : work units?? (Message 18210)
Posted 16 Oct 2007 by Brian Silvers
Post:
Personally, I'm pretty tired of waiting so long for WU's, then being limited to 4 a day. If they want to make sure that all the newbies get WU's, that's fine. They'll just have to find one more newbie to replace me.

--Mike

You make a good point, that is to not announce the press release until there is plenty of work for everybody (old timers and newbies ;).


The press release would tend to imply that there is more work on the horizon. The only problem is going to be if they do not deliver this work, then whatever "goodwill" that was purchased by having a few WUs available to work on will be burned up...

Personally, 2/day was ridiculously low anyway. The longest that I've seen any of my quorum partners take is just a bit under an hour, so for most modern systems, 20-30/day is well within the realm of possibility. If they want to attract and retain new folks, they need to provide at least 4 hours of work / day, IMO. This would seem to need a quota of 8 or 10. Double that for 8 hours of work, so 16-20/day.

It all boils down to whether or not they actually can provide a constant stream of work. If not, then there really was no benefit to drumming up support via new users...as it will only exasperate everyone, old and new...

IMO, YMMV, etc, etc, etc...

Brian
77) Message boards : Number crunching : Ghost result (server thinks I got it, but I didn't) (Message 18171)
Posted 15 Oct 2007 by Brian Silvers
Post:

I will investigate this but assume (as POVaddict seems to love mentioning) this will involve the server upgrade.


Thanks, but it may just give me one more result that ends up being declared invalid. Not sure why I'm getting a 50% invalid rate here now... :shrug:
78) Message boards : Number crunching : Overclocking?....Not! ;) (Message 18162)
Posted 15 Oct 2007 by Brian Silvers
Post:
I believe LHC is much more sensitive to floating point errors than other projects. They have even had problems with the slight differences between the way Intel and AMD implement floating point math. Your overclocked CPU is probably returning slightly incorrect results on other projects as well but they match closely enough that you still get credit. The thing with LHC is that if the first calculation is off by even 0.0000[...]001 then by the end you are way off because later calculations are all based on previously calculated numbers.


I too have an overclocked system and am getting a 50% invalid ratio at this point with LHC. I don't think that this happened before the move to the UK, but yes, I realize that with extended time of overclock/overvolt, problems can begin to happen...

Question to Alex/Neasan:

Are the validators the same as what they were before the move, or has there been any change made to the validator code?
79) Message boards : Number crunching : Ghost result (server thinks I got it, but I didn't) (Message 18140)
Posted 14 Oct 2007 by Brian Silvers
Post:
Those of us who participate with SETI@Home are familiar with this kind of thing happening. The admins at SETI figured out how to resend those resultIDs to the volunteers (or so I recall...someone correct me if I'm wrong)...

No need to "figure out" too much. It's just a setting that has existed for-freakin-ever: <resend_lost_results>.


Is that the same thing though? I thought that kicked in when the client reported that it disappeared from the state file? My client never saw the download at all, thus my side knows nothing about it...
80) Message boards : Number crunching : Ghost result (server thinks I got it, but I didn't) (Message 18138)
Posted 14 Oct 2007 by Brian Silvers
Post:
I have a workunit that the server thinks that I got, but my client never downloaded. The resultID is 8874855

Right when this happened, there was a hiccup with the LHC servers. The server page says that the result was sent to me at 14 Oct 2007 0:37:24 UTC. Here is my message log:


10/13/2007 8:33:51 PM|lhcathome|Sending scheduler request: To fetch work
10/13/2007 8:33:51 PM|lhcathome|Requesting 2516 seconds of new work, and reporting 1 completed tasks
10/13/2007 8:38:53 PM||Project communication failed: attempting access to reference site
10/13/2007 8:38:56 PM|lhcathome|Scheduler request failed: a timeout was reached
10/13/2007 8:38:56 PM|lhcathome|Deferring communication for 1 min 0 sec
10/13/2007 8:38:56 PM|lhcathome|Reason: scheduler request failed


Those of us who participate with SETI@Home are familiar with this kind of thing happening. The admins at SETI figured out how to resend those resultIDs to the volunteers (or so I recall...someone correct me if I'm wrong)...

Just thought you might like to know...

Brian




©2024 CERN