Message boards : Number crunching : Why BOINC 6.X has issues with LHC@home and other things

Profile Neasan
Volunteer moderator
Volunteer tester
Joined: 30 Nov 06
Posts: 234
Credit: 11,078
RAC: 0
Message 20722 - Posted: 30 Oct 2008, 14:59:48 UTC

It's because of the screensaver graphics, so we are working on rolling out a new version of SixTrack sans graphics ASAP and, eventually, on finding a replacement package.

I swear this project is going to be more professional very soon. Alex and I are trying to implement a system around BOINC for job submission, application updates, server upgrades, a knowledge base, etc. None of this was in place, so it has to be almost reverse-engineered into place. This means it is quiet on the outside but moving along in the background, like the server upgrade (another one is coming soon) and the deletion of as many duplicate hosts as we could safely remove.

We have been reading the boards whenever we get 5 minutes, but I have been uber-snowed under with work nonsense, so I have not been replying (sorry). I can say that Alex has been out at CERN this week, so hopefully more great news will follow; if not, look out for a murder in the East End of London next week, although that wouldn't be very rare.

Thanks for sticking with us, folks, especially Dagorath, whom I have only contemplated killing on one or two occasions ;-)
ID: 20722
Simplex0

Joined: 26 Aug 05
Posts: 68
Credit: 545,660
RAC: 0
Message 20723 - Posted: 30 Oct 2008, 15:34:58 UTC

The error occurs regardless of whether you are running the graphics or not.

Do you mean that the screensaver graphics are always running, no matter whether you have enabled them or not???

I have disabled LHC@home for now, but I hope that you will eventually solve the problem. Good luck.

ID: 20723
Profile Alex Owen
Volunteer moderator

Joined: 19 Dec 06
Posts: 22
Credit: 28
RAC: 0
Message 20726 - Posted: 30 Oct 2008, 17:44:46 UTC - in response to Message 20722.  

Hello,
As Neasan said, I have been working out at CERN with Eric McIntosh, trying to work out why these jobs have been failing... To help debug this, we have sent some more jobs through the system; some of these jobs have run OK and some have failed. We are working on this, although we do not fully understand it... yet!

Alex
ID: 20726
Simplex0

Joined: 26 Aug 05
Posts: 68
Credit: 545,660
RAC: 0
Message 20727 - Posted: 30 Oct 2008, 18:33:20 UTC - in response to Message 20726.  

Hello,
As Neasan said, I have been working out at CERN with Eric McIntosh, trying to work out why these jobs have been failing... To help debug this, we have sent some more jobs through the system; some of these jobs have run OK and some have failed. We are working on this, although we do not fully understand it... yet!

Alex


Adversus solem ne loquitor ;)

ID: 20727
Stephen Balch 2

Joined: 16 Jul 08
Posts: 10
Credit: 18,767
RAC: 0
Message 20740 - Posted: 5 Nov 2008, 22:50:46 UTC

Neasan and Alex,

I know some people seem to be having problems with Boinc 6.2.*, but my installation, at least through 6.2.18, seems to be working correctly and without issues. Have you blocked Boinc 6.2.19 hosts from receiving work?

I can see the "Results ready to send" count change, and the "Results in progress" count change, and even the "Workunits waiting for validation" count change, even as I type this. No, I'm not relying on the server status page counts, just the fact that they change. I just cannot seem to get any work from this project. I haven't gotten any work since 15 October 2008 (with Boinc 6.2.18). I'm pretty sure there is work this morning, but:

11/5/2008 12:20:29|lhcathome|Sending scheduler request: Requested by user. Requesting 864000 seconds of work, reporting 0 completed tasks
11/5/2008 12:20:34|lhcathome|Scheduler request succeeded: got 0 new tasks


I've been getting this class of messages for hours this morning (local time, US CST, GMT-6).

Previously, when I did manage to get work, I got it back to the project as quickly as possible, suspending the other projects to provide good turnaround here. My computer summary shows "Average turnaround time 0 days, Maximum daily WU quota per CPU 10/day", so the project shouldn't have any performance issues with me and mine. I have an AMD Turion 64 X2 TL-60 (dual core) at 2.0 GHz with 2 GB RAM running Vista Home Premium x86 (32-bit) with SP1. I don't believe LHC is running HR (homogeneous redundancy), so the platform shouldn't matter. While it is not the fastest machine on the project, it's not the slowest either.

My project settings are a resource share of 400 (was 300, and all other projects suspended); the "processor usage" and "disk and memory usage" tab settings have not changed. The "network usage" settings are: "Connect about every" = 0.5000 days, "Additional work buffer" = 4.500 days, and nothing else on that tab has been changed.
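The "Requesting 864000 seconds of work" figure in the log lines above is consistent with these settings under one simple assumption: that the client asks for enough work to cover the connect interval plus the additional buffer on every CPU. The sketch below only checks that arithmetic; the formula is an assumed model, not the actual Boinc 6.2 work-fetch code, and `work_request_seconds` is a hypothetical name.

```python
SECONDS_PER_DAY = 86_400


def work_request_seconds(connect_every_days: float,
                         additional_buffer_days: float,
                         ncpus: int) -> int:
    """Assumed model (not Boinc's real work-fetch code): ask for enough
    CPU-seconds to cover the connect interval plus the extra buffer on
    every CPU of the host."""
    buffer_days = connect_every_days + additional_buffer_days
    return round(buffer_days * SECONDS_PER_DAY * ncpus)


# Stephen's settings: 0.5 + 4.5 days on a dual-core Turion 64 X2.
print(work_request_seconds(0.5, 4.5, 2))  # 864000
```

Under this assumed model, 5 days × 86,400 s/day × 2 cores = 864,000 s, i.e. 10 CPU-days of work requested in a single scheduler request.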

If Boinc is allowed to contact LHC at its leisure, I never get any work, because of the ever-increasing delay built into the system when no results are received on a communication attempt. Because of this, I feel I need to micro-manage Boinc when work does seem to be available on LHC, just to try to get any work, but then I get the sequence of messages below. By the time Boinc decides to try contacting the project on its own, in 2 1/2 hours or more, there almost certainly won't be any work available. I've seen that.
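The "ever-increasing delay" described above is a client-side backoff: each scheduler reply with no work pushes the next automatic attempt further out. The sketch below shows randomized exponential backoff in general; the 60-second base delay, the 4-hour cap, and the name `next_backoff` are hypothetical illustration values, not Boinc's actual parameters or code.

```python
import random
from typing import Optional


def next_backoff(consecutive_failures: int,
                 base_seconds: float = 60.0,
                 cap_seconds: float = 4 * 3600.0,
                 rng: Optional[random.Random] = None) -> float:
    """Delay in seconds before the next automatic scheduler request.

    Randomized exponential backoff: the upper bound doubles after each
    consecutive reply with no work, up to cap_seconds, and the actual delay
    is drawn uniformly from [upper / 2, upper]. base_seconds and cap_seconds
    are hypothetical values, not Boinc's real settings.
    """
    if rng is None:
        rng = random.Random()
    # Clamp the exponent so huge failure counts cannot overflow a float;
    # the cap is reached long before 2**30 anyway.
    exponent = min(max(consecutive_failures, 0), 30)
    upper = min(cap_seconds, base_seconds * 2 ** exponent)
    return rng.uniform(upper / 2, upper)


rng = random.Random(42)
for failures in range(10):
    minutes = next_backoff(failures, rng=rng) / 60
    print(f"after {failures} empty replies: wait about {minutes:.0f} min")
```

Randomizing the delay keeps thousands of hosts from retrying in lockstep, which protects the server; the cost, as described above, is that a host may be waiting out its backoff when a small batch of work appears and disappears.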

11/5/2008 12:20:29|lhcathome|Sending scheduler request: Requested by user. Requesting 864000 seconds of work, reporting 0 completed tasks
11/5/2008 12:20:34|lhcathome|Scheduler request succeeded: got 0 new tasks
11/5/2008 12:22:40|lhcathome|Sending scheduler request: Requested by user. Requesting 864000 seconds of work, reporting 0 completed tasks
11/5/2008 12:22:45|lhcathome|Scheduler request succeeded: got 0 new tasks

11/5/2008 12:22:45|lhcathome|Message from server: Not sending work - last request too recent: 130 sec

The server message above would indicate to me that there is work available, but because of some "wonderful" decision by someone at the project (or perhaps the Boinc developers), I can't get any of it. It is extremely frustrating. I don't want to hammer the project with work requests, but that seems to be the only option available once the "Results ready to send" count shows the possibility of available work. Now it seems that option is also blocked by this minimum time between requests.

If you can, please try to remove this bottleneck in the system, or at least set the minimum request period to something reasonable (like 5 seconds or less).
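The "last request too recent" message appears to come from a server-side minimum interval between scheduler requests from the same host (BOINC projects can configure such an interval, for example through the `min_sendwork_interval` option in the project's config.xml). The sketch below is a hypothetical Python model of such a gate, not the scheduler's real code; `SendWorkGate` and the 300-second interval are assumptions, and what the number in the real message counts may differ from the elapsed time reported here.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SendWorkGate:
    """Hypothetical per-host minimum interval between granted work requests."""
    min_interval_s: float
    # host id -> time (seconds) of that host's last *granted* request
    last_granted: Dict[int, float] = field(default_factory=dict)

    def check(self, host_id: int, now: float) -> Tuple[bool, str]:
        last = self.last_granted.get(host_id)
        if last is not None and now - last < self.min_interval_s:
            elapsed = int(now - last)
            # Rejected requests do not reset the timer in this sketch.
            return False, f"Not sending work - last request too recent: {elapsed} sec"
        self.last_granted[host_id] = now
        return True, "ok"


gate = SendWorkGate(min_interval_s=300)
print(gate.check(host_id=1, now=0.0))    # (True, 'ok')
print(gate.check(host_id=1, now=130.0))  # (False, 'Not sending work - last request too recent: 130 sec')
print(gate.check(host_id=1, now=400.0))  # (True, 'ok')
```

This sketch records only granted requests. A scheduler that instead timed the interval from the last request of any kind would keep rejecting a host that polls faster than the interval, so manual retries would never get through.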

Now server communications have failed yet again, an all-too-common occurrence on this project. With a budget like CERN is supposed to have, why can't they drop a few more (appropriate units of currency) on this end of the project to provide additional assistance for you two guys and a more reliable communications link with more bandwidth? I remember how badly the project servers were hammered during the Grand Opening.

11/5/2008 13:03:45|lhcathome|Scheduler request failed: Couldn't connect to server
11/5/2008 13:04:16|lhcathome|Sending scheduler request: Requested by user. Requesting 864000 seconds of work, reporting 0 completed tasks

11/5/2008 13:04:38||Project communication failed: attempting access to reference site
11/5/2008 13:04:40||Internet access OK - project servers may be temporarily down.
11/5/2008 13:04:41|lhcathome|Scheduler request failed: Couldn't connect to server

-- and --

11/5/2008 13:08:51|lhcathome|Sending scheduler request: Requested by user. Requesting 864000 seconds of work, reporting 0 completed tasks
11/5/2008 13:09:13||Project communication failed: attempting access to reference site
11/5/2008 13:09:15||Internet access OK - project servers may be temporarily down.
11/5/2008 13:09:16|lhcathome|Scheduler request failed: Couldn't connect to server
11/5/2008 13:20:56|lhcathome|Sending scheduler request: Requested by user. Requesting 864000 seconds of work, reporting 0 completed tasks
11/5/2008 13:21:34||Project communication failed: attempting access to reference site
11/5/2008 13:21:36||Internet access OK - project servers may be temporarily down.
11/5/2008 13:21:36|lhcathome|Scheduler request failed: Server returned nothing (no headers, no data)
11/5/2008 13:22:36|lhcathome|Fetching scheduler list
11/5/2008 13:23:01|lhcathome|Master file download succeeded
11/5/2008 13:23:06|lhcathome|Sending scheduler request: Requested by user. Requesting 864000 seconds of work, reporting 0 completed tasks
11/5/2008 13:23:11|lhcathome|Scheduler request succeeded: got 0 new tasks


The comms problem does not appear to be my local connection, since I am also accessing a UK commercial site (which has links to a German site) in another tab in Firefox 3.

I would like to make a suggestion, if I may: instead of dribbling out what appears to be just a few WUs at a time into the "Results ready to send" queue (all of which are probably gone before I even see the count on the page), could you collect them and drop larger blocks (thousands?) of WUs into the queue at a time?

On 30 October 2008 Neasan stated, "I swear this project is going to be more professional very soon". I would like to know when "very soon" is. I'm really very frustrated with LHC. Please don't take this as an attack on you two; I think the project is in great hands with the Irish and Welsh (?) working on it. <GRIN> I have confidence in you both. I know you are working on it, and having been a professional in Data Processing/Information Management Systems, I know it can take some time to debug and fix problems. It's just my frustration at not being able to get work when I know it's available.

Cheers,
Stephen

P.S. Well, lost another window (you'll pardon the expression) for getting work; the "Results in progress" count is dropping steadily... maybe next time (but I won't hold my breath).
I The perversity of the universe tends to a maximum.
II If something can go wrong, it will.
-- Finagle's First and Second Laws

Join Team Richard Feynman and crunch in memory of the great Physicist and Teacher (and bongo player) !!!
ID: 20740
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,679,969
RAC: 235,452
Message 20742 - Posted: 6 Nov 2008, 19:41:35 UTC - in response to Message 20741.  

I think they don't have much time to run the project.

Maybe you could volunteer your services to help setup the system for free?
ID: 20742
Simplex0

Joined: 26 Aug 05
Posts: 68
Credit: 545,660
RAC: 0
Message 20744 - Posted: 6 Nov 2008, 22:30:58 UTC - in response to Message 20743.  

Why should I work for them for free when their parent project has a $6 trillion budget and more money on the way? If they want me they can pay me a respectable wage.



So that's the reason why you complain time after time after time after time about the same issue instead of just choosing to run another project. You are one of those consultants who is looking for a job. :)
ID: 20744
jrobbio

Joined: 12 Sep 08
Posts: 10
Credit: 2,747
RAC: 0
Message 20746 - Posted: 7 Nov 2008, 11:13:33 UTC - in response to Message 20745.  

Why should I work for them for free when their parent project has a $6 trillion budget and more money on the way? If they want me they can pay me a respectable wage.



So that's the reason why you complain time after time after time after time about the same issue instead of just choosing to run another project. You are one of those consultants who is looking for a job. :)


Nice try, Tomas, but you are wrong :)

I do run other projects. None of my computers are attached to LHC, because more than 30% of the work done for LHC is wasted effort. It is wasted because they have set IR > minQ (initial replication greater than minimum quorum). Using IR > minQ to get results verified sooner was justifiable years ago, when the BOINC server did not have as many features as it does now. Modern versions of the BOINC server have features that permit efficient strategies for getting results verified quickly. Unfortunately, LHC steadfastly refuses to implement those strategies and use the CPU time donated to them efficiently, the way professionally run projects attempt to do.

The reason I complain about LHC's wasteful practices is that they steal CPU time away from other worthy projects.



Do you mean something like what WCG suggests in this document?
http://boinc.berkeley.edu/trac/attachment/wiki/WorkShop08/ServerManagement-BOINC2008.pdf?format=raw

Same doc as a PowerPoint:
http://boinc.berkeley.edu/trac/attachment/wiki/WorkShop08/ServerManagement-BOINC2008.ppt?format=raw
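The "more than 30% wasted" argument quoted above rests on simple arithmetic: if a workunit is sent to IR hosts but only minQ matching results are needed to validate it, the remaining IR - minQ results add nothing once the quorum is met. A minimal sketch, assuming every replica runs to completion and ignoring errors, timeouts and late returns (which are what the extra replicas are meant to cover); the values 5 and 3 are hypothetical, since the thread does not state LHC@home's actual settings.

```python
def redundant_fraction(initial_replication: int, min_quorum: int) -> float:
    """Share of issued replicas beyond the minimum quorum.

    Simplified model: every replica runs to completion, and results beyond
    min_quorum add nothing once the quorum validates. Errors, timeouts and
    late returns are ignored.
    """
    if not 1 <= min_quorum <= initial_replication:
        raise ValueError("need 1 <= min_quorum <= initial_replication")
    return (initial_replication - min_quorum) / initial_replication


print(redundant_fraction(5, 3))  # 0.4  (hypothetical IR=5, minQ=3)
print(redundant_fraction(3, 3))  # 0.0  (IR equal to minQ: no surplus replicas)
```

Under this model the redundant share exceeds 30% whenever minQ < 0.7 × IR, for example IR = 5 with minQ = 3.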


ID: 20746
Profile Neasan
Volunteer moderator
Volunteer tester
Joined: 30 Nov 06
Posts: 234
Credit: 11,078
RAC: 0
Message 20764 - Posted: 13 Nov 2008, 17:03:29 UTC - in response to Message 20743.  

LHC@home's parent project has a $6 trillion budget

LHC@home has no parent project. We have nothing to do with the LHC, other than that LHC scientists and engineers use our system to run their calculations. The resources and time are donated by all: Alex, me, the server here at QMUL, your computers, your time. This is all donated.

The LHC has a massive budget and all of it is spoken for; that is how science works. You ask for X to do Y and you (might) get X to do Y, not Y + Z.

You think you're frustrated with the money thing? I do LHC@home in my spare time and have 3 other jobs so they can keep me funded.

I don't know how you think science funding works, but you seem to have it wrong. The LHC has so far cost £3.5bn in total (including the Grid infrastructure and people costs), and that has been spent; it is gone. This cost has also been spread between the contributing nations. That may sound like a lot of money, but the National Health Service here in the UK alone costs £105.6bn each year; that is 30 LHCs.

Science budgets are tiny, and we have bid time and time again for money for this project but have been rejected. Dagorath, we've had this discussion before: you are volunteering your time, and if you are unhappy you are free to leave and stop contributing whenever you want.

In other news, we are not blocking any particular clients; that is not what is preventing you from getting work. You also have to wait a certain length of time between requests; this prevents flooding the server with requests and taking down the service. There is also a quota on how much work you can get in a day, so you may have reached that. Also, there may not actually have been work: the numbers on the front page are generated, not real-time, but they are usually correct to within a certain margin.
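The daily quota mentioned above is a per-host cap; Stephen's computer summary earlier in the thread lists "Maximum daily WU quota per CPU 10/day". A minimal sketch, assuming the cap is simply the per-CPU quota times the number of CPUs and resets once a day; `DailyQuota` is a hypothetical name, and the real scheduler's rules (for example how errors reduce a host's quota) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class DailyQuota:
    """Hypothetical per-host daily cap: per-CPU quota times CPU count."""
    per_cpu: int
    ncpus: int
    sent_today: int = 0

    @property
    def limit(self) -> int:
        return self.per_cpu * self.ncpus

    def grant(self, requested: int) -> int:
        """Number of tasks that may still be sent today (0 once the cap is hit)."""
        granted = max(0, min(requested, self.limit - self.sent_today))
        self.sent_today += granted
        return granted

    def reset_for_new_day(self) -> None:
        self.sent_today = 0


quota = DailyQuota(per_cpu=10, ncpus=2)  # 10/day per CPU on a dual core
print(quota.grant(15))  # 15
print(quota.grant(15))  # 5  (cap of 20 reached)
print(quota.grant(1))   # 0
```

In this model, a request made after the cap is reached simply gets nothing until the day rolls over, which from the client side would look much like the "got 0 new tasks" lines in Stephen's log.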

Also, the scientists are looking at the quorum and initial replication, so there could be news on that front too.
ID: 20764
Profile Neasan
Volunteer moderator
Volunteer tester
Joined: 30 Nov 06
Posts: 234
Credit: 11,078
RAC: 0
Message 20766 - Posted: 14 Nov 2008, 12:36:08 UTC - in response to Message 20765.  

the very least they can do and it would take much from their budget


The least they can do is nothing, and I have explained this: the budget is set. They asked for money to do certain things, they got the money to do certain things, and LHC@home was not in that plan. THEY ARE NOT ALLOWED TO DIVERT FUNDS THAT HAVE BEEN ALLOCATED.


And if they don't like that then let them buy the CPU time they pee down the drain. I have a hunch things would get fixed real quick if they had to do that.

The work just wouldn't get done in that case. There is no money for this, no money at all: zilch, zero, nada, nothing. If they had to buy the CPU time, this work would not get done. This is not the perfect scenario for the LHC, but that is what they would have to do.

You seem to think we relish "stealing CPU" and working on a best-effort basis with no money. I can assure you this project is very frustrating to work on without you telling me so (by the way, thanks for the reminder; I was actually having a good day).

Do you think we have a massive pile of money we control and we are just sitting on it, laughing maniacally? If we could make money and more effort appear, we would have done so by now; all we can do is keep bidding for money from various bodies.

Yes, what this project does is important, but when they look at the list of things to be done, they see more important things, like actually fixing the machine after the failure in September.

Your badgering doesn't "shame the powers that be"; it just winds me up. We know you have misgivings about the initial replication and minimum quorum, and we have been talking to the scientists about it; the new SixTrack they are working on should also allow both of these numbers to drop. In this case the squeaky wheel does not get the oil; it merely makes the user consider walking.
ID: 20766
Lord Crc

Joined: 1 Dec 06
Posts: 13
Credit: 765,437
RAC: 0
Message 20778 - Posted: 18 Nov 2008, 3:28:49 UTC - in response to Message 20767.  

Regarding diverting funds... there is someone who can divert funds, there always is. If funds haven't been diverted then you just haven't spoken to the right person or else you have but they're pretending to be deaf.


In other words, you have absolutely no idea what you\'re talking about at all.

As for me, I consider the scientists' time more valuable than my CPU's idle cycles.
ID: 20778
Lord Crc

Send message
Joined: 1 Dec 06
Posts: 13
Credit: 765,437
RAC: 0
Message 20781 - Posted: 18 Nov 2008, 12:43:12 UTC - in response to Message 20779.  

Nice try but guess again.


So, where in the world can you divert funds from a gov't-funded research project by simply "[speaking] to the right person"?

Fools and their money always part quickly.

I'm gladly giving less than $0.1 a month to what I consider the most impressive project since the great pyramids were built. If you think that's foolish, fine for you, I couldn't care less. Personally I think it's great that I can contribute, even if it's just a tiny tiny bit.
ID: 20781
©2024 CERN