Message boards : Number crunching : Solution for LHC Long Term debt problem ?

FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 13905 - Posted: 9 Jun 2006, 17:35:48 UTC
Last modified: 9 Jun 2006, 17:36:14 UTC

Since I have attached LHC, all machines periodically check for new work obviously.

However, as LHC most of the time reports "No Work from Project", my Clients continually build up tremendous amounts of Long Term debt.

That alone wouldn't be a Problem, but the way BOINC V5.2.13 (didn't care to upgrade yet) handles Long Term debt, it actually and continually shrinks the effective Cache of all machines.

In the long run, even with work from 2 other attached Projects and "Connect to Network" every 3 days, the actual cache eventually shrinks to a single WorkUnit (may be as little as 30 Minutes).

That causes my entire Network to run dry in a matter of a few hours, no matter how high I set the Cache to compensate.

Now, the Golden Question :
Does any newer Version of BOINC fix this annoying Problem ?

Right now, my only chance to restore halfway normal caching is to Suspend LHC on all machines and reset the Project up to 10x in a row (in order to kill the hilarious Long Term debts).

If there's no stable and long-sighted Solution to this Problem, I'll eventually have to abandon LHC (in favor of not losing up to 50% of my computing power to machines running out of work).
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 13905

NJMHoffmann
Joined: 26 Nov 05
Posts: 16
Credit: 14,707
RAC: 0
Message 13906 - Posted: 9 Jun 2006, 20:13:14 UTC - in response to Message 13905.  
Last modified: 9 Jun 2006, 20:28:04 UTC

Now, the Golden Question :
Does any newer Version of BOINC fix this annoying Problem ?
I don't see a changed behaviour with 5.4.9 and there is no newer development version available AFAIK. It seems to work as designed. Try to post your question to the BOINC board.

Norbert

(Edit: Corrected link)
ID: 13906

Travis DJ
Joined: 29 Sep 04
Posts: 196
Credit: 207,040
RAC: 0
Message 13907 - Posted: 9 Jun 2006, 20:24:36 UTC

Does any newer Version of BOINC fix this annoying Problem ?


Honestly, I don't know. The 5.4.x BOINC client updates have to do with network connectivity, security, screen saver and performance enhancements. My observations on 5.4.9:

1) I'm attached to 5 projects
2) LHC & Rosetta RS:10
3) Einstein, CPDN, and LHC-Alpha RS:1
4) Connect to network set at 0.1 days
5) There's always work in my task list from every project that is issuing work, subject to project settings, WU deadlines, and LTD
6) When LHC issues WUs, BOINC works off the LTD as best it can with respect to project settings and WU deadlines
7) I haven't experienced the problem you're describing on this or prior versions of BOINC

FWIW, you don't have to continually reset your projects to get the LTD to disappear. You can manually remove it if you choose. If you're running BOINC as a service try this:

* Go to Start -> Run -> Type "net stop boinc" and click ok
* Close the Boinc Manager if it's open
* Find your BOINC folder in Program Files and then locate client_state.xml
* Right-Click on it and choose Edit (It should open in notepad)
* Look for LHC@Home's section and find the <long_term_debt>xxxx</long_term_debt>
* Replace xxxx with 0.000000
* Save & Exit Notepad
* Go to Start -> Run -> Type "net start boinc" *done*

If you're running BOINC only while the user is logged in, Just Exit the Boinc manager and skip the net start/stop commands.
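For anyone with more than a handful of machines, the manual edit above could be scripted. The following is only a minimal sketch, assuming each project lives in its own `<project>...</project>` block of client_state.xml and can be recognized by some text inside that block (the URLs in the sample are made up, and `zero_long_term_debt` is a hypothetical helper, not part of BOINC). It works on the file contents as a string:

```python
import re

# Matches the debt element named in the steps above.
DEBT_RE = re.compile(r"(<long_term_debt>)[^<]*(</long_term_debt>)")
# Assumption: every project is wrapped in its own <project> block.
PROJECT_RE = re.compile(r"<project>.*?</project>", re.DOTALL)


def zero_long_term_debt(state_text, project_marker):
    """Return state_text with long_term_debt set to 0.000000 in every
    <project> block that contains project_marker."""

    def fix_block(match):
        block = match.group(0)
        if project_marker not in block:
            return block  # leave other projects untouched
        return DEBT_RE.sub(r"\g<1>0.000000\g<2>", block)

    return PROJECT_RE.sub(fix_block, state_text)


# Made-up sample data, only to show the effect.
SAMPLE = """<client_state>
<project>
    <master_url>http://lhcathome.example/</master_url>
    <long_term_debt>467000.000000</long_term_debt>
</project>
<project>
    <master_url>http://other.example/</master_url>
    <long_term_debt>-74000.000000</long_term_debt>
</project>
</client_state>"""

fixed = zero_long_term_debt(SAMPLE, "lhcathome")
print(fixed)
```

As with the manual method, BOINC would have to be stopped before the real file is read, rewritten, and saved, and keeping a backup copy of client_state.xml first seems prudent.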

ID: 13907

FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 13908 - Posted: 9 Jun 2006, 21:56:19 UTC - in response to Message 13907.  

Hm, I forgot to mention that the Network is connected to the Internet only infrequently; as long as there's a Connection there's of course enough work, but it needs to be downloaded frequently.

I need a cache that keeps the machines busy during offline periods of up to approx. 30 hours (worst case).

( manually fixing the long term debt on 24 machines is unfortunately out of the question :/ )
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 13908

Steve Cressman
Joined: 28 Sep 04
Posts: 47
Credit: 6,394
RAC: 0
Message 13909 - Posted: 9 Jun 2006, 22:02:33 UTC
Last modified: 9 Jun 2006, 22:12:18 UTC

Or you could get Boinc Debt Viewer. It has a feature that allows you to reset your debts with the click of a button. But just like the other method, you need to shut down Boinc before doing so. And if you forget, it reminds you.
:)
98SE XP2500+ @ 2.1 GHz Boinc v5.8.8
ID: 13909

Bill Hepburn
Joined: 18 Sep 04
Posts: 10
Credit: 5,127,937
RAC: 101
Message 13911 - Posted: 9 Jun 2006, 22:27:54 UTC - in response to Message 13906.  

Now, the Golden Question :
Does any newer Version of BOINC fix this annoying Problem ?
I don't see a changed behaviour with 5.4.9 and there is no newer development version available AFAIK. It seems to work as designed. Try to post your question to the BOINC board.

Norbert

(Edit: Corrected link)


Interesting. I have noticed that for me, LHC seems to only rack up about 10000 seconds of Long Term Debt when it runs out of work. That is, of course, enough that it only wants to get work from LHC unless it is going to run out of work. But, when LHC comes back it only runs LHC for a couple of hours, then starts cycling through the other projects.

The "exponential backoff" doesn't seem to work for me as advertised. It seems to back off only to about three hours. After the three-hour backoff runs out, it goes through a bunch of one-minute backoffs, then a bunch of two-minute ones, etc. I usually wind up suspending LHC to avoid filling the logs with hundreds of messages telling me that LHC is out of work.
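For comparison, a textbook exponential backoff with a ceiling looks roughly like the sketch below. This is not BOINC's actual code; the one-minute base and three-hour cap are assumptions picked to match the numbers observed above, and `backoff_delays` is a hypothetical name.

```python
def backoff_delays(attempts, base_seconds=60, cap_seconds=3 * 3600):
    """Return the wait before each retry: base * 2**n, limited to the cap."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(attempts)]


delays = backoff_delays(10)
print(delays)
```

In this form the delay stays at the cap once it gets there, so a client that drops back to one-minute retries after the cap expires is resetting its retry counter somewhere.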

Nothing earth shattering, but enough to cause some moderate grumbling.
ID: 13911

Steve Cressman
Joined: 28 Sep 04
Posts: 47
Credit: 6,394
RAC: 0
Message 13912 - Posted: 10 Jun 2006, 5:10:06 UTC

To get a LTD around 10000 you probably have a switch between applications time of 180 minutes
98SE XP2500+ @ 2.1 GHz Boinc v5.8.8
ID: 13912

Bill Hepburn
Joined: 18 Sep 04
Posts: 10
Credit: 5,127,937
RAC: 101
Message 13913 - Posted: 10 Jun 2006, 5:19:12 UTC - in response to Message 13912.  

To get a LTD around 10000 you probably have a switch between applications time of 180 minutes


Precisely. It seems it ought to continue to accumulate debt beyond that, but it doesn't.

One of life's mysteries methinks.

ID: 13913

Travis DJ
Joined: 29 Sep 04
Posts: 196
Credit: 207,040
RAC: 0
Message 13915 - Posted: 10 Jun 2006, 7:47:13 UTC - in response to Message 13909.  

Or you could get Boinc Debt Viewer. It has a feature that allows you to reset your debts with the click of a button.


That utility sure does simplify that process. It's kinda cool for those who need it. :)

ID: 13915

FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 13920 - Posted: 10 Jun 2006, 16:46:08 UTC - in response to Message 13915.  

I'll also look into that.

Right now, my LHC Long term debt is anywhere between 74000 and 467000 Seconds...
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 13920

Nightbird
Joined: 13 Jul 05
Posts: 55
Credit: 41,230
RAC: 0
Message 13924 - Posted: 10 Jun 2006, 19:30:10 UTC - in response to Message 13915.  

Or you could get Boinc Debt Viewer. It has a feature that allows you to reset your debts with the click of a button.


That utility sure does simplify that process. It's kinda cool for those who need it. :)

It only works under XP.

Do you want to get banned for 31 years, your account and credits deleted at a Boinc project ? Predictor@home is your best choice.
ID: 13924

Philip Martin Kryder
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 13925 - Posted: 10 Jun 2006, 19:31:33 UTC - in response to Message 13905.  

...handles Long Term debt, it actually and continually shrinks the effective Cache of all machines....


Can someone explain the rationale behind this Cache shrinkage due to long term debt?


ID: 13925

Bill Hepburn
Joined: 18 Sep 04
Posts: 10
Credit: 5,127,937
RAC: 101
Message 13929 - Posted: 11 Jun 2006, 1:56:43 UTC - in response to Message 13925.  

...handles Long Term debt, it actually and continually shrinks the effective Cache of all machines....


Can someone explain the rationale behind this Cache shrinkage due to long term debt?




The description in the Wiki is confusing, at best. This is how I understand it. I could be wrong.

First, there really is no such thing as a cache setting. BOINC loads enough work to keep the machine busy until the next "Connect Interval". A long "connect interval" will, of course, mean more work is loaded at one time. There really is no "load x days work" setting.

Long term debt determines what program is queried for work. When Boinc needs to download (Download OK mode), it tries to get work from projects with a high LTD. It won't try to get work from projects with low LTD. Later, it decides that it must get work from somewhere because it is going to run out of work (Download required mode), it will get work from anyplace that has work, regardless of LTD.

So, I think in the case when LHC is out of work, it typically will have a relatively large LTD. As a result, BOINC won't download work until the scheduler goes into "Download Required".
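The two modes described above can be sketched as follows. This is only an illustration of the explanation in this post, not the real scheduler: the function name, the project names, and the idea that "high LTD" means "greater than zero" are all assumptions.

```python
def projects_to_ask(debts, about_to_run_dry):
    """Return the projects BOINC would ask for work, highest LTD first.

    debts: dict mapping project name -> long term debt in seconds.
    about_to_run_dry: True stands for "Download required" mode.
    """
    ranked = sorted(debts, key=debts.get, reverse=True)
    if about_to_run_dry:
        return ranked  # take work from anyone, regardless of LTD
    # "Download OK" mode: only projects that are owed CPU time (assumed: LTD > 0)
    return [name for name in ranked if debts[name] > 0]


# Hypothetical debts for a host where LHC has had no work for a while.
debts = {"LHC": 74000.0, "ProjectA": -30000.0, "ProjectB": -44000.0}
print(projects_to_ask(debts, about_to_run_dry=False))
print(projects_to_ask(debts, about_to_run_dry=True))
```

With these numbers only LHC is asked while the queue is comfortable, and since LHC has nothing to give, nothing is downloaded until the host is nearly out of work.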

ID: 13929

Philip Martin Kryder
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 13930 - Posted: 11 Jun 2006, 3:15:05 UTC - in response to Message 13929.  

...handles Long Term debt, it actually and continually shrinks the effective Cache of all machines....


Can someone explain the rationale behind this Cache shrinkage due to long term debt?




The description in the Wiki is confusing, at best. This is how I understand it. I could be wrong.

First, there really is no such thing as a cache setting. BOINC loads enough work to keep the machine busy until the next "Connect Interval". A long "connect interval" will, of course, mean more work is loaded at one time. There really is no "load x days work" setting.

Long term debt determines what program is queried for work. When Boinc needs to download (Download OK mode), it tries to get work from projects with a high LTD. It won't try to get work from projects with low LTD. Later, it decides that it must get work from somewhere because it is going to run out of work (Download required mode), it will get work from anyplace that has work, regardless of LTD.

So, I think in the case when LHC is out of work, it typically will have a relatively large LTD. As a result, BOINC won't download work until the scheduler goes into "Download Required".


There is no "cache setting" - got it.

What seems not to match my experience is:
1)The earlier posts about LHC being limited to 10000 LTD - mine was much higher.
Some of my other projects went negative.

2) your second to last paragraph seems to say that projects with a HIGH LTD will try to get work in Download OK Mode.
While your last paragraph seems to say that it won't go to "LARGE LTD" (is large different than high?) until "Download Required" mode...

So, If I have a High (?Large? LTD) - shouldn't that attract work?

Sorry to be so dense.

Phil

ID: 13930

Bill Hepburn
Joined: 18 Sep 04
Posts: 10
Credit: 5,127,937
RAC: 101
Message 13932 - Posted: 11 Jun 2006, 6:00:02 UTC - in response to Message 13930.  
Last modified: 11 Jun 2006, 6:08:07 UTC


The description in the Wiki is confusing, at best. This is how I understand it. I could be wrong.

First, there really is no such thing as a cache setting. BOINC loads enough work to keep the machine busy until the next "Connect Interval". A long "connect interval" will, of course, mean more work is loaded at one time. There really is no "load x days work" setting.

Long term debt determines what program is queried for work. When Boinc needs to download (Download OK mode), it tries to get work from projects with a high LTD. It won't try to get work from projects with low LTD. Later, it decides that it must get work from somewhere because it is going to run out of work (Download required mode), it will get work from anyplace that has work, regardless of LTD.

So, I think in the case when LHC is out of work, it typically will have a relatively large LTD. As a result, BOINC won't download work until the scheduler goes into "Download Required".


There is no "cache setting" - got it.

What seems not to match my experience is:
1)The earlier posts about LHC being limited to 10000 LTD - mine was much higher.
Some of my other projects went negative.

2) your second to last paragraph seems to say that projects with a HIGH LTD will try to get work in Download OK Mode.
While your last paragraph seems to say that it won't go to "LARGE LTD" (is large different than high?) until "Download Required" mode...

So, If I have a High (?Large? LTD) - shouldn't that attract work?

Sorry to be so dense.

Phil


As I understand it, the high LTD on LHC would attract work, but (in most cases) LHC has no work to give. Since LHC can't comply, and the other projects have a low LTD, no work comes in until BOINC decides that it is about to run out of work and will take it from any place it can get it.

The LTDs periodically are recomputed, so they sum to zero -- some get pushed negative (which is a low debt). For some reason, LHC never seemed to accumulate the much larger numbers you mention and I have no idea of why.
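A small sketch of what "recomputed so they sum to zero" could look like: subtract the average debt from every project. Whether BOINC does it exactly this way is an assumption, and `normalize_debts` and the project names are hypothetical.

```python
def normalize_debts(debts):
    """Shift all long term debts by their mean so that they sum to zero."""
    mean = sum(debts.values()) / len(debts)
    return {name: value - mean for name, value in debts.items()}


# LHC has built up debt while the other two projects start at zero.
before = {"LHC": 90000.0, "ProjectA": 0.0, "ProjectB": 0.0}
after = normalize_debts(before)
print(after)
```

This is how a project that is owed time ends up positive while the projects that did the work in its place get pushed negative.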

I'm really not very good at explaining this whole thing, since my understanding is pretty fuzzy. It does sort of make sense in the perverse way of all things computer.

Cheers.
ID: 13932

Steve Cressman
Joined: 28 Sep 04
Posts: 47
Credit: 6,394
RAC: 0
Message 13933 - Posted: 11 Jun 2006, 6:02:22 UTC - in response to Message 13913.  
Last modified: 11 Jun 2006, 6:30:20 UTC

To get a LTD around 10000 you probably have a switch between applications time of 180 minutes


Precisely. It seems it ought to continue to accumulate debt beyond that, but it doesn't.

One of life's mysteries methinks.

I'm not actually sure of all the details of the scheduler that JM VII wrote but the LTD is what is used to determine if it should ask for work. With your switch time of 180 minutes it will continue to ask for work until the LTD reaches -10800. Then it decides that there is too much debt when it goes beyond that value and prevents further d/l until the debt again rises above -10800. Then it starts asking for work again etc etc. Also you will find that as the value gets closer to -10800 it will ask for less and less work each time until it exceeds the limit of -10800.

Depending on the value of your switch time you can substitute the values below into the above statement.
60 min = 3600 sec
120 min = 7200 sec
180 min = 10800 sec
240 min = 14400 sec
etc.
:)
98SE XP2500+ @ 2.1 GHz Boinc v5.8.8
ID: 13933

Bill Hepburn
Joined: 18 Sep 04
Posts: 10
Credit: 5,127,937
RAC: 101
Message 13934 - Posted: 11 Jun 2006, 6:22:38 UTC - in response to Message 13933.  


Precisely. It seems it ought to continue to accumulate debt beyond that, but it doesn't.

One of life's mysteries methinks.

I'm not actually sure how JMVII wrote the scheduler but the LTD is what is used to determine if it should ask for work. With your switch time of 180 minutes it will continue to ask for work until the LTD reaches -10800. Then it decides that there is too much debt when it goes beyond that value and prevents further d/l until the debt again rises above -10800. Then it starts asking for work again etc etc
:)

That makes as much sense as anything. I'm amazed at how well it works on the whole. I'm attached to some combination of five projects, all with equal resource shares, and most of the time, it merrily rotates through them.
ID: 13934

Steve Cressman
Joined: 28 Sep 04
Posts: 47
Credit: 6,394
RAC: 0
Message 13935 - Posted: 11 Jun 2006, 6:56:38 UTC - in response to Message 13934.  


That makes as much sense as anything. I'm amazed at how well it works on the whole. I'm attached to some combination of five projects, all with equal resource shares, and most of the time, it merrily rotates through them.


You know what is fun? Go into the client_state.xml file and delete the negative signs from the debt values and then watch the scheduler freak out for a while as it tries to make all the values add up to zero again.

I was one of the testers for JM VII when he was first making the scheduler and I did a lot of testing of the scheduler. Was lots of fun trying to break it but I found it to be very robust. Even doing something like what I suggested above did not bother it too much. It would eventually get back to what it should be with all the values adding to zero.

I know, geeks have a strange way of having fun, lol.
:)
98SE XP2500+ @ 2.1 GHz Boinc v5.8.8
ID: 13935

NJMHoffmann
Joined: 26 Nov 05
Posts: 16
Credit: 14,707
RAC: 0
Message 13938 - Posted: 11 Jun 2006, 8:42:39 UTC - in response to Message 13934.  

That makes as much sense as anything.
Because it definitely is wrong. My switch time is 120 mins and my LTDs are bigger than ±7200. I think Steve mixed up LTD and STD. LTD has no influence on how much work is asked for, only on the decision whether work is asked for. And to keep track of "long term debt" it must keep bigger values.

Norbert
ID: 13938

FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 13941 - Posted: 11 Jun 2006, 14:17:18 UTC - in response to Message 13938.  

Since the BOINC Debt viewer is not really suitable for Networked operation, I went ahead and Suspended LHC + Reset Project about 10 times.

Immediately, the caches filled up to normal values again (and for the record, "Connect to Network every X days" equals "Set local Cache size to X days worth of work", which more or less works out, depending on how accurately BOINC guesstimated the machine's performance).

Looks like I'll just wait until I manually see LHC alive again, then temporarily switch LHC back on.

All in all, I assume this BOINC behaviour is an old but still untouched Scheduler Bug...
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 13941