Message boards : Number crunching : cpu time ok... but zero credits granted
Vid Vidmar*
Joined: 28 Sep 04
Posts: 27
Credit: 17,091
RAC: 0
Message 6255 - Posted: 2 Mar 2005, 10:33:53 UTC

Man! A state-of-the-art app that can't tell time. Wonder who's the fool here.
Will keep it running just for good laughs. ;]]

ID: 6255
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6256 - Posted: 2 Mar 2005, 10:51:57 UTC

Yes, as long as my Daily Credits don't drop too far I'll hang with it for a while longer, but there comes a point I'll have to get in the Life Raft myself if things get too bad ... hehe
ID: 6256
Vid Vidmar*
Joined: 28 Sep 04
Posts: 27
Credit: 17,091
RAC: 0
Message 6257 - Posted: 2 Mar 2005, 11:09:19 UTC

I've been investigating a bit further... and was unable to find which file LHC stores its progress info in. I did find a .zip file that was updated at the time CC switched from LHC to another project, but didn't find anything resembling CPU time in any of the zipped files.

ID: 6257
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 6258 - Posted: 2 Mar 2005, 11:18:48 UTC
Last modified: 2 Mar 2005, 11:52:30 UTC

I have to admit that I am probably going to drop the CPU percentage for the project. Almost all of my WU's are going back and earning zero. I have it currently set for 25% of my CPU time, that's about 8 hours per day. Those 8 hours could be getting credits for my team at one of the other projects.

It seems to have got much worse recently.

Example.

Look at the results before 1st March and then after.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 6258
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6259 - Posted: 2 Mar 2005, 11:27:26 UTC

It seems to have got much worse recently
=========

I agree. I haven't actually seen a single lhc type WU turned in by me today with any time reported, or at most just a few seconds ... I'm trying to get to the ones I downloaded yesterday to see if they're any better, but I have a few hours to go yet before I get into them ...
ID: 6259
Vid Vidmar*
Joined: 28 Sep 04
Posts: 27
Credit: 17,091
RAC: 0
Message 6261 - Posted: 2 Mar 2005, 11:44:49 UTC

If nothing changes for the better in a day or so, I am suspending this project until this is fixed.

ID: 6261
sysfried
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 6287 - Posted: 2 Mar 2005, 20:55:21 UTC - in response to Message 6261.  

> If nothing changes for the better in a day or so, I am suspending this project
> until this is fixed.
>
>
Good point. An update from the LHC admins would be much appreciated.... If they don't know the reason, they could tell us to suspend until it is fixed....
ID: 6287
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6293 - Posted: 2 Mar 2005, 22:10:07 UTC

Just some news ... Last week someone asked me a question about processing times, and to give them facts instead of rumors I took the log files that BOINC View writes and put the data into a really ugly table (from a relational database perspective).

Anyway, I am too messed up today to do anything hard (for me relational databases are way too easy stuff), so I decided to look into the 0 second problem.

What I learned:

1) I have 396 results with 0 length run times. Of these, I know some of the recent ones have had the 10 hour runtimes, because I watched them running and they were well into a high run time.

2) In the same time frame I have had 3141 returns with non-zero final times. Failures are therefore about a 12% rate.

3) There is no way for me to tell from the numbers in the logs what the "size" of the model was. Since we have a spread of 10,000; 100,000; and 1,000,000 turn work units, theory says that the distribution would be over the three types. My informal observation says that the failures occur with the larger run time models rather than the shorter ones.

4) Application versions include 4.45, 4.46, 4.47, 4.03, and 4.64

5) The errors seem to cover the entire period for which I have data, starting roughly September of last year. This seems to be an indication that this is a long-standing problem.

Questions:
1) Is there a way to distinguish the programmed run length from the Work Unit or Result Name?

2) Does anyone else have these log files available?

ID: 6293
sysfried
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 6301 - Posted: 3 Mar 2005, 7:50:26 UTC - in response to Message 6293.  

Nice approach to the problem.... I don't have BOINC View, but I think I have something that helps you.....

http://lhcathome.cern.ch/result.php?resultid=186717
Name v64lhc88-43s10_12545_1_sixvf..... (100,000 Turns)

http://lhcathome.cern.ch/result.php?resultid=186408
Name v64boince6ib1-13s4_6630_1_sixvf..... (1,000,000 Turns)
>
> Questions:
> 1) Is there a way to distinguish the programmed run length from the Work Unit
> or Result Name?
>
> 2) Does anyone else have these log files available?
>
>
ID: 6301
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6304 - Posted: 3 Mar 2005, 9:19:28 UTC
Last modified: 3 Mar 2005, 9:23:28 UTC

2) in the same time frame I have had 3141 returns with non-zero final times. Failures are therefore about a 12% rate.
==========

@Paul

No offense Paul, but I think you're way off base on that 12% failure figure, especially since LHC has come back online the last few weeks. I just turned in 23 v64lhc type WU's off 1 of my PC's that were run over a 17 hour period and ...

16 showed no Time at all ...
3 showed 10 seconds or less...
3 showed the correct running time...

I've been seeing this sort of results for 2 weeks now on all my PC's, so if you figure the current failure rate to turn in the correct Time Result it's more like 85%-90% ...

Actually I have better results with the v64boince type WU's, now the failure rate for correct time on them is probably around 12% ... :)
ID: 6304
The Gas Giant
Joined: 2 Sep 04
Posts: 309
Credit: 715,258
RAC: 0
Message 6305 - Posted: 3 Mar 2005, 9:28:27 UTC

A general question. Is this problem occurring to anyone who is running LHC 100% and is not stopping and restarting BOINC?

I know I am resource sharing since LHC restarted (even during the alpha test) and during the alpha test I did not get the problem but since then I have reported 0 cpu time at least 75 to 80% of the time (Intel 3.2GHz, HT on, XP, BOINC Manager V4.24 other project Einstein, Pred and Seti).

Shame the science is good, we might get somewhere on the credit issue then.

Starting to think about reducing resource share.

Live long and crunch.


Paul
(S@H1 8888)
BOINC/SAH BETA
ID: 6305
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6306 - Posted: 3 Mar 2005, 9:50:23 UTC
Last modified: 3 Mar 2005, 9:51:41 UTC

A general question. Is this problem occurring to anyone who is running LHC 100% and is not stopping and restarting BOINC?
==========

I'm running 7 PC's 24/7 exclusively here at the LHC Site Giant, all P4 HT CPU's in the 3.06 to 3.4 range. The problem occurs on all of them ... The v64boince type WU's seem to return the Time most of the time but I have sat here and watched the time drop from 10 hours to like 23 minutes when the WU was finished.

Now the v64lhc type WU's are totally borked when it comes to turning in the correct time, it's time to pop open the champagne bottle when 1 does actually turn in the correct time ... hehe

ID: 6306
Harri Liljeroos
Joined: 28 Sep 04
Posts: 677
Credit: 43,745,336
RAC: 15,254
Message 6307 - Posted: 3 Mar 2005, 10:03:26 UTC - in response to Message 6305.  

> A general question. Is this problem occurring to anyone who is running LHC
> 100% and is not stopping and restarting BOINC?
>
> I know I am resource sharing since LHC restarted (even during the alpha test)
> and during the alpha test I did not get the problem but since then I have
> reported 0 cpu time at least 75 to 80% of the time (Intel 3.2GHz, HT on, XP,
> BOINC Manager V4.24 other project Einstein, Pred and Seti).
>
> Shame the science is good, we might get somewhere on the credit issue then.
>
> Starting to think about reducing resource share.
>
> Live long and crunch.
>
>
>
I now have a situation where LHC is running 100% 24/7 on one of my hosts because it ran out of Seti WU's yesterday morning. Since then it has finished 34 LHC WU's.

14 WU's: 0 sec
5 WU's: 1...25 sec
3 WU's: 2...30 min
10 WU's: 30...60 min
2 WU's: 7 h 21 min

It's a 3.06 GHz Xeon, Win2000, CC 4.19, sixtrack 4.64
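
A quick way to sanity-check a breakdown like this is to total the buckets and print each one's share. The minimal Python sketch below uses only the counts posted above; the bucket labels are shorthand chosen for illustration, and it makes no claim about which of the non-zero times are correct.

```python
# Counts per reported-CPU-time bucket, copied from the post above.
# The labels are shorthand chosen for this sketch.
buckets = {
    "0 sec": 14,
    "1-25 sec": 5,
    "2-30 min": 3,
    "30-60 min": 10,
    "7 h 21 min": 2,
}

total = sum(buckets.values())

# Print each bucket with its share of all finished work units.
for label, count in buckets.items():
    print(f"{label:<12}{count:>3}  {count / total:.0%}")

print(f"{'total':<12}{total:>3}")
```

The total comes to 34, which matches the number of finished WU's quoted in the post.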
ID: 6307
Vid Vidmar*
Joined: 28 Sep 04
Posts: 27
Credit: 17,091
RAC: 0
Message 6308 - Posted: 3 Mar 2005, 10:05:25 UTC

That's it... Have a look:

http://lhcathome.cern.ch/results.php?userid=3120

I'm suspending this project until this zero time error is fixed. What I can't understand is why the project managers didn't set the validator to award max. credit to 0 time results if there is a >0 time result received for the same WU.
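
The rule suggested here could look roughly like the Python sketch below. It is only an illustration of the policy, not the BOINC validator or its API: the function name `granted_credit`, the input layout, and the example numbers are all hypothetical.

```python
def granted_credit(claims):
    """Return the credit per result for one work unit under the proposed rule.

    `claims` maps a result id to (cpu_seconds, claimed_credit).
    A result with zero CPU time gets the highest credit claimed by any
    result with non-zero CPU time. Other results keep their own claim.
    If no result has non-zero CPU time, every claim is left unchanged.
    """
    nonzero = [credit for cpu, credit in claims.values() if cpu > 0]
    if not nonzero:
        return {rid: credit for rid, (cpu, credit) in claims.items()}
    best = max(nonzero)
    return {
        rid: (best if cpu == 0 else credit)
        for rid, (cpu, credit) in claims.items()
    }


# Hypothetical work unit with three results; the numbers are made up.
example = {
    "result_a": (0.0, 0.0),
    "result_b": (26460.0, 41.5),
    "result_c": (25000.0, 39.0),
}
print(granted_credit(example))
```

A real validator would also have to decide whether the zero-time result is scientifically valid before granting anything; this sketch only covers the credit substitution.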

ID: 6308
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6312 - Posted: 3 Mar 2005, 15:11:39 UTC - in response to Message 6304.  

Poorboy,

> No offense Paul, but I think you're way off base on that 12% failure figure,

I'm autistic, remember? I don't even notice real insults. :)

> especially since LHC has come back online the last few weeks. I just turned in
> 23 v64lhc type WU's off 1 of my PC's that were run over a 17 hour period and
> ...
>
> 16 showed no Time at all ...
> 3 showed 10 seconds or less...
> 3 showed the correct running time...

Well, all I can say is that my data covers a lot more results than that, going back to September of last year.

The discriminator I used was greater than 1 or less than 1, so basically I tested for 0 second results. But to be fully consistent I will use equal to 0 and not equal to 0 for testing ... and since someone has indicated that there is a name discriminator, I can now use that to group the results as well...
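
As a rough illustration of that grouping, here is a minimal Python sketch. It assumes each log row can be reduced to a (result name, CPU seconds) pair, which is not something the thread confirms about the BOINC View log format. The two prefixes come from the result names posted earlier in the thread; the helper names `batch_of` and `tally` and the sample rows are made up.

```python
from collections import Counter


def batch_of(result_name: str) -> str:
    """Return a coarse batch label based on the result name prefix."""
    for prefix in ("v64lhc", "v64boince"):
        if result_name.startswith(prefix):
            return prefix
    return "other"


def tally(results):
    """Count (batch, is_zero_time) pairs for (name, cpu_seconds) rows."""
    counts = Counter()
    for name, cpu_seconds in results:
        # Test for exactly zero, as described above.
        counts[(batch_of(name), cpu_seconds == 0)] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = [
    ("v64lhc_example_a", 0.0),
    ("v64lhc_example_b", 0.0),
    ("v64lhc_example_c", 5400.0),
    ("v64boince_example_a", 36000.0),
    ("v64boince_example_b", 0.0),
]

for (batch, is_zero), n in sorted(tally(sample).items()):
    print(batch, "zero" if is_zero else "non-zero", n)
```

Grouping this way would show whether the zero-time results cluster in one batch, which is what the other posters are reporting.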

> I've been seeing this sort of results for 2 weeks now on all my PC's, so if
> you figure the current failure rate to turn in the correct Time Result it's
> more like 85%-90% ...

So, with a total of 3141 non-zero returns and 396 with 0, well, you are right, I divided 396 by 3141 when it should have been 396 ÷ 3537 ≈ 11% ...
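
For anyone who wants to check the arithmetic, a minimal Python sketch using only the two counts above (the variable names are just for illustration):

```python
# Counts taken from the post above.
zero_time = 396      # results reported with 0 s of CPU time
nonzero_time = 3141  # results reported with a non-zero CPU time

total = zero_time + nonzero_time

# Zero-time results as a share of all results.
failure_rate = zero_time / total

# The earlier figure divided by the non-zero count only.
earlier_ratio = zero_time / nonzero_time

print(f"total results: {total}")
print(f"failure rate:  {failure_rate:.1%}")
print(f"earlier ratio: {earlier_ratio:.1%}")
```

The difference between the two denominators is small here because zero-time results are a minority of the whole data set.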

> Actually I have better results with the v64boince type WU's, now the failure
> rate for correct time on them is probably around 12% ... :)

I did not try to look only at the latest period, especially when the data indicates that the problem seems to be older than *I* expected. Even the short-run WU's should show a low, but not zero, runtime.
ID: 6312