Message boards : Number crunching : '0 cpu' problem solved?
Profile Contact
Joined: 17 Sep 04
Posts: 53
Credit: 1,752,270
RAC: 1,601
Message 7200 - Posted: 26 Apr 2005, 3:20:25 UTC

Crunchin’ LHC has become a treat.
New v64lhc units now almost always give credit and resume properly after being suspended.
Big news!
Cheers to all.

ID: 7200
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 7286 - Posted: 28 Apr 2005, 12:34:43 UTC

I just did a scan back ... it looks like it is still with us ...

For example:
97454
93563
89767
85772
84737

I am not sure all qualify as being a problem for sure...

But, the rate "seems" lower to me ...

Dr. Anderson said his fix would work if either end was done, I wonder though if both ends need to be done to kill all the problems ...

ID: 7286
Profile adrianxw

Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 7287 - Posted: 28 Apr 2005, 12:43:24 UTC

I didn't look at many, but at least one of those results you link to was crunched with a 4.64 client which does not have the fix in.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 7287
Profile littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 7288 - Posted: 28 Apr 2005, 12:50:01 UTC - in response to Message 7286.  
Last modified: 28 Apr 2005, 13:06:43 UTC

Hi Paul,
let me add a few notes of my own:

> I just did a scan back ... it looks like it is still with us ...
>
> For example:
> 97454
> 93563
> 89767
-----
These are s16_ or s18_ = 'fast WUs', as Chrulle described in another post: they have a bigger amplitude at the beginning ('bang') of the simulation and could abort.

> 85772
-----
This is a real 0 CPU time example, and it is the opposite of PoorBoy's 'miracle'. 2 have reported 0 and 1 has claimed something -> median is zero = 0.00 granted credits.
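
A minimal sketch of the median rule described here, assuming the granted credit is simply the median of the credits claimed for a work unit. The function name and the 3.2 claim are made up for illustration; this is not the project's actual validator code.

```python
from statistics import median

def granted_credit(claimed_credits):
    """Return the median of the claimed credits (assumed granting rule)."""
    return median(claimed_credits)

# Two results report 0 CPU time and therefore claim 0 credit;
# the third claims a hypothetical 3.2 credits.
claims = [0.0, 0.0, 3.2]
print(granted_credit(claims))  # 0.0
```

With two zero claims out of three, the middle value is zero, so even the host that did real work is granted nothing.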

> 84737
-----
This one is the worst case; I can't explain it. I myself have about 12 such results with pending credits (~1000)!
A multiple problem: a '0 CPU time' problem and a validation problem.

>
> I am not sure all qualify as being a problem for sure...
>
> But, the rate "seems" lower to me ...
-----
Yes
>
> Dr. Anderson said his fix would work if either end was done, I wonder though
> if both ends need to be done to kill all the problems ...
>
>

greetz littleBouncer
ID: 7288
Profile adrianxw

Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 7291 - Posted: 28 Apr 2005, 13:44:31 UTC

The zero CPU time result in 84737 was crunched with 4.64, which does not have the fix in it, so there is no reason to be surprised that it has a zero CPU time.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 7291
Profile littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 7292 - Posted: 28 Apr 2005, 14:03:16 UTC - in response to Message 7286.  
Last modified: 28 Apr 2005, 15:21:32 UTC

@ Paul

Of the 120 WUs processed since yesterday, there were:
s16_ : 23 WUs, 6 with 0 CPU-time
s18_ : 15 WUs, 8 with 0 CPU-time

IMO: most of them were what I call 'fast WUs', which have nothing to do with the '0 CPU-time problem'.

0 CPU-time:
only 1 (s14_, real) WU out of 120 = ~1%
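
The arithmetic behind these counts can be checked with a short sketch. The numbers are the ones reported above; the variable and function names are only illustrative.

```python
# Counts as reported in the post above.
TOTAL_WUS = 120
# marker -> (WUs processed, WUs with 0 CPU time)
FAST_BATCHES = {"s16_": (23, 6), "s18_": (15, 8)}
REAL_ZERO_CPU = 1  # the single s14_ case

def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

for marker, (count, zero) in FAST_BATCHES.items():
    print(f"{marker}: {zero}/{count} = {percent(zero, count):.1f}% with 0 CPU time")

# 0 CPU time results that fall into the 'fast WU' batches.
fast_zero = sum(zero for _, zero in FAST_BATCHES.values())
print(f"fast WUs with 0 CPU time: {fast_zero}")
print(f"real 0 CPU time rate: {percent(REAL_ZERO_CPU, TOTAL_WUS):.2f}%")
```

The single unexplained case works out to a little under 1% of the 120 WUs, which is where the ~1% comes from.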

greetz littleBouncer
ID: 7292
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 7297 - Posted: 28 Apr 2005, 15:44:14 UTC

Yeah, I forgot when the fix went in ...

But, the other ones ... I am not sure if they are truly fast or not. Usually, even the fast ones get a couple seconds ...

To be honest, maybe the application should be changed to record a second as soon as the WU starts. So, if the WU does die within the first second it still gets a nod ... of course, if there is data in the output Result Data File, this may also be detectable.

But, it still looks like I have at LEAST one that cannot be explained correctly.

The problem looks like it is better, I am just unconvinced that it is completely cured. That is all I am saying.

85772 does look like it still exposes a problem with the credit calculations, with only one result reporting time and a claim ...
ID: 7297
Profile Contact
Joined: 17 Sep 04
Posts: 53
Credit: 1,752,270
RAC: 1,601
Message 7385 - Posted: 2 May 2005, 2:24:57 UTC

Spent a bit o’ time rummaging through the stats and I can’t find any clients > 4.3x that report valid with 0 credit in the last few days.
Could be wrong…but fer sure this complaint is no longer a reason to turn your back on this project.
Full steam ahead!

ID: 7385
Profile Alex

Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 7386 - Posted: 2 May 2005, 4:10:58 UTC - in response to Message 7385.  

I've seen only a couple in the last 40

http://lhcathome.cern.ch/workunit.php?wuid=146969

http://lhcathome.cern.ch/workunit.php?wuid=142825

http://lhcathome.cern.ch/workunit.php?wuid=142820

http://lhcathome.cern.ch/workunit.php?wuid=131781


I'm not the LHC Alex. Just a number cruncher like everyone else here.
ID: 7386
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 7390 - Posted: 2 May 2005, 13:50:04 UTC - in response to Message 7386.  

> I've seen only a couple in the last 40

When I uploaded my logs for last week I got an additional 40-50 increase.

We also seem to have overloaded their ability to handle the load ... :(
ID: 7390
Hans Sveen

Joined: 2 Sep 04
Posts: 22
Credit: 4,038,144
RAC: 0
Message 7559 - Posted: 9 May 2005, 23:38:40 UTC
Last modified: 9 May 2005, 23:39:24 UTC

Hello!
Oh no, I don't think so; take a look at this WU:
http://lhcathome.cern.ch/result.php?resultid=1107178

And it is not only me that got this one!


Hans Sveen
Oslo, Norway


ID: 7559
Profile JigPu

Joined: 1 Sep 04
Posts: 26
Credit: 600,998
RAC: 0
Message 7560 - Posted: 9 May 2005, 23:59:17 UTC - in response to Message 7559.  
Last modified: 9 May 2005, 23:59:44 UTC

> Hello!
> Oh no, I dont think so, take a look at this WU:
> http://lhcathome.cern.ch/result.php?resultid=1107178
>
> And it is not only me that got this one!
>
The WU you are talking about is v64D1D2MQonlyinjnoskew1b5offcomp-60s16_18525.9615_1_sixvf_45886_4 (notice the s16_ in the name)

As littleBouncer said best....
>These are s16_ or s18_ = 'fast WUs', as Chrulle described in another post: they
>have a bigger amplitude at the beginning ('bang') of the simulation and could abort.
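
A rough sketch of how one might flag such 'fast WUs' from the name alone. It assumes the s16_ / s18_ marker always appears verbatim in the work unit name, which is only inferred from the example in this thread; the helper name and the s14_ test name are hypothetical.

```python
# Markers that littleBouncer associates with 'fast WUs'.
FAST_MARKERS = ("s16_", "s18_")

def is_fast_wu(wu_name):
    """Return True if the work unit name contains a 'fast WU' marker."""
    return any(marker in wu_name for marker in FAST_MARKERS)

# Name taken from the post above.
name = "v64D1D2MQonlyinjnoskew1b5offcomp-60s16_18525.9615_1_sixvf_45886_4"
print(is_fast_wu(name))  # True

# Hypothetical s14_ name, made up for contrast.
print(is_fast_wu("v64example-60s14_1.0_1_sixvf_1_0"))  # False
```

A plain substring check is enough here; a stricter parser would need the actual naming scheme, which is not spelled out in this thread.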

:)

Puffy
ID: 7560
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 7573 - Posted: 10 May 2005, 14:38:03 UTC

Yes, but I still wish they would put one second on the clock so that we know it was a failure of the simulation, and not a failure of the checkpointing.
ID: 7573

©2024 CERN