'0 cpu' problem solved?
Author | Message |
---|---|
Joined: 17 Sep 04 Posts: 53 Credit: 1,752,270 RAC: 1,601 |
|
Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
I just did a scan back ... it looks like it is still with us ... For example: 97454 93563 89767 85772 84737 I am not sure all qualify as being a problem for sure... But, the rate "seems" lower to me ... Dr. Anderson said his fix would work if either end was done, I wonder though if both ends need to be done to kill all the problems ... |
Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0 |
I didn't look at many, but at least one of those results you link to was crunched with a 4.64 client which does not have the fix in. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Joined: 23 Oct 04 Posts: 358 Credit: 1,439,205 RAC: 0 |
Hi Paul, let me add a few notes: > I just did a scan back ... it looks like it is still with us ... > For example: > 97454 > 93563 > 89767 ----- These are s16_ or s18_ WUs, i.e. 'fast WUs'. As Chrulle described in another post, they have a bigger amplitude at the beginning ('bang') of the simulation and can abort. > 85772 ----- This is a real 0 CPU time example, and it is the opposite of PoorBoy's 'miracle'. Two results reported 0 and one claimed something -> the median is zero = 0.00 granted credits. > 84737 ----- This one is the worst case; I can't explain it. I have about 12 such results myself with pending credits (~1000)! It is a combined problem: a '0 CPU time' problem and a validation problem. > I am not sure all qualify as being a problem for sure... > But, the rate "seems" lower to me ... ----- Yes. > Dr. Anderson said his fix would work if either end was done, I wonder though if both ends need to be done to kill all the problems ... greetz littleBouncer |
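The median arithmetic described for 85772 can be sketched as follows. This is only an illustration, assuming (as the post implies) that granted credit is the median of the claimed credits; the function name `granted_credit` and the claim value 12.5 are made up for the example and are not taken from the workunit.

```python
from statistics import median

def granted_credit(claims):
    """Return the median of the claimed credits (assumed granting rule)."""
    return median(claims)

# Two results reported 0 CPU time and so claimed 0; one claimed something.
# 12.5 is a hypothetical claim value used only for illustration.
claims_85772 = [0.0, 0.0, 12.5]
print(granted_credit(claims_85772))
```

With two zero claims out of three, the middle value is zero regardless of how large the third claim is, which matches the 0.00 granted credits reported above.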
Joined: 29 Sep 04 Posts: 187 Credit: 705,487 RAC: 0 |
The zero CPU time result in 84737 was crunched with 4.64, which does not have the fix in it; no reason to be surprised that it has a zero CPU time. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Joined: 23 Oct 04 Posts: 358 Credit: 1,439,205 RAC: 0 |
@Paul: Of the 120 WUs processed since yesterday there were: s16_: 23 WUs, 6 with 0 CPU time; s18_: 15 WUs, 8 with 0 CPU time. IMO most of these were what I call 'fast WUs' and have nothing to do with the '0 CPU time' problem. Real 0 CPU time: only 1 WU (an s14_) out of 120 = ~1%. greetz littleBouncer |
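The rates in this tally can be checked with a few lines of Python. The counts are copied from the post; the variable names and the dictionary layout are only illustrative.

```python
# Counts copied from the post above; the dictionary layout is only for illustration.
total_wus = 120
fast_batches = {"s16_": (23, 6), "s18_": (15, 8)}  # tag -> (WUs, WUs with 0 CPU time)
real_zero_cpu = 1  # the single s14_ result counted as a real 0 CPU time case

def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

for tag, (count, zero) in fast_batches.items():
    print(f"{tag} {zero}/{count} = {percent(zero, count):.1f}% with 0 CPU time")

print(f"real 0 CPU time: {real_zero_cpu}/{total_wus} = {percent(real_zero_cpu, total_wus):.1f}%")
```

One result out of 120 is about 0.8%, which the post rounds to ~1%.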
Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
Yeah, I forgot when the fix went in ... But the other ones ... I am not sure if they are truly fast or not. Usually, even the fast ones get a couple of seconds ... To be honest, maybe the application should be changed to record a second as soon as the WU starts. Then, if the WU does die within the first second, it still gets a nod ... Of course, if there is data in the output Result Data File, this may also be detectable. But it still looks like I have at LEAST one that cannot be explained correctly. The problem looks like it is better; I am just unconvinced that it is completely cured. That is all I am saying. 85772 does look like it still exposes a problem with the credit calculation, with only one result having time and a claim ... |
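The "record a second as soon as the WU starts" suggestion could look roughly like this. It is a hypothetical sketch, not the project's application code; the function name `reported_cpu_seconds` and the `started` flag are assumptions made for the example.

```python
def reported_cpu_seconds(measured_seconds, started):
    """Hypothetical reporting rule: a WU that started never reports less than 1 s."""
    if not started:
        # The WU never ran, so there is nothing to report.
        return 0.0
    # Floor the reported time at one second once the WU has started.
    return max(measured_seconds, 1.0)

print(reported_cpu_seconds(0.0, started=True))    # fast WU that aborted in the first second
print(reported_cpu_seconds(5400.0, started=True)) # normal WU keeps its measured time
```

Under such a rule, a reported 0 would no longer be ambiguous: a fast WU that aborted early would show 1 second, while a true 0 would point at lost CPU time accounting.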
Joined: 17 Sep 04 Posts: 53 Credit: 1,752,270 RAC: 1,601 |
|
Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0 |
I've seen only a couple in the last 40 http://lhcathome.cern.ch/workunit.php?wuid=146969 http://lhcathome.cern.ch/workunit.php?wuid=142825 http://lhcathome.cern.ch/workunit.php?wuid=142820 http://lhcathome.cern.ch/workunit.php?wuid=131781 I'm not the LHC Alex. Just a number cruncher like everyone else here. |
Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
> I've seen only a couple in the last 40 When I uploaded my logs for last week I got an additional 40-50 increase. We also seem to have overloaded their ability to handle the load ... :( |
Joined: 2 Sep 04 Posts: 22 Credit: 4,038,144 RAC: 0 |
Hello! Oh no, I don't think so; take a look at this WU: http://lhcathome.cern.ch/result.php?resultid=1107178 And it is not only me that got this one! Hans Sveen Oslo, Norway |
Joined: 1 Sep 04 Posts: 26 Credit: 600,998 RAC: 0 |
> Hello! > Oh no, I don't think so, take a look at this WU: > http://lhcathome.cern.ch/result.php?resultid=1107178 > And it is not only me that got this one! The WU you are talking about is v64D1D2MQonlyinjnoskew1b5offcomp-60s16_18525.9615_1_sixvf_45886_4 (notice the s16_ in the name). As littleBouncer said best.... > These are s16_ or s18_ = 'fast WUs', as Chrulle described in another post: They have a bigger amplitude at the beginning ('bang') of the simulation and could abort. :) Puffy |
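Spotting the tag in a WU name can be done mechanically. The sketch below assumes the tag always appears as `s` followed by digits and an underscore, which holds for the name quoted above but is not a documented naming rule; `FAST_TAGS` and `fast_tag` are names invented for the example.

```python
import re

# Tags the thread refers to as 'fast WUs'.
FAST_TAGS = {"s16_", "s18_"}

def fast_tag(wu_name):
    """Return the first s16_/s18_ tag found in the WU name, or None."""
    for tag in re.findall(r"s\d+_", wu_name):
        if tag in FAST_TAGS:
            return tag
    return None

name = "v64D1D2MQonlyinjnoskew1b5offcomp-60s16_18525.9615_1_sixvf_45886_4"
print(fast_tag(name))
```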
Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
Yes, but I would still wish that they would put one second on the clock so that we know that it was a failure of the simulation, and not a failure of the checkpointing. |
©2024 CERN