21) Message boards : Number crunching : Stil a pending credit (Message 18621)
Posted 20 Nov 2007 by EclipseHA
Post:

Here's one from 2005 that was granted credit:

http://lhcathome.cern.ch/lhcathome/result.php?resultid=694330

If you try to look at the WU, you get an error that says "workunit not found"


There is a difference between pending and orphaned. That resultID is an orphan, which is a different issue. This thread is speaking specifically about the 0.00xxx claimed credit units. The host 80808, which I assume is your host, has only 4 such results, all from October of this year. You will note that they are pending.

One should also take care to note that there is a difference between a result being orphaned and a result not being assimilated and purged. If a workunit header is found and the results come up along with it and the results have been validated and issued credit, that is a result that hasn't been through the assimilation process.

Like I said, I know for some that want to clean up their hosts, a delete is the "quick and easy" way from their perspective. From a scientific perspective, it should be in a different order. The way I'd do it?


  • Orphan ResultID: For every result that does not have a matching WU header, delete Result.
  • Validated WU that hasn't been Assimilated: Look at assimilator code to find out the cause and correct the cause.
  • 0.00xx claimed credit results that aren't validating: Look at validator code to find the cause and correct the cause.
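
As a rough illustration only, the three cases in the list above could be sorted with a small triage function. This is a sketch under stated assumptions: the `Result` fields (`workunit_found`, `validated`, `assimilated`, `claimed_credit`) and the 0.01 credit threshold are hypothetical names chosen for the example, not the actual BOINC server schema or code.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # All fields are hypothetical stand-ins, not real BOINC database columns.
    result_id: int
    workunit_found: bool   # does a matching WU header exist?
    validated: bool        # has the validator looked at it?
    assimilated: bool      # has the assimilator processed it?
    claimed_credit: float


def triage(r: Result) -> str:
    """Return the suggested action for one result, checked in the order of the list."""
    if not r.workunit_found:
        return "delete result (orphan)"
    if r.validated and not r.assimilated:
        return "investigate assimilator"
    if not r.validated and r.claimed_credit < 0.01:
        return "investigate validator"
    return "no action"


if __name__ == "__main__":
    # Made-up sample rows, one per case.
    samples = [
        Result(1, workunit_found=False, validated=True, assimilated=False, claimed_credit=5.0),
        Result(2, workunit_found=True, validated=True, assimilated=False, claimed_credit=5.0),
        Result(3, workunit_found=True, validated=False, assimilated=False, claimed_credit=0.004),
    ]
    for s in samples:
        print(s.result_id, triage(s))
```

The orphan check comes first on purpose: without a WU header there is nothing for the validator or assimilator to work on, so only the delete applies.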



Of course, all of this might be taken care of by the BOINC server-side code upgrade that still needs to be performed...

IMO, YMMV, etc, etc, etc...



I understand that there are 2 different causes to the "zombie" WU's.. I've got about 30 of the 0.00xx pending, and maybe 20 of the "long lost granted" WU's (and two that errored out long ago...).

In both cases, however, it does indicate a cleanup is in order!
22) Message boards : Number crunching : Stil a pending credit (Message 18610)
Posted 18 Nov 2007 by EclipseHA
Post:

Hey, I got some of these 0.00xxx results from April.. If they're not in the DB now, they never will be!

In total, I got about 50 of these 0.00xx results, and some that were actually granted real credit back in 2005!

Housekeeping time!


If the results are listed as pending, this means that the validators have never looked at them. Since the validators never looked at them, they cannot have been fed to the assimilation process to be inserted into the science database.

Your computers are hidden, so I can't verify your claim as to results having been issued credit. If so, then you could check whether everyone else that was assigned that same WU has been validated / invalidated. Any pending status will probably hold up the process that handles transitioning / assimilation / purging.

As I said before, even though the tasks failed very quickly on our machines, the data collected still could be of some worth. It would be far better to get the results to pass through the system the "normal" way than to just flat out delete them. I understand it would be easier for some to have them deleted, particularly those who wish to merge / delete hosts.


Here's one from 2005 that was granted credit:

http://lhcathome.cern.ch/lhcathome/result.php?resultid=694330

If you try to look at the WU, you get an error that says "workunit not found"


23) Message boards : Number crunching : Stil a pending credit (Message 18603)
Posted 17 Nov 2007 by EclipseHA
Post:

I now have 31 pending and all are showing the above problem (0.00xxxxxx claim).
Can these be validated or killed please?


I thought about suggesting zapping them all a few weeks ago, but then the thought occurred to me that the results may need to be inserted into the science database. Even if the result indicates an instantaneous containment failure, that needs to be known so that it doesn't happen in the real-world application...

So, Neasan and Alex, what's up? Is someone working on getting the server upgrade performed? Would this by chance be some old validator code?

Brian



Hey, I got some of these 0.00xxx results from April.. If they're not in the DB now, they never will be!

In total, I got about 50 of these 0.00xx results, and some that were actually granted real credit back in 2005!

Housekeeping time!
24) Message boards : Number crunching : Stil a pending credit (Message 18546)
Posted 4 Nov 2007 by EclipseHA
Post:
Actually, I don't care if I ever see the "credit" from the (now about 30) 0.00xxxx pending WU's, but I'd really like to see them vanish from my pending credit list!

Same with the WU's that were granted credit as far back as 2005!
25) Message boards : Number crunching : Stil a pending credit (Message 18530)
Posted 3 Nov 2007 by EclipseHA
Post:
With the WU's available over the last couple of weeks, I now have about 20 "pending", where the claimed credit is of the 0.00xxxx variety.

26) Message boards : Number crunching : Stil a pending credit (Message 18314)
Posted 19 Oct 2007 by EclipseHA
Post:
While current (completed) results seem to be purged after a few days, good results that are months old are still hanging around.

I have some from 2005.

Also, pending from this last run, as well as April 2007, where there was a credit claim of 0. (marked as "pending").
27) Message boards : Number crunching : Initial Replication (Message 18225)
Posted 17 Oct 2007 by EclipseHA
Post:
Other than the fact that some WU's may get crunched 2 more times than needed (with credit granted), I'm not sure where this is causing harm. Sure you're using electricity, but it's up to the project.

People have been complaining about "lack of work" here for years, and to cut IR from 5 to 3 means that there's 40% less work right off the bat.
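
For what it's worth, the 40% figure is just the fractional drop in copies sent out per workunit, (5 - 3) / 5. A minimal sketch of that arithmetic follows; the function name `work_reduction` is made up for the example, and it assumes the number of workunits stays the same.

```python
def work_reduction(old_ir: int, new_ir: int) -> float:
    """Fraction of tasks that disappear when initial replication drops."""
    return (old_ir - new_ir) / old_ir


if __name__ == "__main__":
    # Going from 5 copies per workunit to 3.
    print(f"{work_reduction(5, 3):.0%}")
```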

Right now, today, LHC has taken some measures to keep work in the pipeline longer - the 2/day/cpu, the 1h delay, etc. with the press release and all.

I think we should all just step back and be happy that there has been a flow of work (be it 2/day) for the longest time I've seen in years.

If you don't like the way the project is being managed, speak with your feet and crunch for another project.
28) Message boards : Number crunching : Stil a pending credit (Message 17582)
Posted 28 Jul 2007 by EclipseHA
Post:
WU's pending are not the only problem. The number of results and workunits awaiting deletion is enormous. This is preventing everyone from deleting hosts that are no longer active or will not merge.

All in all, a good general housekeeping is in order.



Good point about the hosts.. I don't have any extras right now, but I got a boatload of ghost WUs in my history!

Time for housecleaning, I agree!

29) Message boards : Number crunching : Stil a pending credit (Message 17555)
Posted 26 Jul 2007 by EclipseHA
Post:
I agree - time to clean up the database. I got WUs from April which will be pending forever.

Once the test batch is over, it's time to tidy up!
30) Message boards : Number crunching : Can't Access Work Units (Message 17506)
Posted 23 Jul 2007 by EclipseHA
Post:
OK just to add we have it sorted that we can get this fixed quicker if it happens again but it shouldn't as we've solved the problem.


When, exactly, did you solve the problem?



I think it's safe to say they "thought" they solved the problem! :)

That's just an example why this test run could do some good!

It is kind of interesting that ~10% of the work in the test batch is now queued for retransmission after only a couple days - seems like a high error rate to me!

(another example why the test run could do some good!)
31) Message boards : Number crunching : Can't Access Work Units (Message 17503)
Posted 23 Jul 2007 by EclipseHA
Post:
Seems the server shut itself down again...

Right now, the stats on the main page show 1899 WUs available, but I get "no work available" when trying to get some....
32) Message boards : Number crunching : Bad thread priority (Message 17490)
Posted 22 Jul 2007 by EclipseHA
Post:
"Sluggish" could also be due to more than the thread priority - for example, page faults. What's your memory usage look like?

I have had this problem with other projects, and I'm absolutely sure it got working right after I manually changed the priority.


Do you know the tool you're using is giving you the right information? (I REALLY doubt that any BOINC related thread is running at "realtime" under windows)

Do you understand what changing the priority really means to your whole system?

Why are you the only one having this problem with LHC and other projects?

Sounds like there's something weird with your system, and you're just using a bandaid.....


Just my opinion
33) Message boards : Number crunching : Bad thread priority (Message 17474)
Posted 21 Jul 2007 by EclipseHA
Post:
"Sluggish" could also be due to more than the thread priority - for example, page faults. What's your memory usage look like?
34) Message boards : Number crunching : Even if there's no "real work", may I suggest a "real test"? (Message 17385)
Posted 19 Jul 2007 by EclipseHA
Post:
Seems that "outstanding work" is still at ~35 days after 1000 test WU's were released. Could it be the "ghost" problem as the numbers haven't changed much today?

I think doing a real load test would be a good thing for this project. It's no more "wasted cycles" than SETI, which seems to crunch the same data over and over, and in fact would be good for this project.

New servers in a new location on the net, and only testing with enough work to last less than 10 minutes and takes days to come back isn't really a valid test for when a real dump of work becomes available, IMHO.

Now's the time to really test the infrastructure, and not when real data might be lost/delayed.
35) Message boards : Number crunching : Even if there's no "real work", may I suggest a "real test"? (Message 17366)
Posted 16 Jul 2007 by EclipseHA
Post:
Much has changed since there was real work in the pipeline - the servers have moved, looks like project directories have changed, etc...

There have been a couple of small bursts of WU's for testing, but, as many people have the resource for LHC set really high, when work is available, it goes to only a few machines.

As a result, not many of the clients have been tested, and there probably hasn't been that much of a load on the servers. Last time there was "real work" most of the WU's I got were ghosts.. (like 15 of 20)

How about using old data for a real test run before new data hits the pipeline? I'm thinking on the order of 100K WUs.

Work out stuff now before real data is compromised, is all I'm suggesting... (on both the client and server end)
36) Message boards : Number crunching : Few test jobs in flight! (Message 17083)
Posted 22 Jun 2007 by EclipseHA
Post:
With only 56 WU's issued today, I'll bet that <5 hosts got them all..

More like 2-3 hosts.

You can tell as they're being completed VERY slowly.

Is this really a valid test, as it's such a small subset of the user base?

Maybe issue 50,000 OLD WU's and run the system through its paces before new data is crunched? That might almost be a valid test.. 50 sure isn't!
37) Message boards : Number crunching : How do we (Message 16916)
Posted 18 May 2007 by EclipseHA
Post:
The idea is that to migrate you will have to do nothing!

Technically we will change the cern.ch DNS entry to point to an IP address in the Queen Mary, University of London network (138.37.0.0/16).
The rest should then fall into place!


Alex Owen
LHC@Home server admin


The best solution, but be forewarned that it could take a couple of days for the changes to propagate. No big problem, but people might see some weird stuff during that time (some machines get to the new servers, while others don't, etc)
38) Message boards : Number crunching : The ghosts in the machine (Message 16811)
Posted 4 May 2007 by EclipseHA
Post:
Well, this round I got 5 ghost WU's where for the last round I got 20 ghosts and one that was actually sent.

(by ghost, I mean that per the website I have them, but the computer was never sent the WU and the logs show it).

Does anyone else see this? (check "results" under "My Account")
39) Message boards : Number crunching : Good news and bad news. (Message 16732)
Posted 24 Apr 2007 by EclipseHA
Post:
Got home today and found one of my systems crunching an LHC WU. It was the only one on that system.

I checked the website, and according to that I have 20 additional WU's on that machine. Check the log, and I can see that only one was actually sent!

Anybody else see a similar thing? The website thinks it sent many more than it actually did?

(the 20 "lost" WU's show as if they're cool, and just waiting for results..)
40) Message boards : Number crunching : Because you asked.... (Message 16714)
Posted 15 Apr 2007 by EclipseHA
Post:
Didn't CERN just have a fairly major problem? I heard it on the news while driving.. (Paul Harvey, I think).. A week or so back..

The "new ring", where they used the wrong units for calcs (like MPH/KPH but that's not it), and a 20 ton magnet kind of got whacked out of shape during a test?

Could we have crunched the bad data?





©2023 CERN