Message boards : Number crunching : Code Review for Zero CPU problem => Results ....

Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6420 - Posted: 5 Mar 2005, 15:12:12 UTC

I submitted the results of this code review to the BOINC Developers list this morning. For what it is worth, I will post it here so people can tell me that I am all wet ...

-----------------
Well, I am not sure I have all of it, or *ANY* of it, but I did find some disturbing things ...

I will grant that I am not a "C" coder, and I may even have the wrong version of the files in that I used the boinc_public side. Looking at the BOINC directory, though, I seem to find the same code ...

In module app_control.c the code:
bool ACTIVE_TASK::handle_exited_app(unsigned long exit_code) {
    get_app_status_msg();
    get_trickle_up_msg();
    result->final_cpu_time = checkpoint_cpu_time;

will set the final CPU time *BACK* to the checkpoint time. The problem here, of course, is the assumption that the checkpoint time is the time we should save on exit. In fact, this should be something like:

result->final_cpu_time = result->final_cpu_time + checkpoint_cpu_time;

Feel free to translate that into the less clear C++ version, to add to the confusion.

At any rate, the current code resets the final time to the checkpoint time when it should be the checkpoint time that currently exists plus the time since the last checkpoint.

Because checkpointing is voluntary, at the discretion of the BOINC API, it is possible that there has *NOT* been a checkpoint prior to this segment of code; since the checkpoint time is initialized to zero at start, this will set final_cpu_time to zero.
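To make the point concrete, here is a minimal sketch (my own names and code, *NOT* the actual BOINC source) of the kind of guard I am arguing for, so that a zero, never-written checkpoint time can never wipe out the measured total:

    // Sketch only -- names invented for illustration, not the BOINC API.
    // Prefer the measured running total, but never report less than the
    // last checkpoint; a never-written (zero) checkpoint cannot zero the
    // result, and a stale total cannot fall below the checkpoint either.
    #include <algorithm>

    double final_cpu_time_on_exit(double checkpoint_cpu_time,
                                  double measured_total_cpu_time) {
        return std::max(checkpoint_cpu_time, measured_total_cpu_time);
    }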

This portion of code does seem to *POINT* to the LHC@Home problem in that there is evidence that the susceptibility of the system to return zero CPU time is also related to checkpointing, though that evidence is not much better than rumors.

Even more suggestive is the fact that the problem is experienced primarily by LHC@Home, which has variable-execution-length work units. Because of the instability of their processing, this would certainly raise *MY* suspicions. Not every work unit comes back with all-zero (or all non-zero) CPU times across its returned results, but the very instability of the models themselves lends itself to the possibility that the model "dies" early on some of the systems, possibly due to insufficient enforcement of Homogeneous Redundancy (I had one WU where one result was done by one of my Intel processors and another result was done by an AMD; example: http://lhcathome.cern.ch/workunit.php?wuid=75353).

Note in the previously mentioned example that the results were from 3 different processors ...

==================================
Next, the URL: http://boinc.berkeley.edu/client_app.php contains the statements:

* checkpoint_cpu.xml CPU time (from start of WU) at last checkpoint. Written by checkpoint_completed.

This is not correct. The file is written, when not using shared memory, by "static int update_app_progress(double cpu_t, double cp_cpu_t, double /*ws_t*/)", which is called by checkpoint_completed and boinc_finish (there may be other references outside the BOINC_api.c module that I did not find).

So, the documentation page should also carry the note that the file is not written when using shared memory segments.
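For what it is worth, the non-shared-memory path being described would look something like the sketch below. This is *NOT* the real update_app_progress; the function name and tag name here are my own stand-ins, purely to illustrate that the file only appears when checkpoint_completed or boinc_finish actually reaches a write like this:

    // Sketch only -- invented names and an assumed file layout; not BOINC code.
    #include <cstdio>

    static int write_checkpoint_cpu_file(double checkpoint_cpu_time) {
        std::FILE* f = std::fopen("checkpoint_cpu.xml", "w");
        if (!f) return -1;   // if this fails silently, the core never sees a time
        std::fprintf(f, "<checkpoint_cpu_time>%f</checkpoint_cpu_time>\n",
                     checkpoint_cpu_time);
        std::fclose(f);
        return 0;
    }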

Also, at the bottom there is a note about the increase in disk traffic; since I could not find examples of the fraction_done and checkpoint_cpu files, is this OBE (overtaken by events)?

The statement "If on exit there is no checkpoint_cpu.xml, it means the app called exit(0) rather than boinc_finish(). In this case the core measures the child CPU itself." also does not appear to be correct.

=================================
In module cs_apps.c

The statements:
            dtime()-rp->final_cpu_time,
                // KLUDGE - should be result start time
            rp->final_cpu_time,

do not inspire confidence: not only does the comment suggest that the code is some sort of workaround, but it is, in fact, not even close to being correct either.
ID: 6420
Glenn Vidra
Joined: 28 Sep 04
Posts: 1
Credit: 2,812
RAC: 0
Message 6421 - Posted: 5 Mar 2005, 15:43:52 UTC - in response to Message 6420.  
Last modified: 5 Mar 2005, 16:03:09 UTC

Yes, you are right on both counts: the exit point and run time aren't being written. The same thing happened on Einstein and was supposedly an easy fix, so why not contact them? I'm sure they will be willing to give a hand. You guys are trying to complete a lot of work in a short period of time and Mr. Murphy has decided to add his two cents; you're doing a great job, keep it up. As far as taking down the project until the ZERO credit problem is resolved: leave it up. Losing a few credits will not make any difference to me as long as you don't lose the data.

Thats My Story And I'm Sticking To IT!!!

ID: 6421
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6423 - Posted: 5 Mar 2005, 17:40:22 UTC

Glenn,

I did better than that. I posted it to the developers' mailing list, so that all developers (and non-developers like me) who subscribe to that list can see what I think I have found.
ID: 6423
Alex
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 6427 - Posted: 5 Mar 2005, 19:49:39 UTC

I think the code you're looking at is designed to handle the 'unexpected exit'.
ID: 6427
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6430 - Posted: 5 Mar 2005, 21:35:52 UTC - in response to Message 6427.  
Last modified: 5 Mar 2005, 21:51:44 UTC

> I think the code you're looking at is designed to handle the 'unexpected
> exit',

It may very well be, but who is to say that the system is not having exactly that problem? In that case, the comment above the code is incorrect, in that it states:

// deal with a process that has exited, for whatever reason
// (including preemption)


Regardless of the exit condition, the final CPU time should NOT be the last checkpoint CPU time. It should be the last checkpoint time plus the additional time since the last checkpoint.

This item may also explain why, the other day, when I was watching an Einstein@Home Work Unit complete, the time jumped BACKwards by nearly 10 minutes.

At any rate, I don't claim that this is the problem; but, problem or not, I don't think the items I pointed to are right.

My point was that this code, or code like it elsewhere, could be the issue: we think we know that LHC@Home has a problem involving checkpointing and an issue with improper timekeeping, and this code sits right at that intersection.

[edit]
I do see where I missed one point: the prior call to get_app_status_msg() should get the last checkpoint time from the science application. However, it still does not do this in a "safe" manner, in that if the Science Application does NOT report checkpoint time plus current run time, each pass through this routine actually trims off some of the run time.

Which explains why I have seen the total run time on two Einstein@Home WUs "leap back" by 6-8 minutes. I have not "trapped" this on other projects yet.
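If anyone wants to see how cheap the fix for the "leap-back" symptom would be, here is a sketch (invented names, not a patch against the real client) of a reported-time accumulator that simply refuses to move backwards:

    // Sketch only: a monotonic CPU-time tracker. A status message that
    // carries only the last checkpoint time can no longer trim the total.
    #include <algorithm>

    struct ReportedCpuTime {
        double value = 0.0;
        void update(double time_from_status_msg) {
            value = std::max(value, time_from_status_msg);
        }
    };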

Thierry,

Question for you: are you running Einstein@Home, and do you also see that "leaping back" behavior?
ID: 6430
Jayargh
Joined: 24 Oct 04
Posts: 79
Credit: 257,762
RAC: 0
Message 6432 - Posted: 5 Mar 2005, 22:15:36 UTC - in response to Message 6430.  


> Which explains why I have seen the total run time on two Einstein@Home WU
> "leap-back" by 6-8 minutes. I have not "trapped" this on other projects yet.
For me, Einstein goes to 99%, uploads, and shows no time differential. But I have noticed this jump-back by a minute or two in Predictor: ver 4.22 used to go to 91%, process at 97% for the 1-2 minute upload, and jump back to the 91% time value for the final report. Ver 4.24 now runs to 100%, jumps back to 97% to process, uploads, and seems to report the last time running at 100%.
ID: 6432
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6433 - Posted: 6 Mar 2005, 1:02:53 UTC - in response to Message 6432.  

> For me Einstein goes to 99% uploads and no time differential. But have
> noticed this jumpback by a minute or two in Predictor,ver 4.22 used to go to
> 91% process at 97% for the 1-2 minutes upload and jump back to the 91% time
> value for final report. ver 4.24 now runs while at 100% jumps back to 97% to
> process,uploads and seems to report last time running at 100%.


Ah, good ... it is not only me! :)

I had just been watching the ending of Work Units to see what the end stage looks like ... and I am getting about 3-4 minutes of processing at 100%, a jump back to 97%, and then a finish.

The elapsed time works fine, though ...

On Einstein@Home I have been seeing it get to about 4 minutes from the projected end and then, POW! Done! ... with the end time consistent with the time remaining ... it looks like you "lost" a few minutes: if you are watching, you see that it will be done in about half an hour, you come back, and it is done, but missing about 4 minutes ...

Which calls to mind Tom Lehrer's statement about base 8 ... "Base 8 is just like base ten really, ... if you are missing two fingers ..."

:)
ID: 6433
Alex
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 6438 - Posted: 6 Mar 2005, 3:51:03 UTC - in response to Message 6430.  


>
> Regardless, of the exit condition, the final CPU time should NOT be the last
> checkpoint CPU time. It should be the last checkpoint time plus the
> additional time since the last checkpoint.
>
>

I think it should be. The checkpoint is the last known point where the program was running correctly.
If the computing application doesn't run correctly, and crashes or exits with an error code, then all of the data from the computing app after that last known good point is suspect.

To debug this, you'd have to look at the result returned from the computing app, and see if it exited properly.
ID: 6438
Alex
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 6440 - Posted: 6 Mar 2005, 8:30:53 UTC

.. and.. looking at some results... the code review should also look at why certain users have this error and some don't.

ie..

http://lhcathome.cern.ch/workunit.php?wuid=6959

You may be noticing that it's checkpointing, but that may just be a symptom of the problem, not the problem itself. Perhaps BOINC is having trouble opening a temporary file (a state file, for example) and isn't trapping the error, so the error occurs at a lower level, never gets trapped, and the BOINC client is clueless because the subroutine the error occurred in returned (0), telling the rest of the program that things ran OK.
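To illustrate the failure mode I mean (invented names, not actual BOINC code): a low-level routine that swallows its own failure and still returns (0) leaves every caller above it convinced the work succeeded:

    // Sketch only -- an invented example of an untrapped low-level error.
    #include <cstdio>

    static int write_state_file(const char* path) {
        std::FILE* f = std::fopen(path, "w");
        if (!f) {
            // the real error happens here but is never reported upward
            return 0;        // caller is told everything is fine
        }
        std::fprintf(f, "ok\n");
        std::fclose(f);
        return 0;
    }

    int main() {
        if (write_state_file("/no/such/dir/state.xml") == 0) {
            std::printf("looks successful, even though nothing was written\n");
        }
        return 0;
    }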

Maybe some people are seeing this more often based on how fast their PC is switching between projects.

My questions are:
What version of BOINC is running on the machines where people's clients actually report a computing error (as opposed to the machines that don't report an error but get zero computation time)?

Is there a difference in settings between people who have this error and those who do not? I.e., are they 'switching between projects' every hour or every 3 minutes?

I'm not the LHC Alex. Just a number cruncher like everyone else here.
ID: 6440
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6442 - Posted: 6 Mar 2005, 9:34:36 UTC - in response to Message 6438.  

> I think it should be. The checkpoint is the last known point where the
> program was running correctly.
> If the computing application doesn't run correctly, and crashes, or exits with
> an error code, then all of the data from the computing app after that last
> known good point is suspected bad.

OK, agreed. *IF* this is an abend, then the time should be the checkpoint time. If it is a normal exit, it should be the current total, not the last checkpoint.

> To debug this, you'd have to look at the result returned from the computing
> app, and see if it exited properly.

Well, in many, if not most, of the WU/Results I have looked at, the exit code is (0), success.

In Alex's WU 6959, all computers had exit state (0) and all were running the same version of sixtrack, though one was running v4.61 (?) of the BOINC daemon.

Anyway, to repeat, I am not saying that this is the problem. I just do not believe that this is coded correctly. Even if the intent is correct, there is no attempt to ensure that a bad checkpoint time does not overwrite a valid total time.

In any loosely coupled, message-passing system, you have to assume as a first principle that messages will be lost, and that time synchronization and message ordering will also potentially be unstable.

The next point that occurs to me is that I did not see a message-queueing mechanism to keep the two sub-systems in full communication. If there is no message queue, only a single-message-deep message "drop", a second message could overwrite the first before the other sub-system "saw" the first message.
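A toy version of what worries me (my own code, not the BOINC IPC layer): with a one-slot "drop", the second post silently destroys the first before the reader ever polls it; a real queue would not lose it:

    // Sketch only -- a single-message-deep "drop" mailbox.
    #include <optional>
    #include <string>

    struct OneSlotDrop {
        std::optional<std::string> slot;

        void post(const std::string& msg) { slot = msg; }   // overwrites an unread message
        std::optional<std::string> poll() {
            std::optional<std::string> m = slot;
            slot.reset();
            return m;
        }
    };

    // post("checkpoint done"); post("exiting");
    // poll() now returns only "exiting" -- "checkpoint done" was never seen.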

So, I guess that is the next thing I can look at ...

ID: 6442
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6444 - Posted: 6 Mar 2005, 9:37:52 UTC
Last modified: 6 Mar 2005, 9:38:50 UTC

My questions are:
What version of BOINC is running on the machines where people's clients actually report a computing error (as opposed to the machines that don't report an error but get zero computation time)?

Is there a difference in settings between people who have this error and those who do not? I.e., are they 'switching between projects' every hour or every 3 minutes?
==========

I get the 0 Time report error & I'm running LHC exclusively on all my computers, so there's no switching between projects for me. I get the error with v4.19 - v4.24 - v4.25 ... The error occurs mostly with the v64lhc type WUs; I don't see it that much with the v64boince type WUs, although I will see it occasionally with them too ...
ID: 6444
Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6446 - Posted: 6 Mar 2005, 11:30:45 UTC - in response to Message 6444.  

> My question are..
> what version of boinc is running on the machines where people's clients
> actually report a computing error.(as opposed to the machines that don't
> report an error but 'get zero computation time')

I first ran with CC v4.19, and since yesterday with v4.25. I still see the problem of 00:00:00 CPU time.

> is there a difference in settings between people who have this error and those
> who do not. Ie.. are they 'switching between projects' every hour or every 3
> minutes.

My switching is set to every hour.

> I get the 0 Time report Error & I'm running LHC exclusively on all my
> computers, so theres no switching between projects with me. I get the error
> with v4.19 - v4.24 - v4.25 ... The Error occurs mostly with the v64lhc type
> WU's, I don't see it that much with the v64boince type WU's although I will
> see it occasionally with them too ...

Same experience here: only a few problems with the v64boince type of WU but a lot with the v64lhc WUs.

Best greetings from Belgium
Thierry
ID: 6446
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6448 - Posted: 6 Mar 2005, 15:46:47 UTC

Case in point: I just had one v64boince WU reset from over 9 hours to 3:26:00 when it finished and uploaded ... D'oh
ID: 6448
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6452 - Posted: 6 Mar 2005, 17:47:39 UTC

Yes, well, my research shows that this is a problem that has been with us since at least September, when I first started collecting data.

I have not added the last couple hundred records (206 right now) so I don't know how many I have "in the wings" at the moment.

But this is the first "sighting" of one that reset the time to a number that is greater than zero but not even close to right.

What is your checkpoint interval? Is it the default 60 seconds?
ID: 6452
Alex
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 6455 - Posted: 6 Mar 2005, 19:01:48 UTC - in response to Message 6452.  


>
> What is your checkpoint interval? Is it the default 60 seconds?
>

Well, here are my settings:
Switch between applications every (recommended: 60 minutes): 60 minutes
Write to disk at most every: 40 seconds
I'm not the LHC Alex. Just a number cruncher like everyone else here.
ID: 6455
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6457 - Posted: 7 Mar 2005, 0:03:05 UTC

What is your checkpoint interval? Is it the default 60 seconds?
==========

Yes, I just left it at the Default ...
ID: 6457
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 6458 - Posted: 7 Mar 2005, 1:23:51 UTC

But this is the first "sighting" of one that reset the time but to a number that is greater than zero but not even close to right.
==========

I've seen them do that before; I sat there watching a 10+ hour WU and saw it reset to just 23 minutes when it finally finished ...
ID: 6458
Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 6477 - Posted: 7 Mar 2005, 12:47:50 UTC

Many thanks, guys, for what seems to us to be a big step forward.

We agree that with the present uncertainties of writing BOINC-Sixtrack checkpoints (including committing them properly to client disk using Fortran I/O once BOINC allows us to write them), plus the bug Paul reports in app_control.c, the potential ingredients are there to explain our current problems with both crediting and resuming results.

We await with great interest the reaction of the BOINC team to the app_control.c bug report, and are currently working on improvements that may be needed to checkpointing on our side.

Ben Segal / LHC@home
ID: 6477
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6488 - Posted: 7 Mar 2005, 15:42:58 UTC - in response to Message 6477.  
Last modified: 7 Mar 2005, 15:54:58 UTC

> Many thanks guys for what seems to us to be a big step forward.
>
> We agree that with the present uncertainties of writing BOINC-Sixtrack
> checkpoints (including committing them properly to client disk using Fortran
> I/O once BOINC allows us to write them), plus the bug Paul reports in
> app_control.c, the potential ingredients are there to explain our current
> problems with both crediting and resuming results.
>
> We await with great interest the reaction of the BOINC team to the
> app_control.c bug report, and are currently working on improvements that may
> be needed to checkpointing on our side.

Ben,

I never said it was a bug; I said it was poor programming to assume that messages from loosely coupled systems would always be what we expect, and correct.

I THINK that it may be the source of the problem, because of the assumption it makes and the unprotected way the system is updated from message communication.

Even more troubling to me is that the checkpoint transaction seems to be a complex one, with at least four different messages or calls, as in the list below (a rough code sketch of the handshake follows it):

1) BOINC Daemon sets "checkpoint permitted"
2) Science Application (SA) checks to see if a checkpoint is permitted {#1}
3) If a checkpoint is permitted, the SA performs the checkpoint and sends "CP end" to the Daemon {#2}
4) Daemon calls the SA to get the checkpoint data {#3}
5) Daemon performs the call to "checkpoint done" {#4}
6) SA, polling "CK done", gets the signal {#5} and resumes processing
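Written out as code (my own reconstruction and naming, nothing taken from the BOINC sources), the happy path of that handshake looks roughly like this; note that nothing in it protects against a step's message being lost or overwritten before the other side polls it:

    // Sketch only -- invented names; a reconstruction of the handshake as I read it.
    enum class CheckpointStep {
        DaemonPermitsCheckpoint,    // 1) Daemon sets "checkpoint permitted"
        AppSeesPermission,          // 2) SA polls and sees the permission   {#1}
        AppWritesCheckpoint,        // 3) SA checkpoints, sends "CP end"     {#2}
        DaemonReadsCheckpointData,  // 4) Daemon fetches the checkpoint data {#3}
        DaemonSignalsDone,          // 5) Daemon signals "checkpoint done"   {#4}
        AppResumes                  // 6) SA polls "CK done" and resumes     {#5}
    };

    // Happy path only; a lost message at any step stalls the whole sequence.
    CheckpointStep next(CheckpointStep s) {
        switch (s) {
            case CheckpointStep::DaemonPermitsCheckpoint:   return CheckpointStep::AppSeesPermission;
            case CheckpointStep::AppSeesPermission:         return CheckpointStep::AppWritesCheckpoint;
            case CheckpointStep::AppWritesCheckpoint:       return CheckpointStep::DaemonReadsCheckpointData;
            case CheckpointStep::DaemonReadsCheckpointData: return CheckpointStep::DaemonSignalsDone;
            case CheckpointStep::DaemonSignalsDone:         return CheckpointStep::AppResumes;
            case CheckpointStep::AppResumes:                return CheckpointStep::AppResumes;
        }
        return s;
    }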

This presupposes that I did not mess it up or misunderstand the code, both highly likely. Since I could not find documentation on what UCB would call the FSM for checkpointing, there is no way for me to know what the designer's intent was.

Now, that *IS* a flaw, though not a true "bug", and one I have been complaining about since early beta. What most people don't seem to get about documentation is that it conveys intent in a form that *ALL* can read. Then the intent can be compared to actuals and corrections made.

The current "mania" for so-called "lightweight" processes mostly assumes, as a first principle, that documentation is of no use. However, the idea that documentation is an entity disconnected from the system is also flawed.

If you will, the contents of my site are, in fact, a partial design document. My most serious lack is that I have to assume, infer, or guess at the business rules embodied in the system ...

Did I mention that I used to be a systems engineer and this is one of my "hot" buttons? ... :)

ID: 6488
Trane Francks
Joined: 18 Sep 04
Posts: 71
Credit: 28,399
RAC: 0
Message 6521 - Posted: 9 Mar 2005, 12:49:24 UTC - in response to Message 6488.  

> I never said it was a bug, I said it was poor programing to assume that
> messages from loosly coupled systems would always be what we expected and
> correct.

That would be a bug, Paul, in my opinion. As part of my living, I write software. If software isn't prepared to appropriately handle the messages it receives, that's as much a bug as it would be to receive one datatype while expecting another (crash).

ID: 6521