Message boards : Number crunching : Ghosts and Pending Credit

River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15263 - Posted: 31 Oct 2006, 8:47:24 UTC
Last modified: 31 Oct 2006, 9:20:51 UTC

I was shocked to find I have credit pending from the 24th Oct work release, some of which relates to WUs where nobody else has returned work. At first I was cross with the other participants, but then I looked a bit more closely and it is not their fault.

Six computers (none of them mine): A + B + C + D + E + F; and my box

and their results: A + B + C + D + E + F; and my results

Computer A shared more than one of my pending WUs, the others one each.

Box E's owner would have been particularly delighted to crunch and return the task, as it would have been the first on that box.

These six computers were all issued work at 15:08 UTC on 24th Oct, and none of them returned it. Five of them have had other work before or since and returned it promptly; for the other, E, this was a single task and the only work it has received in recent times. Therefore it is clear that these are ghost WUs, that is, tasks the scheduler recorded as issued but which were never received by the respective clients.

All my pending credit from that work release relates to tasks issued to others during 15:08.

EDIT (correction): I was issued this work at 15:05, so at first I thought it was simple: 15:05 ok, 15:08 problem. However my box also got work at 15:08 that day, which was also received by other hosts at that time and crunched ok.

It is also clear that something happened during that minute, 15:08, that caused the error on some boxes and not others. This was probably something on the server (which may still show in a server log?), or if not it must have been something on the CERN LAN, as there is unlikely to be any other common factor.

It is clear why this happened: tasks from the same WU were being issued at the same time, and so were all lost together when whatever happened did happen.

I am reporting this in case it is a useful observation towards curing the ghost WU issue. Anyone on the BOINC dev mailing lists, please feel free to pass it on if you think it would be useful.

And of course to apologise to my six fellow participants in case my brief annoyance leaked via some extra-sensory channel - if your ears were red, blame me ;-)

River~~
Ingleside

Joined: 1 Sep 04
Posts: 36
Credit: 78,199
RAC: 0
Message 15266 - Posted: 31 Oct 2006, 17:21:29 UTC

Well, it won't stop the occasional instances where the scheduler reply never makes it back to the client, but a project can choose to re-issue any "lost" work by changing their config file, specifically by adding
<resend_lost_results/>

For this to work, users must also run BOINC-client v4.45 or later.
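For anyone wondering where exactly that goes: in the project's config.xml it sits inside the <config> block, roughly like the sketch below (the other entries are just stand-ins for whatever the project already has there).

<boinc>
    <config>
        <!-- existing settings stay as they are -->
        ...
        <resend_lost_results/>
    </config>
    <daemons>
        ...
    </daemons>
</boinc>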

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15272 - Posted: 31 Oct 2006, 21:35:49 UTC - in response to Message 15266.  

... a project can choose to re-issue any "lost" work by changing their config file, specifically by adding
<resend_lost_results/>
...


Thanks Ingleside

If this setting is adopted, for how long does this work after the tasks were originally issued, and is the deadline the same as the original, or recalculated from the time of re-issue?

I can see problems either way. If the old deadline is kept, the client may have filled up with work from another project in the meantime. If the deadline is recalculated, some XXXX is going to figure out how to use the feature to extend the deadlines of their work.

Having said which, I'd still welcome this setting on this project.
R~~
Ingleside

Joined: 1 Sep 04
Posts: 36
Credit: 78,199
RAC: 0
Message 15278 - Posted: 1 Nov 2006, 2:57:50 UTC - in response to Message 15272.  

If this setting is adopted, for how long does this work after the tasks were originally issued, and is the deadline the same as the original, or recalculated from the time of re-issue?

I can see problems either way. If the old deadline is kept, the client may have filled up with work from another project in the meantime. If the deadline is recalculated, some XXXX is going to figure out how to use the feature to extend the deadlines of their work.

Having said which, I'd still welcome this setting on this project.
R~~

Not quite sure on the deadline, but if I'm not mistaken it's sometimes the old deadline and sometimes a new deadline, though not necessarily as long as the normal deadline...

Re-issuing will only happen if you're asking for work, and if it looks like the project has been reset or detached/re-attached it will not re-issue any work.

Lastly, results that aren't needed any longer, because the WU has already validated or errored out, will not be re-issued.
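
Roughly speaking, the decision the scheduler would make looks like this little sketch - made-up names, just to summarise the rules above, not the actual BOINC code:

from types import SimpleNamespace as NS

def should_resend(task, request):
    # only considered when the client is actually asking for work
    if not request.asking_for_work:
        return False
    # a reset or detach/re-attach means there is nothing sensible to resend
    if request.looks_like_reset_or_reattach:
        return False
    # no point resending a result the project no longer needs
    if task.wu_validated or task.wu_errored_out:
        return False
    # otherwise: the DB says "issued" but the client doesn't report having it
    return task.not_reported_by_client

print(should_resend(NS(wu_validated=False, wu_errored_out=False, not_reported_by_client=True),
                    NS(asking_for_work=True, looks_like_reset_or_reattach=False)))   # True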

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 209
Message 15292 - Posted: 1 Nov 2006, 22:56:06 UTC - in response to Message 15263.  
Last modified: 1 Nov 2006, 23:02:14 UTC

I've got a couple of ghost WUs assigned to 43358
- those at 22:53:34 29/10 and 3:30:04 30/10 never arrived at the machine.

In case an admin ever tries to hunt such things down, the logs at my end look like

2006-10-30 03:30:03 [---] Insufficient work; requesting more
2006-10-30 03:30:03 [lhcathome] Requesting 21600.00 seconds of work
2006-10-30 03:30:03 [lhcathome] Sending request to scheduler: http://lhcathome.cern.ch/lhcathome_cgi/cgi
2006-10-30 03:30:04 [lhcathome] Scheduler RPC to http://lhcathome.cern.ch/lhcathome_cgi/cgi failed
2006-10-30 03:30:04 [lhcathome] Scheduler RPC to http://lhcathome.cern.ch/lhcathome_cgi/cgi failed
2006-10-30 03:30:04 [lhcathome] No schedulers responded
2006-10-30 03:30:04 [lhcathome] No schedulers responded
2006-10-30 03:30:04 [lhcathome] Deferring communication with project for 1 hours, 14 minutes, and 42 seconds
2006-10-30 03:30:04 [lhcathome] Deferring communication with project for 1 hours, 14 minutes, and 42 seconds

whereas a successfully unsuccessful attempt to get work looks like

2006-10-30 03:07:36 [---] Insufficient work; requesting more
2006-10-30 03:07:36 [lhcathome] Requesting 21600.00 seconds of work
2006-10-30 03:07:36 [lhcathome] Sending request to scheduler: http://lhcathome.cern.ch/lhcathome_cgi/cgi
2006-10-30 03:07:37 [lhcathome] Scheduler RPC to http://lhcathome.cern.ch/lhcathome_cgi/cgi succeeded
2006-10-30 03:07:37 [lhcathome] Message from server: Server can't open database
2006-10-30 03:07:37 [lhcathome] Project is down
2006-10-30 03:07:37 [lhcathome] Project is down
2006-10-30 03:07:37 [lhcathome] Deferring communication with project for 22 minutes and 26 seconds


The last WU's real and the machine is still churning away - and will be for a
couple more weeks judging by how long the other boxes took! :(

Thanks

Henry
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15293 - Posted: 1 Nov 2006, 23:28:12 UTC - in response to Message 15292.  
Last modified: 1 Nov 2006, 23:30:18 UTC

I've got a couple of ghost WUs ...


Thanks Henry, it is very useful that somebody has thought to dig out the logs at the client end and see what they look like.

Both your ghost WUs were completed by other crunchers by the original deadline - on this project that is possible because 5 copies are sent out but only three are needed (assuming the first three back all pass validation). In fact the other 4 tasks from each of your WUs got returned, so your two lost WUs could have gone astray in any router anywhere from CERN to you.

In contrast, the ones I reported got lost on their way to several other boxes, all at the same time. That does suggest that the fault is at the CERN end, before the packets get routed out in different directions. The location can be pinned down more closely because of the multiple destinations involved in the same loss.

By the way, I notice you support team GridPP - do you have anything to do with the migration to QMC ? or are you in a different arm of GridPP ?

...and looking at your boxes' operating systems I am going to have to be careful as you will have spotted where I lifted my avatar from ;-)

River~~
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 209
Message 15298 - Posted: 2 Nov 2006, 1:44:33 UTC - in response to Message 15293.  

... In fact the other 4 tasks from each of your WUs got returned, so your two lost WUs could have gone astray in any router anywhere from CERN to you.

In contrast, the ones I reported got lost on their way to several other boxes, all at the same time. That does suggest that the fault is at the CERN end...


I meant to include that on a couple of occasions I've found clients apparently hung with a connection open to the scheduler, but killed them off without being assigned a ghost.

My assumption (and I'm not a DB admin) would be that WU allocations within the DB are updated before the scheduler sends details back to the client, to save having the scheduler holding idle connections to the DB server open.

Hence may there be a failure mode where WUs get allocated in the database, but an overloaded server means they don't get from the scheduler to the client?
In that case, how many results within a WU are affected may depend on just how swamped the box was at the time, and any timeout settings in the client.
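
To put that guess in concrete terms, here is a toy sketch of the ordering I mean (made-up names, nothing to do with the real scheduler code):

# Toy model: the allocation is committed to the database first,
# and only then is the reply sent to the client.
assigned_in_db = {}        # what the server's database believes
received_by_client = {}    # what the client actually got

def issue_work(host, tasks, reply_delivered=True):
    assigned_in_db[host] = tasks          # step 1: commit the allocation
    if reply_delivered:                   # step 2: send the scheduler reply
        received_by_client[host] = tasks
    # if the reply is lost (swamped server, dropped connection), nothing
    # undoes step 1 - the DB says "issued" but the client never heard: a ghost

issue_work("host_A", ["wu1_task3"])
issue_work("host_B", ["wu1_task4"], reply_delivered=False)

ghosts = {h: t for h, t in assigned_in_db.items() if h not in received_by_client}
print(ghosts)    # {'host_B': ['wu1_task4']}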

...your two lost WUs could have gone astray in any router anywhere from CERN to you. ... before the packets get routed out in different directions. The location can be pinned down more closely because of the multiple destinations involved in the same loss.

http sits on top of TCP, so if any packets get lost this will be noticed and the data resent. Eventually.
Web filters and proxies can do strange things, but they'd be at the client end.

By the way, I notice you support team GridPP - do you have anything to do with the migration to QMC ?


Actually I left GridPP in the summer - that's why I'm stuck here instead of living it up at the Edinburgh meeting, getting the bee^H^H details at first hand. :(

or are you in a different arm of GridPP ?


Well, there wasn't an obvious BOINC arm when I left... so no, I don't know any more than the rest of the public about the actual timescales involved, whether QMUL have people in place for this or need to hire, if QMUL already have BOINC experience or if the new guys will start from scratch, etc.

I'd like to re-iterate the point you've made elsewhere (in /stats empty) though: if the boards show only repeated whining and forceful demands, this project is not going to attract the high-calibre staff it obviously needs.

Henry
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15303 - Posted: 2 Nov 2006, 7:25:37 UTC - in response to Message 15298.  
Last modified: 2 Nov 2006, 7:34:44 UTC

Hi Henry,

thanks for your thoughtful response. Hopefully this dialogue will get read at some stage by incoming admins, and I hope your thinking and mine will be of some assistance to them.

My assumption (and I'm not a DB admin) would be that WU allocations within the DB are updated before the scheduler sends details back to the client, to save having the scheduler holding idle connections to the DB server open.

Not only that, but there is a theoretical issue too - without a distributed transaction protocol between the client and the db, one end always has to commit first, leaving a risk that the other end will be left adrift if the final confirmation gets lost. It is clear that the better solution in terms of not wasting work is for the db to go first, leading to ghost WUs, rather than the client to go first, leading to overly redundant WUs.
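
Put another way, whichever end commits first decides which failure you get when the confirmation is lost - a toy sketch, not real code:

def outcome(db_commits_first, confirmation_lost):
    # what the lost confirmation costs, depending on who committed first
    if not confirmation_lost:
        return "both ends agree"
    if db_commits_first:
        return "DB: issued, client: nothing -> a ghost WU"
    return "client: crunching, DB: unaware -> task also sent elsewhere, extra redundancy"

print(outcome(db_commits_first=True, confirmation_lost=True))
print(outcome(db_commits_first=False, confirmation_lost=True))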

Hence may there be a failure mode where WUs get allocated in the database, but an overloaded server means they don't get from the scheduler to the client?
...


At least one - more likely, imo, many failure modes, each of them occurring rarely. Just off the top of my head: the server script can fail after the db commit, a packet can get lost en route, the client machine may reboot at exactly the wrong time, etc etc. In short, anything that can go wrong after the db commit will produce this symptom, and that includes the machines at both ends and en route.

Hey! thinking about this list it is lucky we see so few ghosts ;-)

...your two lost WUs could have gone astray in any router anywhere from CERN to you. ... before the packets get routed out in different directions. The location can be pinned down more closely because of the multiple destinations involved in the same loss.

http sits on top of TCP, so if any packets get lost this will be noticed and the data resent. Eventually.

Eventually is too late if by then the client end has timed out, unilaterally closed the TCP connection, and put up the "no schedulers responded" message.

Eventually is impossible if a router is overloaded and needs rebooting (a common problem with user-level ADSL routers when they run out of buffer space, typically provoked by excessive peer-to-peer traffic). Eventually is irrelevant if a net cable has been pulled out! Apache will retry only so many times and then give up.

If my "many rare faults" hypothesis is correct, extending the TCP timeout in the client would redeem a few potential ghosts, but at the cost of the client waiting longer on a genuinely dead connection - which means the client would wait longer to go on to try the next project. I'd personally be reluctant to advocate an increased timeout, and if someone else wants to then the trade off has to be addressed.

R~~
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 209
Message 15376 - Posted: 7 Nov 2006, 1:41:12 UTC - in response to Message 15303.  

Not only that, but there is a theoretical issue too - without a distributed transaction protocol between the client and the db, one end always has to commit first, leaving a risk that the other end will be left adrift if the final confirmation gets lost. It is clear that the better solution in terms of not wasting work is for the db to go first, leading to ghost WUs, rather than the client to go first, leading to overly redundant WUs.


I'd disagree - WUs are already redundant, and I think it makes better theoretical sense to risk excess redundancy than excess ghosts, which might mean that a WU never gets crunched (on the first pass, anyway).
But I don't see a sensible way to implement that with http - I don't believe there's a way for a script to know that the client has received its output, so you'd end up having to use multiple transactions and so more work for the DB.

Hey! thinking about this list it is lucky we see so few ghosts ;-)


We see so few real ones ... maybe we're just hallucinating WUs now :)

Henry
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15379 - Posted: 7 Nov 2006, 8:25:45 UTC - in response to Message 15376.  

...
I'd disagree - WUs are already redundant, and I think it makes better theoretical sense to risk excess redundancy than excess ghosts, which might mean that a WU never gets crunched (on the first pass, anyway).

We have an honest difference of opinion on that, Henry. I'd say that the WUs are already excessively redundant (with 2/5 of the work sent not being needed for quorum) and we don't need any more.

But I don't see a sensible way to implement that with http - I don't believe there's a way for a script to know that the client has received its output


I think there is more than one way to do it, but none of them all that good.

Using a double connection, you'd maybe implement a status of "Issued" (to prevent multiple issue during the handshake), updated to "Accepted" when the client confirmed with a second http connection. This would double the connection count, which is already the bottleneck on a number of projects (it has been on Einstein in the past, I know). In addition, you'd need a mechanism for "Issued" work to count as never sent when the date sent was more than X in the past, and this would add to the processing cost of choosing work to send.

It can be done without an extra connection by waiting for the next natural connection to complete the handshake. On issue the task is marked as "Issued". On each connection the client sends a list of its queued work. On the next connection from that client, if the result is returned then the code is as now. If the task is not ready to be returned, it is marked either as "InQueue" (if the client reports it as still queued) or as "Failed download" / some new status ("Ghost") that would trigger creation of a new task, just as an errored task does. For most server purposes there is no difference between "InQueue" and "Issued", but it lets participants and admins know whether the work was still on the client's HD the last time it connected.
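
As a sketch of the bookkeeping (invented names again - nothing like this exists in the current server code):

def update_statuses(issued, reported_queue, being_returned):
    # issued:          task names the DB believes this host holds ("Issued")
    # reported_queue:  task names the client says are still queued on its disk
    # being_returned:  task names coming back in this very connection
    statuses, to_regenerate = {}, []
    for name in issued:
        if name in being_returned:
            statuses[name] = "Returned"   # handled exactly as now
        elif name in reported_queue:
            statuses[name] = "InQueue"    # still on the client - all is well
        else:
            statuses[name] = "Ghost"      # never arrived, or deleted at the client end:
            to_regenerate.append(name)    # regenerate and send elsewhere, like an errored task
    return statuses, to_regenerate

# e.g. a host that was sent three tasks but only ever received two of them:
print(update_statuses(["t1", "t2", "t3"], reported_queue={"t2"}, being_returned={"t1"}))
# -> ({'t1': 'Returned', 't2': 'InQueue', 't3': 'Ghost'}, ['t3'])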

Advantages over resending would be that a dodgy client would not indefinitely delay the same WU, but would instead slightly delay lots of them. Where a project is waiting for the last WU to complete, this is probably a better outcome. The "time out" on a ghost WU reduces from the project deadline to the host connect interval. And if the next connect never happens, or is after the deadline, then everything acts as it does now, with the work reissued due to timeout.

An advantage over the double-connect is that this scheme spots work that is deleted at the client end after being successfully received, so covers more error conditions than the double-connect.

The expensive part of the db work - the need to search the db for all work issued to that host - is exactly the same as for resend.

This is my preferred solution. There is no increase in redundancy, but it reduces the bad effects of ghost WUs. I'd suggest that the option to resend should still exist, but that there should always be a check that the result is held on the client. The resend option would thus control whether the work was resent to the same host, or marked as bad so that a regenerated task can be sent elsewhere.

R~~
KWSN - A Shrubbery

Joined: 3 Jan 06
Posts: 14
Credit: 32,201
RAC: 0
Message 15414 - Posted: 11 Nov 2006, 0:39:04 UTC
Last modified: 11 Nov 2006, 0:56:05 UTC

I can't tell you how it's being implemented, but I know it is possible. With Rectilinear Crossing Number I often see a message that result xxxxx.xx is being resent due to download problems.

Every time one of my hosts connects, the server manages to compare my queue with what it thinks I should have and resends the missing ones.
