I have been thinking about Wilhelm Bierbaum's talk at the last GitMerge conference [1] in which he describes a scheme for using Bloom filters to make the initial reference advertisement less expensive. In his scheme (if I understand correctly) the client starts off the conversation by passing the server a Bloom filter that indicates what (refname, SHA-1) pairs the client already has. This makes it unnecessary for the server to advertise those references, thereby reducing the cost of incremental fetches dramatically if the server has very many references. Because Bloom filters have false positives, this scheme is not 100% reliable. Therefore I don't think we would want Git to depend on it. But it got me thinking about how the client could use a Bloom filter in a later stage of the negotiation, when telling the server what objects it already has, while preserving 100% reliability. The idea is to use connectivity information to correct mistakes caused by reliance on the Bloom-filter: 1. The server advertises the references that it has in the way that it is currently done. 2. The client advertises the objects that it has (or some subset of them; see below) via a Bloom filter. 3. The server sends the client the packfile that results from assuming that the Bloom filter is giving correct answers. (This might mean that too few objects are sent to the client.) 4. The client does a connectivity check on the packfile. If any objects are missing, it asks the server for them via a reliable (non-Bloom-filter-based) request. How would one construct the Bloom filter for step 2? (Remember that a properly-configured Bloom filter requires about 5 bits of space per value stored for each factor of 0.1 in the false-positive rate. So, for example, to store 5000 values with a 1% false-positive rate, the Bloom filter would need to be approximately 5000 * 10 bits = 6.2 kB in size.) Here are some possible schemes: * Record *all* objects in the Bloom filter. The Git repo has approximately 200k objects, so, supposing that we could live with a 10% false-positive rate (see below), the Bloom filter would need to be about 125 kB. * Record all commit objects in the Bloom filter. For the Git repo that is about 40k commits, so for a 10% error rate the Bloom filter would have to be about 25 kB. * Record some subset of commits; for example, all unique branch and tag tips, the peeled tags, plus some sparse subsets of commits deeper in the history. The ls-remote for the Git repo lists 1730 unique SHA-1s, so, supposing we choose 10x that number with a 1% error rate, the Bloom filter would be about 20 kB. * Record only the branch and tag tips and peeled tags. Please note that for situations where the client has fetched from the server before and still has the remote-tracking references from that fetch, this scheme might work surprisingly well. For the Git repository, with a 1% error rate, this would be about 2 kB. For the first two schemes, we could tolerate a pretty high error rate because the server could perform additional consistency checks to reduce the error rate. For example, if the Bloom filter reports that the client has commit X, but that the client does *not* have a parent of X, then the server can assume that the check of X was a false positive and discard it. Such consistency checks would not be possible with the third or fourth schemes, so I have chosen lower false-positive rates for those schemes. Additional points: * The client can decide what to include in the Bloom filter. For example, if it has done a recent fetch from the server, it might want to send only the remote-tracking branch tips. But if it has never fetched from this server before, it might want to send all commits. * A Bloom filter could be computed at repack time rather than at each fetch. On fetch, the precomputed Bloom filters could be loaded, any loose objects added to it, and the result sent to the server. I don't have a gut feeling about the cost of this phase of the negotiation, so I don't know whether this would be a net savings, let alone one that is worth the added complexity. But I wanted to document the idea in case somebody thinks it has promise. (I have no plans to pursue it.) Michael [1] http://git-merge.com/videos/scaling-git-at-twitter-wilhelm-bierbaum.html -- Michael Haggerty mhagger@xxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html