Bloom filters for have/want negotiation

I have been thinking about Wilhelm Bierbaum's talk at the last Git
Merge conference [1], in which he describes a scheme for using Bloom
filters to make the initial reference advertisement less expensive.

In his scheme (if I understand correctly) the client starts off the
conversation by passing the server a Bloom filter that indicates what
(refname, SHA-1) pairs the client already has. This makes it unnecessary
for the server to advertise those references, thereby reducing the cost
of incremental fetches dramatically if the server has very many references.
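
For concreteness, here is a minimal sketch of such a filter in Python
(neither Git nor, as far as I know, Bierbaum's implementation uses
Python; the double-hashing trick for deriving k bit positions from a
single digest is standard):

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits, k_hashes):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, key):
            # Derive k indices from one SHA-1 digest
            # (the Kirsch-Mitzenmacher double-hashing trick).
            d = hashlib.sha1(key.encode()).digest()
            h1 = int.from_bytes(d[:8], 'big')
            h2 = int.from_bytes(d[8:16], 'big') | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(key))

The client would add one entry per reference, something like
filter.add("refs/heads/maint " + sha1), and ship the bit array to the
server.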

Because Bloom filters have false positives, this scheme is not 100%
reliable. Therefore I don't think we would want Git to depend on it.

But it got me thinking about how the client could use a Bloom filter in
a later stage of the negotiation, when telling the server what objects
it already has, while preserving 100% reliability.

The idea is to use connectivity information to correct any mistakes
caused by relying on the Bloom filter (a sketch follows the list):

1. The server advertises the references that it has, just as it does
now.
2. The client advertises the objects that it has (or some subset of
them; see below) via a Bloom filter.
3. The server sends the client the packfile that results from assuming
that the Bloom filter's answers are correct. (This might mean that too
few objects are sent to the client.)
4. The client runs a connectivity check on the packfile. If any objects
are missing, it asks the server for them via a reliable
(non-Bloom-filter-based) request.
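
To make steps 3 and 4 concrete, here is a rough in-process simulation.
The object graphs are plain dicts mapping an object id to the ids it
references; none of these names correspond to real Git plumbing:

    def enumerate_pack(server_graph, tips, client_filter):
        # Step 3: walk from the advertised tips, pruning the walk at
        # any object the Bloom filter claims the client already has.
        # A false positive prunes too eagerly, so the resulting pack
        # may be incomplete.
        pack, stack = set(), list(tips)
        while stack:
            obj = stack.pop()
            if obj in pack or obj in client_filter:
                continue
            pack.add(obj)
            stack.extend(server_graph[obj])
        return pack

    def connectivity_gaps(client_objects, pack, links):
        # Step 4, on the client: every object referenced from the new
        # pack must be either in the pack or already present locally.
        # Anything else is re-requested exactly, restoring reliability.
        have = set(client_objects) | pack
        return {ref
                for obj in pack
                for ref in links[obj]
                if ref not in have}

Here links[obj] stands for the references (parents, trees, etc.)
recorded in the pack's objects.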

How would one construct the Bloom filter for step 2? (Remember that a
properly-configured Bloom filter requires about 5 bits of space per
stored value for each factor of 10 reduction in the false-positive
rate. So, for example, to store 5000 values with a 1% false-positive
rate, the Bloom filter would need approximately 5000 * 10 bits =
50,000 bits, or about 6.2 kB.)
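
In code, the underlying sizing formula m = -n * ln(p) / (ln 2)^2 bits
(about 4.8 bits per value per factor of 10 in p, rounded to 5 above)
would be:

    import math

    def bloom_size_bytes(n_values, fp_rate):
        # Optimal number of bits for n values at false-positive rate p.
        bits = -n_values * math.log(fp_rate) / math.log(2) ** 2
        return math.ceil(bits / 8)

    bloom_size_bytes(5000, 0.01)   # 5991 bytes, close to the ~6.2 kB above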

Here are some possible schemes (their sizes are recomputed in the
sketch following the list):

* Record *all* objects in the Bloom filter. The Git repo has
approximately 200k objects, so, supposing that we could live with a 10%
false-positive rate (see below), the Bloom filter would need to be about
125 kB.

* Record all commit objects in the Bloom filter. For the Git repo that
is about 40k commits, so for a 10% error rate the Bloom filter would
have to be about 25 kB.

* Record some subset of commits; for example, all unique branch and tag
tips, the peeled tags, plus some sparse subsets of commits deeper in the
history. The ls-remote for the Git repo lists 1730 unique SHA-1s, so,
supposing we choose 10x that number with a 1% error rate, the Bloom
filter would be about 20 kB.

* Record only the branch and tag tips and peeled tags. Note that when
the client has fetched from the server before and still has the
remote-tracking references from that fetch, this scheme might work
surprisingly well. For the Git repository, with a 1% error rate, this
would be about 2 kB.
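
Plugging the four schemes into bloom_size_bytes() from the sketch above
reproduces these estimates (a shade under the figures in the text,
which use the rounder 5 bits per value per factor of 10):

    for label, n, p in [("all objects",  200000, 0.10),
                        ("all commits",   40000, 0.10),
                        ("sparse subset", 17300, 0.01),
                        ("tips only",      1730, 0.01)]:
        print("%-14s %6.1f kB" % (label, bloom_size_bytes(n, p) / 1024.0))
    # all objects     117.0 kB
    # all commits      23.4 kB
    # sparse subset    20.2 kB
    # tips only         2.0 kB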

For the first two schemes, we could tolerate a pretty high
false-positive rate, because the server can run additional consistency
checks to weed out false positives. For example, if the Bloom filter
reports that the client has commit X but does *not* have a parent of X,
then the server can assume that the hit on X was a false positive and
discard it (a sketch follows). Such consistency checks would not be
possible with the third or fourth schemes, which is why I chose lower
false-positive rates for them.
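
A sketch of that server-side check (again purely illustrative; parents
maps each commit the server knows to its parent ids):

    def confirm_haves(claimed, parents):
        # Iterate to a fixed point: dropping one commit as a false
        # positive can invalidate its descendants in turn.
        confirmed = set(claimed)
        changed = True
        while changed:
            changed = False
            for c in list(confirmed):
                if any(p not in confirmed for p in parents.get(c, ())):
                    confirmed.discard(c)
                    changed = True
        return confirmed

This only works if the client put *all* of its commits into the filter;
with the sparse schemes a missing parent is expected rather than
evidence of a false positive.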

Additional points:

* The client can decide what to include in the Bloom filter. For
example, if it has done a recent fetch from the server, it might want to
send only the remote-tracking branch tips. But if it has never fetched
from this server before, it might want to send all commits.

* A Bloom filter could be computed at repack time rather than at each
fetch. On fetch, the precomputed Bloom filters could be loaded and
merged, any loose objects added, and the result sent to the server
(see the sketch below).
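
The merge step is cheap if every per-pack filter is built with
identical parameters, because the union of two Bloom filters with the
same (m, k) is just the bitwise OR of their bit arrays. A sketch, using
the BloomFilter class from earlier:

    def merge_filters(pack_filters, loose_object_ids):
        merged = BloomFilter(pack_filters[0].m, pack_filters[0].k)
        for f in pack_filters:
            # Union is only valid when all filters share m and k.
            assert (f.m, f.k) == (merged.m, merged.k)
            for i, byte in enumerate(f.bits):
                merged.bits[i] |= byte
        for obj_id in loose_object_ids:
            merged.add(obj_id)
        return merged

The flip side is that fixing (m, k) at repack time also fixes the
false-positive rate as the repository grows.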

I don't have a gut feeling about the cost of this phase of the
negotiation, so I don't know whether this would be a net savings, let
alone one that is worth the added complexity. But I wanted to document
the idea in case somebody thinks it has promise. (I have no plans to
pursue it.)

Michael

[1] http://git-merge.com/videos/scaling-git-at-twitter-wilhelm-bierbaum.html

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
