Re: Bloom filters for have/want negotiation

Shawn Pearce <spearce@xxxxxxxxxxx> writes:

> The worst case is due to a bug in the negotiation. With nothing
> common, the client just goes on forever until it reaches roots
> (something is wrong with MAX_IN_VAIN). We saw 56,318 have lines ... a
> 2.6 MiB section. But smart HTTP gzips, so this may be only 1.3 MiB on
> the wire.

This one is interesting.

> But average/worst doesn't tell the entire picture. Bucketing requests
> by have lines gives us more:
>
>   %req | have_lines
>   -----|-----------
>   100%   56,318
>    99%       88
>    98%       52
>    97%       35
>    96%       31
>    95%       26

So is this.  From this observation, at least for your audience, the
expectation is that we would usually not need a very long
back-and-forth session to discover where the histories diverge.

But that is kind of expected.  The current protocol, in which
upload-pack speaks first and advertises all the tips it has, allows
the fetcher to omit what is known to be common very early and to
concentrate on sending "have"s from the branches it has worked on
since the last time it fetched.  The number of "have"s needed is
expected to be small in an "everybody meets at the same central
place and pushes into and fetches out of there" workflow, because
the amount of work done by a single fetcher since the last fetch
will by definition be a lot smaller than what happened at the
central meeting place.
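
A toy sketch of that effect (plain strings for commit ids and a
hypothetical parents mapping, nothing like the real pkt-line
protocol): because the server's advertised tips are known before the
client sends anything, the client's walk stops at them, and only the
commits made locally since the last fetch become "have" lines.

```python
def haves_to_send(client_tips, advertised, parents):
    """Walk the client's history, stopping at any commit the server
    already advertised; only the work done locally since the last
    fetch ends up as "have" lines."""
    haves, queue, seen = [], list(client_tips), set()
    while queue:
        c = queue.pop()
        if c in seen or c in advertised:
            continue  # known common -- no need to mention it
        seen.add(c)
        haves.append(c)
        queue.extend(parents.get(c, []))
    return haves

# Central-server workflow: the advertised tip "m9" covers almost
# everything, so only the two local commits are sent.
parents = {"c2": ["c1"], "c1": ["m9"], "m9": ["m8"]}
print(haves_to_send({"c2"}, {"m9"}, parents))  # ['c2', 'c1']
```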

I wonder how flipping the "who speaks first" would affect that
equation, though.

> Ergo, if this is all working correctly on smart HTTP, clients can
> fetch from a server they already "know" with decent efficiency, and
> smaller than your 2 KiB Bloom filter estimate for git.git at 1% error
> rate.

Isn't this part a bit of an apples-and-oranges comparison?  If I
understand the motivation behind Michael's looking into Bloom
filters and other techniques correctly, it is to find a way to
shrink the initial advertisement from the sender.  Your analysis is
about the number of "have" lines, which is an orthogonal issue.
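
For context, the 2 KiB figure mentioned above is consistent with the
standard Bloom filter sizing formula, m = -n ln p / (ln 2)^2 bits.
The ref count of roughly 1,700 below is my guess for git.git, not a
number from the message:

```python
import math

def bloom_bits(n, p):
    """Classic Bloom filter sizing: m = -n ln p / (ln 2)^2 bits,
    with k = (m/n) ln 2 hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

# Assuming ~1,700 advertised refs, a 1% false-positive rate needs
# about 2 KiB of filter and 7 hash functions.
m, k = bloom_bits(1700, 0.01)
print(m / 8 / 1024, k)
```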

> I do wonder if we are stuffing the pipe deep enough with multi_ack on
> Internet links. Git doesn't need very long to walk 16 commits,
> certainly less than the 200 ms RTT a user might have talking to a
> server on the other side of the US. It is possible both sides are
> spending most of their time waiting for data transfer of the batches
> in flight.

Yes, this is a very useful insight.  An experiment or two with a
larger amount of data in flight may be an interesting thing to try.
By the way, do we still have the deadlock possibility caused by our
implicit reliance on a single batch fitting in a pipe buffer, or
have we addressed that issue already?
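
A back-of-envelope sketch of the pipe-depth question: with N "have"
lines sent in batches of B per round trip, the RTT cost alone is
ceil(N / B) * rtt.  The batch size of 16 and the 200 ms RTT are from
the message above; 88 haves is the 99th percentile from the table.

```python
import math

def rtt_cost_ms(n_haves, batch, rtt_ms):
    """Round-trip latency cost of sending n_haves in fixed batches."""
    return math.ceil(n_haves / batch) * rtt_ms

print(rtt_cost_ms(88, 16, 200))  # 6 round trips -> 1200 ms
print(rtt_cost_ms(88, 64, 200))  # deeper batches -> 400 ms
```

Even in the 99th-percentile case, latency dominates only for a
second or so with batches of 16, which fits the observation that
both sides may mostly be waiting on data transfer.
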