Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 22 May 2012 16:23:17 -0700

Martin Fick <mfick@xxxxxxxxxxxxxx> writes:

> Android consists of approximately 400 projects, and if 
> anyone tags their builds regularly, that means that they 
> will be tagging 400 projects per build.  We have something 
> like 10K tags on average across 400 projects, so when we do 
> a simple No Op sync just to see if all 400 projects are up 
> to date, this yields about 200MB of data over the wire (4M 
> refs)!!!
> ...
> As you can imagine, we really would like to see this 
> eliminated from our workflows.  If we want to check 400 
> projects to see if they are up to date, it should be 400 
> refs, not 400 * 10K.

I think we can ignore that 400 part from the above for now.  As they are
independent "projects", it is unreasonable to expect that any solution
would add costs less than linearly when you add more projects.  But it
should be possible to make the cost of update discovery comparable between
the cases where you have 10K existing refs that both ends have seen and
one ref got updated vs you started from 20K common refs.  The latter does
not have to cost twice (or 4 times if an inefficient way is quadratic).

I think the assumption in your build-bot scenario needs to be a bit more
clarified to give context.  I am assuming, when you say "see if projects
are up to date", you mean your build-bot is interested in a single branch
in each repository, possibly together with new tags that have become
reachable from the updated tip of the branch (if the branch tip moved).  I
also assume that the build-bot polls very often and that is why 200MB (or
divided by 400 it would be 0.5MB, which still is a lot when you talk about
a single repository) hurts a lot.

Everything below is what I think you already know about; I am giving an
overview of _my_ understanding of the issue for the benefit of others (and
have you sanity check if I misunderstood your pain points).

Even an attempt to optimize "can we skip updating" by asking "ls-remote"
about a single branch is futile with the current protocol, as we expect
the server to respond with the full ref advertisement and filter out the
result on our end.  In the upload-pack protocol, the serving side talks
first and that is "here are all the refs I have and their values".  The
other side that asks for objects does not have a way to tell it that it is
only interested in a subset, even if it wanted to.  Unlike the capability
exchange that happens after the initial connection, there is no gap in
the protocol for the side that initiates the connection to tell the
serving side to skip/delay the initial ref advertisement.

What the above means is that the transition has to happen by keeping the
current upload-pack and a new protocol has to be invoked by a different
program ("upload-pack-2" or something needs to be taught to the git-daemon
and also invoked by ssh) even if an updated protocol is designed. The
updated connection initiator (i.e. ls-remote and fetch) may be able to try
upload-pack-2 first and fall back to upload-pack after getting a failure,
but afaik nobody figured out if this is doable by checking if a failure
signal that comes from currently deployed software is detailed enough for
our side to tell if the failure is because the other end does not support
upload-pack-2 or because of some other failure (e.g. auth failed, no such
repo, etc.).  Regardless of auto-fallback, an updated connection initiator
needs a configuration switch to be explicitly told which protocol to talk
to with which remote.

What exact protocol "upload-pack-2" talks may be an interesting subject on
its own.  It may still have the serving side talk first (probably show its
capabilities and tips of branches if there are not too many of them), or
it may instead let the connection initiator talk first and ask for
specific set of refs.  But that is of lessor importance from the
high-level point of view and I won't go into the details in this thread.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html