Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)

On Tuesday, May 22, 2012 05:23:17 pm Junio C Hamano wrote:
> Martin Fick <mfick@xxxxxxxxxxxxxx> writes:
>
> I think we can ignore that 400 part from the above for
> now.  As they are independent "projects", it is
> unreasonable to expect that any solution would add costs
> less than linearly when you add more projects.  

As far as git is concerned, sure, they are independent, but 
the projects aren't really independent.  Separate repos were 
likely chosen for scalability reasons in the first place, 
but this ends up being a "scale fail" pattern.  I don't 
expect git to fix this * 400 issue, but I want to highlight 
how difficult it is to make things scale in these cases, and 
that this problem is a natural multiplier when you split 
something into separate git repos, because your tags now 
tend to be multiplied.  So while 10K tags sounds like a lot 
but manageable for 1 repo, it ends up actually being 4M tags 
if you split your repos.
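To put rough numbers on that multiplier (the 10K tags and 
400 repos are the figures from this thread; ~60 bytes per 
advertised ref line is just an assumed average):

```shell
# Back-of-the-envelope for the multiplier described above.
# tags_per_repo and repos come from this thread; bytes_per_ref
# is an assumed average size of one advertisement line.
tags_per_repo=10000
repos=400
bytes_per_ref=60
echo "$(( tags_per_repo * repos )) total tags advertised"
echo "$(( tags_per_repo * repos * bytes_per_ref / 1024 / 1024 )) MiB per full poll"
```

which lands in the same ballpark as the 200MB figure quoted 
below.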


...
> I think the assumption in your build-bot scenario needs
> to be a bit more clarified to give context.  I am
> assuming, when you say "see if projects are up to date",
> you mean your build-bot is interested in a single branch
> in each repository, possibly together with new tags that
> have become reachable from the updated tip of the branch
> (if the branch tip moved). 

Just the branch generally.


> I also assume that the
> build-bot polls very often and that is why 200MB (or
> divided by 400 it would be 0.5MB, which still is a lot
> when you talk about a single repository) hurts a lot.

To clarify for others (since you likely know), we serialize 
our submits, so the "poll" happens before every build to 
ensure that we are up to date with what just got merged.  In 
our case we use Gerrit with slaves (read-only Gerrit 
copies), so we offload the "update" to the slave, which 
incurs the same hit, although it is not quite a no-op but 
close (a batch of commits will have been added to a few 
projects).  But since the slave may be slightly behind the 
master (replication from the master to the slave takes time, 
partly because of the same ref advertisement problem), we 
also want to verify the slave update by polling the master, 
which would be a no-op if replication were fast enough.  So 
our attempt to scale by using slaves does not help very 
much, since the real update and the no-op are almost 
identical in their impact...  There are workarounds, but 
they are mostly hacks.
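The reason the no-op poll costs nearly as much as a real 
update is that the server sends its full ref advertisement 
on every connection, before the client says what it wants. 
A self-contained demo (throwaway repo under mktemp; the repo 
name and tag count are illustrative):

```shell
# Every connection, even a no-op poll, receives the full ref
# advertisement, so cost scales with ref count, not with what changed.
set -e
dir=$(mktemp -d)
git init -q "$dir/server"
git -C "$dir/server" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m init
for i in $(seq 1 100); do git -C "$dir/server" tag "v$i"; done
# ls-remote prints exactly what the server advertises on connect:
git ls-remote "file://$dir/server" | wc -l   # 102: HEAD + branch + 100 tags
```

Running a fetch against such a remote with GIT_TRACE_PACKET=1 
set shows the same advertisement arriving even when there is 
nothing new to fetch.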


...
> Unlike the capability exchange that happens after the
> initial connection, there is no gap in the protocol for
> the side that initiates the connection to tell the
> serving side to skip/delay the initial ref
> advertisement.

Crazy stupid suggestion: could the URI somehow be exploited 
to signal the option to disable advertisements?  What if the 
path part of the URI started with something like several 
slashes?  Older servers just strip the extra slashes, right?  
Newer servers could disable advertisements if they see the 
slashes.  Of course, this gets ugly if scripts add slashes 
and expect them to be stripped (old clients talking to new 
servers would probably then fail).
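A minimal sketch of that signaling idea, purely hypothetical 
and not anything git actually does: the server would treat 
extra leading slashes as "skip the advertisement" and then 
normalize the path as it does today (the path value here is 
illustrative):

```shell
# Hypothetical server-side path handling (a sketch of the suggestion
# above, NOT real git behavior): extra leading slashes signal a new
# server to suppress the initial ref advertisement; after stripping
# them, old and new servers see the same repository path.
path="//project.git"                # illustrative request path
case "$path" in
  //*) skip_adv=1 ;;                # doubled slash: client opted out
  *)   skip_adv=0 ;;                # plain path: advertise as usual
esac
normalized=$(printf '%s' "$path" | sed 's|^//*|/|')
echo "$normalized skip_adv=$skip_adv"   # /project.git skip_adv=1
```

An old server running only the normalization step would be 
unaffected, which is the backward-compatibility property the 
suggestion relies on.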

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

