Re: Git Scaling: What factors most affect Git performance for a large repo?

David Turner <dturner@xxxxxxxxxxxxxxxx> · Fri, 20 Feb 2015 13:29:12 -0500

On Thu, 2015-02-19 at 23:57 -0700, Martin Fick wrote:
> On Feb 19, 2015 5:42 PM, David Turner <dturner@xxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: 
> > > >    * 'git push'? 
> > > 
> > > This one is not affected by how deep your repo's history is, or how
> > > wide your tree is, so it should be quick.
> > > 
> > > Ah the number of refs may affect both git-push and git-pull. I think 
> > > Stefan knows better than I in this area. 
> >
> > I can tell you that this is a bit of a problem for us at Twitter.  We 
> > have over 100k refs, which adds ~20MiB of downstream traffic to every 
> > push. 
> >
> > I added a hack to improve this locally inside Twitter: The client sends 
> > a bloom filter of shas that it believes that the server knows about; the 
> > server sends only the sha of master and any refs that are not in the 
> > bloom filter.  The client uses its local version of the server's refs
> > as if they had just been sent.  This means that some packs will be 
> > suboptimal, due to false positives in the bloom filter leading some new 
> > refs to not be sent.  Also, if there were a repack between the pull and 
> > the push, some refs might have been deleted on the server; we repack 
> > rarely enough and pull frequently enough that this is hopefully not an 
> > issue. 
> >
> > We're still testing to see if this works.  But due to the number of 
> > assumptions it makes, it's probably not that great an idea for general 
> > use. 
> 
> Good to hear that others are starting to experiment with solutions to this problem!  I hope to hear more updates on this.
> 
> I have a prototype of a simpler and, I believe, more robust solution, though it is aimed at a smaller use case.  On connecting, the client sends a refspec together with a sha computed over all of its refs/shas that match that refspec, i.e. the refs for which it believes the server might have the same values.  The server then calculates the same sha over its own refs/shas that match the refspec.  If the "verification" sha matches, the server omits those refs and instead sends only a confirmation that they matched (along with any refs outside of the refspec).  On a match, the client can inject its local values for the refs that matched the refspec and be guaranteed that they match the server's values.
> 
> This optimization is aimed at the worst-case scenario (and is thus potentially the best case for "compression"), when the client and server match for all refs (a refs/* refspec).  This is something that happens often on Gerrit server startup, when it verifies that its mirrors are up-to-date.  One reason I chose this as a starting optimization is that I think it is one use case which will not actually benefit from "fixing" the git protocol to only send relevant refs, since all the refs are in fact relevant here!  So something like this will likely be needed in any future git protocol in order for it to be efficient for this use case.  And I believe this use case is likely to stick around.
> 
> With a minor tweak, this optimization should also work when replicating actual expected updates: exclude the refs that are expected to update from the verification, so that the server always sends their values, since they will likely not match and would otherwise wreck the optimization.  However, for this use case it is not clear whether it is even worth caring about the non-updating refs.  In theory, knowledge of the non-updating refs can reduce the amount of data transmitted, but I suspect that as the ref count increases, this has diminishing returns and mostly ends up chewing up CPU and memory in a vain attempt to reduce network traffic.
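
To check my reading of the verification-sha idea, here is a rough Python sketch.  It is not Martin's prototype: the function names, the use of SHA-1 over sorted "name sha" lines, and the use of fnmatch-style globs in place of real refspec matching are all assumptions made for illustration.

```python
import hashlib
from fnmatch import fnmatchcase


def refs_digest(refs, pattern):
    """Hash every (name, sha) pair whose name matches `pattern`.

    Names are sorted so both sides hash the refs in the same order.
    `pattern` is a plain glob standing in for a refspec (assumption).
    """
    h = hashlib.sha1()
    for name in sorted(refs):
        if fnmatchcase(name, pattern):
            h.update(f"{name} {refs[name]}\n".encode())
    return h.hexdigest()


def advertise(server_refs, pattern, client_digest):
    """Hypothetical server side: return (matched, refs_to_send).

    On a match, only refs outside the pattern are sent; otherwise the
    server falls back to the full advertisement.
    """
    if refs_digest(server_refs, pattern) == client_digest:
        rest = {n: s for n, s in server_refs.items()
                if not fnmatchcase(n, pattern)}
        return True, rest
    return False, dict(server_refs)
```

The client would call refs_digest() on its cached copy of the server's refs, send the pattern and digest, and on a match reuse its cached values for everything under the pattern.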

For a more general solution, perhaps a log of ref updates could be used.
Every time a ref is updated on the server, that ref would be written
into an append-only log.  Every time a client pulls, their pull data
includes an index into that log.  Then on push, the client could say, "I
have refs as-of $index", and the server could read the log (or do
something more-optimized) and send only refs updated since that index.
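
A minimal Python sketch of what I have in mind, purely as an illustration: the RefLog class, its method names, and the in-memory list are hypothetical, and a real server would need durable storage plus some way to truncate or compact the log.

```python
from dataclasses import dataclass, field


@dataclass
class RefLog:
    """Hypothetical server-side append-only log of ref updates."""
    entries: list = field(default_factory=list)  # (ref name, new sha or None)

    def record(self, ref, sha):
        """Append one update (sha=None marks a deletion); return the new index."""
        self.entries.append((ref, sha))
        return len(self.entries)

    def since(self, index):
        """Return the latest value of every ref updated after `index`,
        plus the index the client should remember for next time."""
        changed = {}
        for ref, sha in self.entries[index:]:
            changed[ref] = sha  # later entries override earlier ones
        return changed, len(self.entries)


def apply_updates(local_refs, changed):
    """Client side: merge the server's delta into the cached server refs."""
    for ref, sha in changed.items():
        if sha is None:
            local_refs.pop(ref, None)  # ref was deleted on the server
        else:
            local_refs[ref] = sha
    return local_refs
```

On pull, the client would store the index returned by since(); on push, it would send that index and apply only the returned delta, so the cost is proportional to the number of refs changed rather than the total ref count.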
