Re: Git performance results on a large repository

> Sam Vilain: Thanks for the pointer, I didn't realize that
> fast-import was bi-directional.  I used it for generating the
> synthetic repo.  Will look into using it the other way around.
> Though that still won't speed up things like git-blame,
> presumably?

It could, because blame is an operation which works primarily on
the source history, with little reference to the working copy.  Of
course this will depend on the quality of the server-side
implementation.  Blame should suit distribution over a cluster, as
it is mostly a matter of scanning candidate revisions for string
matches, which is the compute-intensive part.  Coming up with
candidate revisions has its own cost and could probably also be
distributed, but just parallelizing that lowest loop level might
be a good place to start.
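
As a rough illustration of that lowest-loop-level idea, here is a
minimal sketch that shards a blame of one file by line range using
"git blame -L" and runs the shards in worker processes.  The file
path and chunk size are placeholders, and local processes stand in
for what would really be remote cluster workers:

    # Sketch: shard `git blame` on one file by line range, one
    # shard per worker process.  Path and chunk size are illustrative.
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def count_lines(path):
        with open(path, "rb") as f:
            return sum(1 for _ in f)

    def blame_range(args):
        path, start, end = args
        # -L restricts blame to a line range, so each worker only
        # scans candidate revisions for its own shard of the file.
        out = subprocess.run(
            ["git", "blame", "-L", "%d,%d" % (start, end), "--", path],
            capture_output=True, text=True, check=True)
        return out.stdout

    def parallel_blame(path, chunk=500, workers=4):
        total = count_lines(path)
        ranges = [(path, s, min(s + chunk - 1, total))
                  for s in range(1, total + 1, chunk)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return "".join(pool.map(blame_range, ranges))

    if __name__ == "__main__":
        print(parallel_blame("Makefile"))   # placeholder path

The same sharding would apply with remote workers; only the
transport around blame_range() would change.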

What it doesn't help with is local filesystem operations.  For
those I think a different approach is required: if you can tie
into fam or a similar inode change notification system, you
should be able to avoid the entire recursive stat on 'git
status'.  I'm not sure --assume-unchanged on its own is a good
idea; you could easily miss things.  Those stats are useful.
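
As a sketch of the notification idea (this assumes the third-party
"watchdog" package as a stand-in for fam/inotify; the repo path is
a placeholder), the working tree is watched once and only the
paths the watcher has flagged get re-examined, instead of the full
recursive stat:

    # Sketch: track dirty paths from an inotify/FAM-style watcher
    # so a status query only needs to look at those paths.
    import subprocess
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class DirtyPaths(FileSystemEventHandler):
        def __init__(self):
            self.paths = set()

        def on_any_event(self, event):
            if not event.is_directory:
                self.paths.add(event.src_path)

    def watch(repo_root):
        handler = DirtyPaths()
        observer = Observer()
        observer.schedule(handler, repo_root, recursive=True)
        observer.start()
        return handler, observer

    def quick_status(handler):
        # Only the flagged paths are handed to git; everything else
        # is assumed unchanged since the last check.
        if not handler.paths:
            return ""
        out = subprocess.run(
            ["git", "status", "--porcelain", "--"] + sorted(handler.paths),
            capture_output=True, text=True)
        handler.paths.clear()
        return out.stdout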

Making the index able to hold just the changes relative to the
checked-out tree, as others have mentioned, would also save the
massive reads and writes you've identified.  Perhaps a
higher-performance index back-end could be developed.
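
To make that concrete, here is a minimal sketch of an index split
into an immutable base (the checked-out tree) and a small overlay
of changed entries, so each update only rewrites the overlay.  The
names and structure are purely illustrative, not a proposal for
the on-disk format:

    # Sketch: base-plus-overlay index.  Real index entries also carry
    # stat data, modes and flags; this only shows the lookup/write split.
    class OverlayIndex:
        def __init__(self, base_entries):
            self.base = base_entries   # path -> blob id, written once
            self.overlay = {}          # path -> blob id, or None if deleted

        def stage(self, path, blob_id):
            self.overlay[path] = blob_id

        def remove(self, path):
            self.overlay[path] = None

        def lookup(self, path):
            if path in self.overlay:
                return self.overlay[path]
            return self.base.get(path)

        def write_out(self):
            # Only the overlay is serialized on each update, instead
            # of rewriting the entire multi-million-entry index.
            return self.overlay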

> The sparse-checkout issue you mention is a good one.

It's actually been on the table since at least GitTogether 2008;
there's been some design discussion on it and I think it's just
one of those features which doesn't have enough demand yet for it
to be built.  It keeps coming up but not from anyone with the
inclination or resources to make it happen.  There is a protocol
issue, but this should be able to fit into the current extension
system.

> There is a good question of how to support quick checkout,
> branch switching, clone, push and so forth.

Sure.  It will be much more network intensive, as you are
replacing a path which normally has a very fast link through the
buffer cache to the pack files etc.  A hybrid approach is also
possible, where objects are fetched individually via fast-import
and cached in a local .git repo.  And I have a hunch that LZOP
compression of the stream may also be a win, but as with all of
these ideas, it should be done only after profiling identifies a
choke point, rather than just because it sounds good.
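
A minimal sketch of that hybrid caching, where the local check and
store use real plumbing (cat-file, hash-object) but
fetch_object_from_server() is a hypothetical stand-in for whatever
the remote side ends up looking like:

    # Sketch: fetch objects lazily and cache them in the local .git
    # object store.  fetch_object_from_server() is a placeholder; only
    # the local plumbing is real git, and only blobs are handled here.
    import subprocess

    def have_locally(sha):
        # `git cat-file -e` exits 0 if the object exists locally.
        return subprocess.run(["git", "cat-file", "-e", sha]).returncode == 0

    def fetch_object_from_server(sha):
        raise NotImplementedError("placeholder for the remote protocol")

    def get_object(sha):
        if not have_locally(sha):
            data = fetch_object_from_server(sha)
            # Store the fetched blob so later reads hit the local cache.
            subprocess.run(["git", "hash-object", "-w", "--stdin"],
                           input=data, check=True)
        return subprocess.run(["git", "cat-file", "-p", sha],
                              capture_output=True, check=True).stdout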

> I'll look into the approaches you suggest.  One consideration
> is coming up with a high-leverage approach - i.e. not doing
> heavy dev work if we can avoid it.

Right.  You don't actually need to port the whole of git to Hadoop
initially; to begin with it can just pass every command through to
a server-side git fast-import process.  When you find specific
operations which are slow, those operations can be implemented
using a Hadoop back-end, with everything else falling back to
standard git.  If done via a clean plug-in system, this could even
be accepted by the core project as an enterprise scaling option.
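
A rough sketch of that pass-through shape (all names here are
hypothetical, hadoop_status() especially): a thin front end routes
the handful of commands profiling has shown to be slow to the
scalable back-end and hands everything else to stock git
unchanged:

    # Sketch: dispatch a few slow commands to a scalable back-end and
    # pass everything else straight through to the standard git binary.
    # hadoop_status() is a hypothetical placeholder.
    import subprocess
    import sys

    def hadoop_status(args):
        raise NotImplementedError("would query the server-side back-end")

    OVERRIDES = {
        "status": hadoop_status,   # only the commands profiled as slow
    }

    def main(argv):
        if argv and argv[0] in OVERRIDES:
            return OVERRIDES[argv[0]](argv[1:])
        # Default: untouched pass-through to stock git.
        return subprocess.call(["git"] + argv)

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1:]) or 0)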

This could let you get going now, knowing that the scaling option
is there should the need arise.

> On the other hand, it would be nice if we (including the entire
> community :)) improved git in areas that others who share
> similar issues could benefit from as well.

Like I say, a lot of people have run into this already...

HTH,
Sam
--