Saving space/network on common repos

Craig Silverstein <csilvers@xxxxxxxxxxxxxxx> · Tue, 16 Dec 2014 22:58:05 -0800

At Khan Academy, we are running a Jenkins installation as our build
server.  By design, our Jenkins machine has several different
directories that each hold a copy of the same git repository.  (For
instance, Jenkins may be running tests on our repo at several
different commits at the same time.)  When Jenkins decides to run a
test -- I'm simplifying a bit -- it will pick one of the copies of the
repo, do a 'git fetch origin && git checkout <some commit>' and then
run the tests.

Our repo has a lot of churn and some big files, and this git fetch can
take a long time. I'd like to reduce both the time to fetch and the
disk space used by sharing objects between the repo copies.

My research has turned up three techniques that try to address this use case:
* git clone --reference
* git clone --shared
* git clone <local repo>, which creates hard links
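
To make the comparison concrete, here is a minimal sketch of the three
variants side by side.  Every path, the 'cache.git' name, and the demo
identity are made up for illustration; the toy upstream stands in for
our real repository.

```shell
#!/bin/sh
set -eu
# All paths below are hypothetical scratch locations.
work=$(mktemp -d)

# A tiny stand-in for the real upstream repository.
git init -q "$work/upstream"
echo hello > "$work/upstream/file.txt"
git -C "$work/upstream" add file.txt
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# A bare 'cache' repo that exists only to lend out its objects.
git clone -q --bare "$work/upstream" "$work/cache.git"

# 1) --reference: clone from upstream, borrowing objects from the cache.
git clone -q --reference "$work/cache.git" "$work/upstream" "$work/ref-copy"

# 2) --shared: clone the cache itself without copying its objects.
git clone -q --shared "$work/cache.git" "$work/shared-copy"

# 3) Plain local clone: object files are hard-linked where the
#    filesystem allows it.
git clone -q "$work/cache.git" "$work/local-copy"

# Variants 1 and 2 record the borrowed object store in this file.
cat "$work/ref-copy/.git/objects/info/alternates"
cat "$work/shared-copy/.git/objects/info/alternates"
```

In this sketch the first two copies end up with an
objects/info/alternates file pointing at the cache, while the third has
its own object directory.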

I can probably use any of these approaches, but git clone --reference
would be the easiest to set up.  I would do so by creating a 'cache'
repo that is just created to serve as a reference and not used in any
other way, so I wouldn't have to worry about the dangers with pruning,
accidentally deleting the repo, etc.

My big concern is that all these methods seem to just affect clone.  So:

Question 1) If I do 'git clone --reference', will the reference repo be
used for subsequent fetches as well?  What about 'git clone --shared'?

Question 2) If I git clone a local repo, will subsequent fetches also
create hard links?

Question 3) If the answer to any of the above is yes, how does this
work with packing?  Say I pack the reference repo (being careful not
to prune anything).  Will subsequent fetches still be able to get the
objects they need from the reference repo?
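
One way to probe Questions 1 and 3 empirically is a small experiment
like the sketch below.  It does not assume an answer; it only sets up
the situation and prints what git reports.  The paths, the 'add_commit'
helper, and the demo identity are hypothetical.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)   # hypothetical scratch directory

# Hypothetical helper: append a line to upstream's file and commit it.
add_commit() {
    echo "$1" >> "$work/upstream/file.txt"
    git -C "$work/upstream" add file.txt
    git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "$1"
}

git init -q "$work/upstream"
add_commit first

# The cache mirrors upstream; the copy borrows objects from the cache.
git clone -q --mirror "$work/upstream" "$work/cache.git"
git clone -q --reference "$work/cache.git" "$work/upstream" "$work/copy"

# New upstream work arrives after the clone.
add_commit second
tip=$(git -C "$work/upstream" rev-parse HEAD)

# Update the cache first, then repack it into a single pack.
git -C "$work/cache.git" fetch -q origin
git -C "$work/cache.git" repack -q -a -d

# Fetch in the borrowing copy, then inspect its local object counts
# and verify that every object it needs is still reachable.
git -C "$work/copy" fetch -q origin
git -C "$work/copy" count-objects -v
git -C "$work/copy" fsck --no-dangling
```

Comparing the 'count-objects -v' output before and after the fetch in
the copy would show how many objects, if any, were stored locally.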

An added complication is submodules.  We have a submodule that is as
big and slow to fetch as our main repository.

Question 4) Is there a practical way to set up submodules so they can
use the same object-sharing framework that the main repo does?

I'm not keen on rewriting .gitmodules in each of my repos, so probably
something that uses info/alternates is the most workable.  I have a
scheme for setting that up that maybe will work, but it's a moot point
if info/alternates doesn't work for fetching.
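
The info/alternates wiring I have in mind looks roughly like the sketch
below, shown on an ordinary clone with made-up paths.  For a submodule,
the file would presumably live under
.git/modules/<submodule>/objects/info/ instead; that part is an
assumption and is not exercised here.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)   # hypothetical scratch directory

# Toy upstream plus a bare cache of it (names are illustrative).
git init -q "$work/upstream"
echo hello > "$work/upstream/file.txt"
git -C "$work/upstream" add file.txt
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"
git clone -q --bare "$work/upstream" "$work/cache.git"

# An ordinary clone that starts out with no alternates.
git clone -q "$work/upstream" "$work/copy"

# Point the existing clone at the cache's object directory by hand.
# The entry must name the 'objects' directory, one path per line.
mkdir -p "$work/copy/.git/objects/info"
echo "$work/cache.git/objects" >> "$work/copy/.git/objects/info/alternates"

# git should now list the alternate and still pass a consistency check.
git -C "$work/copy" count-objects -v
git -C "$work/copy" fsck --no-dangling
```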

I'm wondering if the best approach for us might be to use
GIT_OBJECT_DIRECTORY: set GIT_OBJECT_DIRECTORY to the shared cache
directory for each of our repos, so they all fetch to the same place.

Question 5) I haven't seen this mentioned anywhere else, so I'm
guessing it won't work.  Am I missing a big problem?
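
Concretely, the GIT_OBJECT_DIRECTORY setup I am picturing is something
like this sketch.  The directory names are hypothetical, and it assumes
the variable is exported for every git command run in those repos.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)   # hypothetical scratch directory

# Toy upstream, created before the variable is set so that it keeps
# its own object directory.
git init -q "$work/upstream"
echo hello > "$work/upstream/file.txt"
git -C "$work/upstream" add file.txt
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"
tip=$(git -C "$work/upstream" rev-parse HEAD)

# One object directory shared by every working repo.
mkdir -p "$work/shared-objects"
export GIT_OBJECT_DIRECTORY="$work/shared-objects"

# Two repos that keep separate refs but store objects in the same place.
for name in a b; do
    git init -q "$work/$name"
    git -C "$work/$name" remote add origin "$work/upstream"
    git -C "$work/$name" fetch -q origin
done

# The fetched commit is visible from both repos.
git -C "$work/a" cat-file -t "$tip"
git -C "$work/b" cat-file -t "$tip"
```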

Question 6) Will git be sad if two different repos that share an
object directory both do 'git fetch' at the same time?  I could maybe
protect fetches with flock, but Jenkins can do git operations
behind my back so it would be easier if I didn't have to worry about
locking.
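
The locking I would otherwise add is along the lines of this sketch.
It assumes the util-linux 'flock' command is installed; the lock file
path, repo names, and 'add_commit' helper are hypothetical.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)   # hypothetical scratch directory

# Hypothetical helper: append a line to upstream's file and commit it.
add_commit() {
    echo "$1" >> "$work/upstream/file.txt"
    git -C "$work/upstream" add file.txt
    git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "$1"
}

git init -q "$work/upstream"
add_commit first
git clone -q --bare "$work/upstream" "$work/cache.git"

# Two working copies that borrow objects from the same cache.
for name in a b; do
    git clone -q --reference "$work/cache.git" "$work/upstream" "$work/$name"
done

add_commit second
tip=$(git -C "$work/upstream" rev-parse HEAD)

# Start both fetches at once; flock makes them take turns by holding
# an exclusive lock on one shared lock file for the duration of each.
for name in a b; do
    flock "$work/fetch.lock" git -C "$work/$name" fetch -q origin &
done
wait
```

This only serializes fetches that go through the wrapper, which is
exactly the weakness described above: anything that calls git directly
bypasses the lock.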

Question 7) Is GIT_OBJECT_DIRECTORY supposed to work with submodules?
In my experimentation, it looks like it doesn't: when I run
'GIT_OBJECT_DIRECTORY=../obj git submodule update --init' it still
puts the objects in .git/modules/<submodule>/objects/.  Is this a bug?
Is there any way to work around it?

Any suggestions would be appreciated!  It feels to me like this is
something that git should support pretty easily given its
architecture, but I just don't see a way to do it.

Thanks,
craig