Idea: global git object cache

Hi all,

One thing I find curious about git is how objects mostly aren't shared
between multiple repositories on the local system.  For example, if I
do:

   git clone git://git.kernel.org/pub/scm/git/git.git  git1
   git clone git://git.kernel.org/pub/scm/git/git.git  git2

Then I end up downloading the same objects from kernel.org *twice*.
If I use --reference on the second clone, then I can avoid
re-downloading all the objects, and it's much faster.
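For example, the second clone above could borrow objects from the first
like this:

   git clone --reference git1 git://git.kernel.org/pub/scm/git/git.git  git2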

Unfortunately, I have to provide that option by hand, which is a
problem for git-submodule: it clones someone else's repository
automatically and has no way to guess a value for --reference.
Another thing I commonly do with submodules is rm -rf the submodule's
files, eg. because I change branches and git doesn't clean up the
submodule directory automatically.  But when I later switch back to
the branch with the submodule, git wants to re-download the submodule
all over *again*.  Redoing the checkout makes sense to me (just as
git deletes/recreates files when I normally switch branches), but
re-downloading seems silly.

So here's my suggestion to minimize downloads in a pretty easy way:

- whenever git creates a packfile in any repo (eg. during git gc or
git fetch), make an *extra* hardlink of it into
~/.gitcache/objects/pack.

- whenever git is considering which objects it does or doesn't
currently have, also consider the packs in ~/.gitcache/objects/pack
(ie. via the .git/objects/info/alternates mechanism).  If one of
those packs qualifies, hardlink it into the current repo, maybe with
a .keep file to indicate that repacking it would be counterproductive.
(See the shell sketch after this list.)

- after git deletes a packfile in any repo (eg. during git gc), check
the link count of that pack in ~/.gitcache/objects/pack; if it's now
down to just 1, there are no other users of the pack, so delete it
there too.  You would also need to prune the cachedir occasionally to
deal with repositories that were deleted in other ways (eg. rm -rf).

- share the list of refs in a similar way (keeping in mind, of course,
that multiple repos will usually have different refs all named
"refs/heads/master") so that fetches can be efficient.

- extra improvement to submodule behaviour: hardlink packs from the
submodule into the supermodule's objects/pack directory (or use a
different directory like .git/submodules/pack to keep things
separate).  Also, submodules should use the superproject's pack
directory as an alternate, in case (as often happens for me) the
supermodule already contains a bunch of objects from the submodule,
because the modules were split at some point.
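
To make the mechanics concrete, here's a rough shell sketch of those
steps (pack names and paths are just placeholders, and the pruning
step assumes GNU find; a real implementation would live inside git
itself, not in shell):

   # one-time setup: point this repo's alternates at the shared cache
   mkdir -p ~/.gitcache/objects/pack
   echo "$HOME/.gitcache/objects" >> .git/objects/info/alternates

   # after creating a pack, hardlink it (and its index) into the cache
   ln .git/objects/pack/pack-1234.pack ~/.gitcache/objects/pack/
   ln .git/objects/pack/pack-1234.idx  ~/.gitcache/objects/pack/

   # when a cached pack qualifies, hardlink it in and mark it .keep
   ln ~/.gitcache/objects/pack/pack-5678.pack .git/objects/pack/
   ln ~/.gitcache/objects/pack/pack-5678.idx  .git/objects/pack/
   touch .git/objects/pack/pack-5678.keep

   # after deleting packs (eg. in git gc), drop cached packs that no
   # other repo still links to
   find ~/.gitcache/objects/pack -links 1 \
        \( -name '*.pack' -o -name '*.idx' \) -delete

   # and for the last bullet: a submodule could likewise list the
   # superproject's object store in its own alternates file
   echo /path/to/super/.git/objects >> sub/.git/objects/info/alternates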

I believe this would be quite easy to implement and would give an
immediate efficiency improvement.  The ~/.gitcache feature could be
enabled/disabled by a config option.  Is there any reason not to do
it?
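
If it helps to picture it, the knob might look something like this
(the option name here is made up; no such setting exists in git today):

   git config --global core.objectCacheDir ~/.gitcache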

Thanks,

Avery

P.S. I've been testing git's behaviour with lots of very large packs -
I'm currently using about 58 packs of about 1 GB each - as part of my
'bup' git-based backup tool (http://apenwarr.ca/log/?m=201001#04).
Repacking and fsck are obviously horrendously slow with that much
data, but bup avoids those operations as much as possible, and a
~/.gitcache wouldn't need to worry about them either (since each repo
is still responsible for repacking its own packs).  Overall
performance for other git operations seems to be fine, though.  And
searching the cache as a last resort can be optimized by always
searching packs in MRU order, in case git doesn't already do this.