Re: git gc expanding packed data?

Nicolas Pitre <nico@xxxxxxx> · Wed, 05 Aug 2009 18:39:55 -0400 (EDT)

On Tue, 4 Aug 2009, Hin-Tak Leung wrote:

> I cloned gcc's git about a week ago to work on some problems I have
> with gcc on minor platforms, just plain 'git clone
> git://gcc.gnu.org/git/gcc.git gcc', and ran git fetch about daily, and
> 'git rebase origin' from time to time. I don't have local changes,
> just following and monitoring what's going on in gcc. So after a week,
> I thought I'd do a git gc. Then it went very bizarre.
> 
> Before I started 'git gc', the whole of .git was about 700MB and
> .git/objects/pack was a bit under 600MB, with a few other directories
> under .git/objects at 10's of K's and a few 30000-40000K's, and the
> checkout was, well, the size of gcc source code. But after I started
> git gc, the message stayed at 'counting objects' at about 900,000
> for a long time, while a lot of directories under .git/objects/ got a
> bit large, and .git blew up to at least 7GB with a lot of small files
> under .git/objects/*/. Seeing that I would run out of disk space,
> I killed the whole lot and ran git clone again, since I don't have any
> local changes and there is nothing to lose.
> 
> I am running git version 1.6.2.5 (fedora 11). Is there any reason why
> 'git gc' does that?

There is probably a reason, although a bad one for sure.

Well... OK.

It appears that the git installation serving clone requests for 
git://gcc.gnu.org/git/gcc.git generates lots of unreferenced objects. I 
just cloned it and the pack I was sent contains 1383356 objects (can be 
determined with 'git show-index < .git/objects/pack/*.idx | wc -l').  
However, there are only 978501 actually referenced objects in that 
cloned repository ( 'git rev-list --all --objects | wc -l').  That makes 
for 404855 useless objects in the cloned repository.
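
As a rough sketch of the comparison above, the two counts can be wrapped 
in small shell functions.  The function names are made up for 
illustration and are not part of git.  The loop sums over every pack 
index, since redirecting from '*.idx' is only reliable when exactly one 
pack exists.

```shell
#!/bin/sh
# Sketch: compare the packed object count with the reachable object count.
# Function names are illustrative, not part of git.

# Number of entries across all pack indexes in the repository at $1.
# An object present in several packs is counted once per pack.
count_packed() (
    cd "${1:-.}" || exit 1
    total=0
    for idx in .git/objects/pack/*.idx; do
        [ -e "$idx" ] || continue   # glob matched nothing: no packs
        n=$(git show-index < "$idx" | wc -l)
        total=$((total + n))
    done
    echo "$total"
)

# Number of objects reachable from any ref in the repository at $1.
count_referenced() (
    cd "${1:-.}" || exit 1
    echo $(( $(git rev-list --all --objects | wc -l) ))
)
```

Called from the top of a clone, the difference could be printed with:

    echo "unreferenced: $(( $(count_packed .) - $(count_referenced .) ))"

With the figures quoted above, the subtraction is 1383356 - 978501 = 
404855.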

Now git has a safety mechanism to _not_ delete unreferenced objects 
right away when running 'git gc'.  By default unreferenced objects are 
kept around for a period of 2 weeks.  This is to make it easy for you to 
recover accidentally deleted branches or commits, or to avoid a race 
where a just-created object that is not yet referenced could be deleted 
by a 'git gc' process running in parallel.

So to give that grace period to packed but unreferenced objects, the 
repack process pushes those unreferenced objects out of the pack into 
their loose form so they can be aged and eventually pruned.  Objects 
becoming unreferenced are usually not that many though.  Having 404855 
unreferenced objects is quite a lot, and being sent those objects in the 
first place via a clone is stupid and a complete waste of network 
bandwidth.
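
To watch this happening in a given clone, something like the following 
sketch may help.  It only reads state: 'git count-objects -v' reports 
how many objects are loose versus packed, and the 'gc.pruneExpire' 
setting holds the grace period.  The function names are hypothetical, 
and the fallback value is the documented default rather than anything 
read from the repository.

```shell
#!/bin/sh
# Sketch: inspect loose objects and the prune grace period.
# Function names are illustrative, not part of git.

# Print loose/packed object statistics for the repository at $1.
# In the output, 'count' is the number of loose objects and 'size' their
# disk usage in KiB; 'in-pack' and 'size-pack' describe packed objects.
loose_summary() (
    cd "${1:-.}" || exit 1
    git count-objects -v
)

# Print the configured grace period, or the documented default if unset.
prune_expire() (
    cd "${1:-.}" || exit 1
    git config --get gc.pruneExpire || echo "2.weeks.ago"
)
```

A 'count' that jumps into the hundreds of thousands while 'git gc' is 
running would match the symptom described in the original report.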

Does anyone know which git version is running on gcc.gnu.org?  It is 
certainly buggy and needs fixing.

Anyway... To solve your problem, you simply need to run 'git gc' with 
the --prune=now argument to disable that grace period and get rid of 
those unreferenced objects right away (safe only if no other git 
activities are taking place at the same time, which should be easy to 
ensure on a workstation).  The resulting .git/objects directory size 
will shrink to about 441 MB.  If the gcc.gnu.org git server was doing 
its job properly, the size of the clone transfer would also be 
significantly smaller, meaning around 414 MB instead of the current 600+ 
MB.
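
A minimal sketch of that step, with a before/after size report, might 
look like this.  The function name is hypothetical, and the sizes come 
from 'du -sk', so they are in units of 1024 bytes and will not exactly 
match figures measured another way.

```shell
#!/bin/sh
# Sketch: prune unreferenced objects immediately and report the size change.
# The function name is illustrative. Only run this when nothing else is
# using the repository, since --prune=now disables the grace period.

prune_now() (
    cd "${1:-.}" || exit 1
    before=$(du -sk .git/objects | cut -f1)
    git gc --prune=now --quiet || exit 1
    after=$(du -sk .git/objects | cut -f1)
    echo "objects: ${before} KiB -> ${after} KiB"
)
```

Usage would be 'prune_now /path/to/clone', or just 'prune_now' from the 
top of the working tree.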

And BTW, using 'git gc --aggressive' with a later git version (or
'git repack -a -f -d --window=250 --depth=250') gives me a .git/objects 
directory size of 310 MB, meaning that the actual repository with all 
the trunk history is _smaller_ than the actual source checkout.  If that 
repository was properly repacked on the server, the clone data transfer 
would be 283 MB.  This is less than half the current clone transfer 
size.

Nicolas