On Wed, Aug 5, 2009 at 11:39 PM, Nicolas Pitre<nico@xxxxxxx> wrote:
> On Tue, 4 Aug 2009, Hin-Tak Leung wrote:
>
>> I cloned gcc's git about a week ago to work on some problems I have
>> with gcc on minor platforms, just plain 'git clone
>> git://gcc.gnu.org/git/gcc.git gcc', and ran git fetch about daily, and
>> 'git rebase origin' from time to time. I don't have local changes,
>> just following and monitoring what's going on in gcc. So after a week,
>> I thought I'd do a 'git gc'. Then it goes very bizarre.
>>
>> Before I start 'git gc', the whole of .git was about 700MB and
>> .git/objects/pack was a bit under 600MB, with a few other directories
>> under .git/objects at 10's of K's and a few 30000-40000K's, and the
>> checkout was, well, the size of gcc source code. But after I started
>> git gc, the message stays in the 'counting objects' at about 900,000
>> for a long time, while a lot of directories under .git/objects/ get a
>> bit large, and .git blows up to at least 7GB with a lot of small files
>> under .git/objects/*/, before seeing as I will run out of disk space,
>> I kill the whole lot and ran git clone again, since I don't have any
>> local change and there is nothing to lose.
>>
>> I am running git version 1.6.2.5 (fedora 11). Is there any reason why
>> 'git gc' does that?
>
> There is probably a reason, although a bad one for sure.
>
> Well... OK.
>
> It appears that the git installation serving clone requests for
> git://gcc.gnu.org/git/gcc.git generates lots of unreferenced objects. I
> just cloned it and the pack I was sent contains 1383356 objects (can be
> determined with 'git show-index < .git/objects/pack/*.idx | wc -l').
> However, there are only 978501 actually referenced objects in that
> cloned repository ('git rev-list --all --objects | wc -l'). That makes
> for 404855 useless objects in the cloned repository.
>
> Now git has a safety mechanism to _not_ delete unreferenced objects
> right away when running 'git gc'.
> By default unreferenced objects are
> kept around for a period of 2 weeks. This is to make it easy for you to
> recover accidentally deleted branches or commits, or to avoid a race
> where a just-created but not yet referenced object could be deleted
> by a 'git gc' process running in parallel.
>
> So to give that grace period to packed but unreferenced objects, the
> repack process pushes those unreferenced objects out of the pack into
> their loose form so they can be aged and eventually pruned. Objects
> becoming unreferenced are usually not that many though. Having 404855
> unreferenced objects is quite a lot, and being sent those objects in the
> first place via a clone is stupid and a complete waste of network
> bandwidth.
>
> Does anyone have an idea of the git version running on gcc.gnu.org? It
> is certainly buggy and needs fixing.
>
> Anyway... To solve your problem, you simply need to run 'git gc' with
> the --prune=now argument to disable that grace period and get rid of
> those unreferenced objects right away (safe only if no other git
> activities are taking place at the same time, which should be easy to
> ensure on a workstation). The resulting .git/objects directory size
> will shrink to about 441 MB. If the gcc.gnu.org git server was doing
> its job properly, the size of the clone transfer would also be
> significantly smaller, meaning around 414 MB instead of the current 600+
> MB.
>
> And BTW, using 'git gc --aggressive' with a later git version (or
> 'git repack -a -f -d --window=250 --depth=250') gives me a .git/objects
> directory size of 310 MB, meaning that the actual repository with all
> the trunk history is _smaller_ than the actual source checkout. If that
> repository was properly repacked on the server, the clone data transfer
> would be 283 MB. This is less than half the current clone transfer
> size.
>
>
> Nicolas
>

'git gc --prune=now' does work, but 'git gc --prune=now --aggressive'
(before) and 'git gc --aggressive' (after) both create very large packs
(>2GB; I stopped it) from the ~400MB-600MB of packed objects. I noted
that you specifically wrote 'with a later git version' - presumably
there is some sort of known and fixed issue there? Just curious.

I guess --aggressive doesn't always save space...

Hin-Tak
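
For anyone wanting to repeat the object count quoted above on their own clone, here is a rough sketch that wraps the two commands from Nicolas's message in shell functions. The function names are made up for illustration, and it assumes a non-bare repository with the current directory at the top of the work tree. The difference is only a good estimate right after a clone, when everything is still packed; it ignores loose objects and anything kept alive only by reflogs.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical, not part of git.

# Count objects stored in packs, one 'git show-index' call per .idx file
# (show-index reads a single index on stdin, so loop over the files).
count_packed_objects() (
    total=0
    for idx in .git/objects/pack/*.idx; do
        [ -e "$idx" ] || continue    # glob matched nothing
        n=$(git show-index < "$idx" | wc -l)
        total=$((total + n))
    done
    echo "$total"
)

# Count objects reachable from any ref.
count_reachable_objects() {
    git rev-list --all --objects | wc -l
}

# Difference between a packed count ($1) and a reachable count ($2).
unreferenced_count() {
    echo $(( $1 - $2 ))
}

# Usage, from the top of a freshly cloned work tree:
#   unreferenced_count "$(count_packed_objects)" "$(count_reachable_objects)"
# With the figures quoted in the message:
#   unreferenced_count 1383356 978501    # prints 404855
```

If the difference is large, 'git gc --prune=now' as described above removes those objects immediately instead of exploding them into loose files first.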