Hi Linus!

On Thu, 2008-04-03 at 14:11 -0700, Linus Torvalds wrote:
>
> On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> >
> > The repository was created using hg2git (the one based on git-fast-import)
> > and it was GC'ed and REPACK'ed just in case.
>
> Before going any further - exactly _how_ was it repacked?

I believe it was the following two steps:

    $ git gc --aggressive
    $ git repack

> In particular, when using importers that do partial packing on their own
> (and any "git-fastimport" user is that by definition - and I think
> hg2git does that), at the end of it all you have to make sure to repack in
> a way where the repacking will totally discard the import-time packfiles.

Good point. Speaking of which: do you have an FAQ for importers? The entries
in the official FAQ
(http://git.or.cz/gitwiki/GitFaq#head-929a8825d04dde226c2530f5337d3b3ed8dcc7ce)
seem a bit stale for such an important issue. After all, importing from an
existing SCM is what usually forms a first-time impression of Git's
effectiveness.

> IOW, that's one of the very few times you should use "-f" to git repack.

Got it!

> It's usually also a good place to make sure that since you ignore the old
> packing information, it's best to also make sure that the new packing info
> is good by using a bigger window (and perhaps a bigger depth). That makes
> the packing much slower, of course, but this is meant to be a one-time
> event.
>
> So try something like
>
>     git repack -a -d -f --depth=100 --window=100
>
> if you have a good CPU and plenty of memory.

That turned out to be a perfect suggestion. Thank you. I'm now the happiest
camper ever. And I'm also pretty dumbfounded ;-)

Here's what happened. I started with a repository filled with "loose"
(one object per file) objects. I needed it for the ease of sleuthing
through individual objects, and it was created by git-unpack-objects from
that initial 1.1GB pack.
And I tried to pack it exactly like you suggested:

    $ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects
    Generating pack...
    Counting objects: 1096305
    Done counting 1159628 objects.
    Deltifying 1159628 objects...
     100% (1159628/1159628) done
    Writing 1159628 objects...
    dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f
    Total 1159628 (delta 386980), reused 0 (delta 0)

and it paid off reasonably well:

    $ du -sh NB-clone
    670M    NB-clone

It was still bigger than the Mercurial repository, but at least it was
2 times smaller than the original result of hg2git. Now, if it wasn't for
a friend of mine, I probably would've stopped there. But he showed up and
saved the day ;-) His comments made me try something that I didn't consider
to be of any use -- repacking a freshly packed pack with the *same*
--depth=100 --window=100:

    $ git repack -a -f --window=100 --depth=100
    Generating pack...
    Counting objects: 1056829
    Done counting 1159628 objects.
    Deltifying 1159628 objects...
     100% (1159628/1159628) done
    Writing 1159628 objects...
     100% (1159628/1159628) done
    Total 1159628 (delta 614516), reused 0 (delta 0)
    Pack pack-dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f created.

And then, a miracle occurred:

    $ du -sh NB-small
    268M    NB-small

Now, don't get me wrong: I'm as happy as a clam. The repository is now
*smaller* than Mercurial's, and because the structure of the tree is so
weird, Git gets major points here. The only question that is still bothering
me is: how did it happen? Why did repacking a repository with exactly the
same set of objects, the only difference being where these objects resided
(the filesystem in the former case, an intermediate pack in the latter),
make such a huge difference? Please help!

> > The last item (trees) also seem to take the most space and the most
> > reasonable explanation that I can offer is that NetBeans repository has
> > a really weird structure where they have approximately 700 (yes, seven
> > hundred!) top-level subdirectories there.
> > They are clearly Submodules-shy, but that's another issue that I will
> > need to address with them.
>
> Trees taking the biggest amount of space is not unheard of, and it may
> also be that the name heuristics (for finding good packing partners) could
> be failing, which would result in a much bigger pack than necessary.

Is there any documentation that describes the heuristics involved in
creating a pack?

> So if you already did an aggressive repack like the above, I'd happily
> take a look at whether maybe it's bad heuristics for finding tree objects
> to pair up for delta-compression. Do you have a place where you can put
> that repo for people to clone and look at?

Unfortunately I don't. The only thing I can do is create a *.tar.bz2 and
put it on Sun's ftp server. Actually, that makes me wonder: is there any
public Git hosting available such that publishing a hefty repository for
forensic purposes only wouldn't violate their terms of use?

Thanks,
Roman.

P.S. Oh, and here's one extra tiny question that I also have: what does the
output:

    Total 1159628 (delta 614516), reused 0 (delta 0)

really mean?
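My own reading of that summary line, which is an assumption on my part and not something confirmed in this thread, is that "Total" counts the objects written to the pack and the first "delta" counts how many of them were stored as deltas against another object. Under that reading, a small Python sketch (the names SUMMARIES, PATTERN and delta_share are made up for illustration) compares how the two runs deltified the same 1159628 objects:

```python
import re

# Summary lines copied from the two packing runs quoted in this message.
SUMMARIES = {
    "git-pack-objects on loose objects": "Total 1159628 (delta 386980), reused 0 (delta 0)",
    "git repack -a -f on that pack": "Total 1159628 (delta 614516), reused 0 (delta 0)",
}

PATTERN = re.compile(r"Total (\d+) \(delta (\d+)\), reused (\d+) \(delta (\d+)\)")


def delta_share(line):
    """Return (total, deltas, fraction) for one summary line.

    Assumption: "Total" is the number of objects written and the first
    "delta" figure is how many of those were stored as deltas.
    """
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a pack summary line: {line!r}")
    total, deltas = int(match.group(1)), int(match.group(2))
    return total, deltas, deltas / total


for label, line in SUMMARIES.items():
    total, deltas, share = delta_share(line)
    print(f"{label}: {deltas} of {total} objects as deltas ({share:.1%})")
```

If that reading is right, roughly a third of the objects ended up as deltas in the first pack and a bit more than half in the second, which would at least be consistent with the drop from 670M to 268M. It does not explain why the second run found so many more delta candidates.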