Jonas Fonseca <jonas.fonseca@xxxxxxxxx> wrote:
> I will not post the numbers here. They are available in
> http://jonas.nitro.dk/tmp/stats.pdf for those interested. The following
> is my "analysis" of the numbers.

Thanks, this was interesting stuff.

> As expected, the randomness of the content of both commit and tag objects
> results in a very poor packing performance of only 2%.

This is one reason why Jon Smirl was pushing the idea of dictionary
based compression.  git.git has only 276 unique author lines, and
just 37 of them are really the top committers.  Not surprisingly
Junio C Hamano leads the pack with 3529+ commits...  :-)

A dictionary based compression would allow us to easily compress
Junio's authorship line away from those 3529+ commits into a single
string, getting much better compression on the commits.

In trees this may work very well too for very common file names,
e.g. "Makefile".  Yes, each tree delta compresses very well against
its base (and likely copies the file name from the base even when
the SHA1 changed), but if the bases were able to use a common
dictionary that would help even more.

> The data show that for minimal index files, the packs need to contain
> more than 2500 objects. The 24 bytes per-object for the optimal case
> includes 20-bytes for the object SHA1, and thus cannot be expected to
> become lower.

This is just a fundamental property of the pack index file format.
The file *MUST* have 1064 bytes of fixed overhead, plus 24 bytes of
data per object indexed.  So the fixed overhead amortizes very
quickly over the individual object entries, at which point it's
effectively 24 bytes per entry.  This all of course assumes a 32 bit
index (which is the current format).

The thing is the Mozilla index is 44 MiB.  That's roughly 1.9 million
objects.  The index itself is larger than the entire git.git pack.
On a large repository the index ain't trivial... yet it's essential
to performance!
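To make the arithmetic above concrete, here is a rough sketch in
Python.  The 1064 and 24 byte figures are the ones quoted above; the
breakdown into a fan-out table, two trailing checksums and a 4-byte
offset reflects my understanding of the current 32 bit index layout.
The function names are made up for this example and are not anything
in git itself.

```python
# Sketch of the size arithmetic for the current (32 bit) pack index.
# Names here are illustrative only; they do not come from git.

FANOUT_BYTES = 256 * 4        # 256-entry fan-out table, 4 bytes per entry
TRAILER_BYTES = 20 + 20       # pack checksum + index checksum, 20 bytes each
FIXED_OVERHEAD = FANOUT_BYTES + TRAILER_BYTES   # the 1064 bytes mentioned above
ENTRY_BYTES = 4 + 20          # 4-byte pack offset + 20-byte object SHA1


def index_size(num_objects):
    """Total index size in bytes for a pack holding num_objects objects."""
    return FIXED_OVERHEAD + ENTRY_BYTES * num_objects


def bytes_per_object(num_objects):
    """Index cost per object, including the amortized fixed overhead."""
    return index_size(num_objects) / num_objects


def objects_in_index(size_bytes):
    """Number of objects an index of size_bytes can describe."""
    return (size_bytes - FIXED_OVERHEAD) // ENTRY_BYTES


if __name__ == "__main__":
    for n in (10, 100, 2500, 100000):
        print(n, "objects:", round(bytes_per_object(n), 2), "bytes/object")
    # A 44 MiB index, as in the Mozilla case.
    print(objects_in_index(44 * 1024 * 1024), "objects in a 44 MiB index")
```

At 2500 objects the fixed overhead adds less than half a byte per
object, which matches the threshold in the quoted analysis, and a
44 MiB index works out to a little over 1.9 million objects.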

On the other hand the 1064 bytes of fixed overhead in the index is
nothing compared to the overhead in, say, an RCS file.  Or an SVN
repository...  :-)

What I failed to point out in my script (or in my email) is that
the 24 bytes of index entry cannot be eliminated, and thus must be
added to the "revision cost".  In some cases it's about the same
size as the deltified revision in the pack file.  :-(

--
Shawn.