On Thu, 10 Jan 2008, Pierre Habouzit wrote:

> Well, lzma is excellent for *big* chunks of data, but not that impressive for
> small files:
>
> $ ll git.c git.c.gz git.c.lzma git.c.lzop
> -rw-r--r-- 1 madcoder madcoder 12915 2008-01-09 13:47 git.c
> -rw-r--r-- 1 madcoder madcoder  4225 2008-01-10 10:00 git.c.gz
> -rw-r--r-- 1 madcoder madcoder  4094 2008-01-10 10:00 git.c.lzma
> -rw-r--r-- 1 madcoder madcoder  5068 2008-01-10 09:59 git.c.lzop

This is really the big point here.  Git uses _lots_ of *small* objects,
usually much smaller than 12KB.  For example, my copy of the gcc repository
has an average of 270 _bytes_ per compressed object, and objects must be
individually compressed.  Performance with really small objects should be
the basis for any Git compression algorithm comparison.

> Though I don't agree with you (and some others) about the fact that
> gzip is fast enough. It's clearly a bottleneck in many log related
> commands where you would expect it to be rather IO bound than CPU
> bound. LZO seems like a fairer choice, especially since what it makes
> gain is basically the compression of the biggest blobs, aka the delta
> chains heads.

The delta heads, though, are far from being the most frequently accessed
objects.  First, they're clearly in the minority, and they're often cached
in the delta base cache anyway.

> It's really unclear to me if we really gain in
> compressing the deltas, trees, and other smallish informations.

Remember that delta objects represent the vast majority of all objects.
For example, my kernel repo currently has 555015 delta objects out of
677073 objects, or 82% of the total.  There are actually only 25869
non-deltified blob objects, which are likely to be the larger objects,
but they represent only 4% of the total.
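To see why tiny objects are the interesting case, here is a rough illustration (not from the original thread) using Python's zlib module, which wraps the same deflate library Git uses.  The payloads below are made up for the example; the point is only that a ~270-byte object pays proportionally far more for the fixed stream framing and cold dictionary than a larger one does:

```python
import zlib

# Made-up payloads for illustration only: a ~260-byte "object" and a
# larger one built from the same kind of content.
small = b"100644 blob 7a3c\tsome-file.c\n" * 9   # ~261 bytes
large = small * 100                              # ~26 KB

for name, data in (("small", small), ("large", large)):
    out = zlib.compress(data, 6)
    print(f"{name}: {len(data)} -> {len(out)} bytes "
          f"(ratio {len(out) / len(data):.2f})")
```

The zlib header, the Adler-32 trailer, and the lack of any shared dictionary across objects mean the small input always gets a worse ratio than the large one, which is why benchmarking a compressor on whole source files (as in the `ll` listing above) says little about its behavior on Git's object store.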
But let's just try not compressing delta objects, to check your assertion,
with the following hack:

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index a39cb82..252b03e 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -433,7 +433,10 @@ static unsigned long write_object(struct sha1file *f,
 	}
 	/* compress the data to store and put compressed length in datalen */
 	memset(&stream, 0, sizeof(stream));
-	deflateInit(&stream, pack_compression_level);
+	if (obj_type == OBJ_REF_DELTA || obj_type == OBJ_OFS_DELTA)
+		deflateInit(&stream, 0);
+	else
+		deflateInit(&stream, pack_compression_level);
 	maxsize = deflateBound(&stream, size);
 	out = xmalloc(maxsize);
 	/* Compress it */

You then only need to run 'git repack -a -f -d' with and without the above
patch.  Here are my rather surprising results:

  My kernel repo pack size without the patch:  184275401 bytes
  Same repo with the above patch applied:      205204930 bytes

So it is only 11% larger.  I was expecting much more.  I'll let someone
else do profiling/timing comparisons.

> What is obvious to me is that lzop seems to take 10% more space than gzip,
> while being around 1.5 to 2 times faster. Of course this is very sketchy and a
> real test with git will be better.

Right.  Abstracting the zlib code and having different compression
algorithms tested in the Git context is the only way to do meaningful
comparisons.

Nicolas
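A side note on why the hack above is safe without touching the reading side: `deflateInit(&stream, 0)` still produces a valid zlib stream, it just emits "stored" (uncompressed) deflate blocks, so existing unpack code decodes it unchanged.  A quick sketch of that behavior, again via Python's zlib binding rather than the C API, with a made-up stand-in for a delta payload:

```python
import zlib

# Made-up payload standing in for a git delta object.
delta = b"\x00copy 0x100 0x40 insert 12 bytes" * 8

stored = zlib.compress(delta, 0)  # what deflateInit(&stream, 0) yields
packed = zlib.compress(delta, 6)  # the normal pack_compression_level path

# Level 0 output is slightly LARGER than the input (zlib header, 5 bytes
# of framing per stored block, Adler-32 trailer), but it is still a
# valid zlib stream, so readers need no change.
assert zlib.decompress(stored) == delta
print(len(delta), len(stored), len(packed))
```

That framing overhead is part of why the level-0 pack above grows by 11% rather than by the full compression savings on deltas alone: deltas are already small and dense, so they compress relatively little to begin with.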