On Thu, 10 Jan 2008, Pierre Habouzit wrote:

> Well, lzma is excellent for *big* chunks of data, but not that impressive for
> small files:
>
> $ ll git.c git.c.gz git.c.lzma git.c.lzop
> -rw-r--r-- 1 madcoder madcoder 12915 2008-01-09 13:47 git.c
> -rw-r--r-- 1 madcoder madcoder  4225 2008-01-10 10:00 git.c.gz
> -rw-r--r-- 1 madcoder madcoder  4094 2008-01-10 10:00 git.c.lzma
> -rw-r--r-- 1 madcoder madcoder  5068 2008-01-10 09:59 git.c.lzop

This is really the big point here.  Git uses _lots_ of *small* objects,
usually much smaller than 12KB.  For example, my copy of the gcc repository
has an average of 270 _bytes_ per compressed object, and objects must be
individually compressed.  Performance with really small objects should be
the basis for any Git compression algorithm comparison.

> Though I don't agree with you (and some others) about the fact that
> gzip is fast enough. It's clearly a bottleneck in many log related
> commands where you would expect it to be rather IO bound than CPU
> bound. LZO seems like a fairer choice, especially since what it makes
> gain is basically the compression of the biggest blobs, aka the delta
> chains heads.

The delta heads, though, are far from being the most frequently accessed
objects.  First, they're clearly in the minority, and they're often cached
in the delta base cache anyway.

> It's really unclear to me if we really gain in
> compressing the deltas, trees, and other smallish informations.

Remember that delta objects represent the vast majority of all objects.
For example, my kernel repo currently has 555015 delta objects out of
677073 objects, or 82% of the total.  There are actually only 25869
non-deltified blob objects, which are likely to be the larger objects,
but they represent only 4% of the total.
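To see why tiny objects are the interesting case, here is a rough illustration (not from the original thread) using Python's zlib module, which wraps the same deflate library Git uses.  The payloads below are made up for the example; the point is only that a ~270-byte object pays proportionally far more for the fixed stream framing and cold dictionary than a larger one does:

```python
import zlib

# Made-up payloads for illustration only: a ~260-byte "object" and a
# larger one built from the same kind of content.
small = b"100644 blob 7a3c\tsome-file.c\n" * 9   # ~261 bytes
large = small * 100                              # ~26 KB

for name, data in (("small", small), ("large", large)):
    out = zlib.compress(data, 6)
    print(f"{name}: {len(data)} -> {len(out)} bytes "
          f"(ratio {len(out) / len(data):.2f})")
```

The zlib header, the Adler-32 trailer, and the lack of any shared dictionary across objects mean the small input always gets a worse ratio than the large one, which is why benchmarking a compressor on whole source files (as in the `ll` listing above) says little about its behavior on Git's object store.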
But let's just try not compressing delta objects, to check your assertion,
with the following hack:

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index a39cb82..252b03e 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -433,7 +433,10 @@ static unsigned long write_object(struct sha1file *f,
 	}
 	/* compress the data to store and put compressed length in datalen */
 	memset(&stream, 0, sizeof(stream));
-	deflateInit(&stream, pack_compression_level);
+	if (obj_type == OBJ_REF_DELTA || obj_type == OBJ_OFS_DELTA)
+		deflateInit(&stream, 0);
+	else
+		deflateInit(&stream, pack_compression_level);
 	maxsize = deflateBound(&stream, size);
 	out = xmalloc(maxsize);
 	/* Compress it */

You then only need to run 'git repack -a -f -d' with and without the above
patch.  Here are my rather surprising results:

  My kernel repo pack size without the patch:  184275401 bytes
  Same repo with the above patch applied:      205204930 bytes

So it is only 11% larger.  I was expecting much more.  I'll let someone
else do profiling/timing comparisons.

> What is obvious to me is that lzop seems to take 10% more space than gzip,
> while being around 1.5 to 2 times faster. Of course this is very sketchy and a
> real test with git will be better.

Right.  Abstracting the zlib code and having different compression
algorithms tested in the Git context is the only way to do meaningful
comparisons.

Nicolas
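A side note on why the hack above is safe without touching the reading side: `deflateInit(&stream, 0)` still produces a valid zlib stream, it just emits "stored" (uncompressed) deflate blocks, so existing unpack code decodes it unchanged.  A quick sketch of that behavior, again via Python's zlib binding rather than the C API, with a made-up stand-in for a delta payload:

```python
import zlib

# Made-up payload standing in for a git delta object.
delta = b"\x00copy 0x100 0x40 insert 12 bytes" * 8

stored = zlib.compress(delta, 0)  # what deflateInit(&stream, 0) yields
packed = zlib.compress(delta, 6)  # the normal pack_compression_level path

# Level 0 output is slightly LARGER than the input (zlib header, 5 bytes
# of framing per stored block, Adler-32 trailer), but it is still a
# valid zlib stream, so readers need no change.
assert zlib.decompress(stored) == delta
print(len(delta), len(stored), len(packed))
```

That framing overhead is part of why the level-0 pack above grows by 11% rather than by the full compression savings on deltas alone: deltas are already small and dense, so they compress relatively little to begin with.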