Re: [RFC] pack-objects: compression level for non-blobs

Jeff King <peff@xxxxxxxx> · Fri, 28 Dec 2012 19:41:05 -0500

On Mon, Nov 26, 2012 at 05:25:54PM +1100, David Michael Barr wrote:

>  The intent is to allow selective recompression of pack data.
>  For small objects/deltas the overhead of deflate is significant.
>  This may improve read performance for the object graph.
> 
>  I ran some unscientific experiments with the chromium repository.
>  With pack.graphcompression = 0, there was a 2.7% increase in pack size.
>  I saw a 35% improvement with cold caches and 43% otherwise on git log --raw.

There wasn't much response to this, but those numbers are encouraging. I
was curious to replicate them, as well as to break it down by trees and
commits. I also wanted to test on more repositories, as well as on both
SSDs and spinning disks (for cold-cache numbers). Maybe that will catch
more people's interest.

As you mentioned in your follow-up, I ran into the "delta size changed"
problem. Not sure if it is related, but I noticed here:

> @@ -379,6 +396,13 @@ static unsigned long write_reuse_object(struct sha1file *f, struct object_entry
>  	offset += entry->in_pack_header_size;
>  	datalen -= entry->in_pack_header_size;
>  
> +	if (!pack_to_stdout &&
> +	    pack_graph_compression_seen &&
> +	    check_pack_compressed(p, &w_curs, offset) != !!compression_level(entry->actual_type)) {
> +		unuse_pack(&w_curs);
> +		return write_no_reuse_object(f, entry, limit, usable_delta);
> +	}
> +

...that we seem to re-compress more than necessary. If I instrument that
block with a message to stderr and run "git repack -ad" repeatedly
without changing the config in between, runs after the first should
never re-compress, right? But they seem to. I'm not sure if your
check_pack_compressed heuristic is off or something else. It may or may
not be related to the "delta sized change" failure.

But we can leave this side a bit for a moment. Conceptually there are
two interesting things going on in your patch:

  1. Per-object-type compression levels

  2. Auto-recompression when levels change.

We can figure out (2) later. The meat of the idea is (1), and the patch
for that is much simpler. In fact, we can test it out with entirely
stock git by creating separate tree, commit, and blob packs, each with
different compression. So that's what I did for my timing, just to keep
things simple.

I timed git.git, linux-2.6.git, and WebKit.git. For each repo, I tested
it with four pack compression scenarios:

  1. all objects at -1 (zlib default)

  2. commits at 0, everything else at -1

  3. trees at 0, everything else at -1

  4. commits and trees at 0, everything else at -1

For each scenario, I timed "git rev-list --count --all" to traverse all
commits (which roughly models things like merge-base, ahead/behind
counts, etc), and then the same thing with "--objects" to traverse all
objects (which would roughly match what "git prune" or the "counting
objects" phase of packing would do). For each command, I timed both warm
and cold disk cache (the latter via "echo 3 >/proc/sys/vm/drop_caches").
Each timing is a best-of-five.  The timings were done on a machine with
an SSD (which probably matters for cold-cache; I have some spinning disk
numbers later).

Here are the git.git numbers:

 Pack  | Size          |  Cold Revs  |  Warm Revs  | Cold Objects | Warm Objects
-------+---------------+-------------+-------------+--------------+--------------
  none |  41.48        | 0.78        | 0.33        |  2.35        |  1.94       
commit |  49.34 (+18%) | 0.57 (-26%) | 0.09 (-74%) |  2.48  (+5%) |  1.70 (-12%)
  tree |  45.43  (+9%) | 0.80  (+3%) | 0.33   (0%) |  2.11  (-9%) |  1.74 (-10%)
  both |  53.31 (+28%) | 0.79  (+1%) | 0.08 (-75%) |  2.27  (-3%) |  1.49 (-23%)

The pack column specifies which scenario (i.e., what was left
uncompressed).  The size column is the size of the object-dir (in
megabytes). The other columns are times to run each command in
wall-clock seconds. Percentages are comparisons to the baseline "none"
case (i.e., the status quo).

So you can see that it's a big win for warm-cache pure-commit
traversals. As a sanity check, we can see that the tree-only case is not
helped at all there (because we do not look at trees at all). The
cold-cache case is helped, too, but that benefit goes away (and even
hurts slightly, but that is probably within the noise) when we also
leave the trees uncompressed.

The full-objects traversal doesn't fare quite as well, though there's
still some improvement. I think it argues for leaving both uncompressed,
as the warm case really benefits when both are uncompressed. You lose
the cold-cache benefits in the revs-only case, though.

Here are the numbers for linux-2.6.git:

 Pack  | Size          |  Cold Revs  |  Warm Revs  | Cold Objects | Warm Objects
-------+---------------+-------------+-------------+--------------+--------------
  none | 864.61        | 8.66        | 4.07        | 42.76        | 36.32       
commit | 970.46 (+12%) | 8.87  (+2%) | 1.02 (-74%) | 42.94   (0%) | 33.43  (-7%)
  tree | 895.37  (+3%) | 9.08  (+4%) | 4.07   (0%) | 36.01 (-15%) | 29.62 (-18%)
  both | 1001.25 (+15%) | 8.90  (+2%) | 1.03 (-74%) | 35.57 (-16%) | 26.25 (-27%)

Similar warm-cache numbers, but the cold cache for the revs-only case is
hurt a little bit more.

And here's WebKit.git (sizes are in gigabytes this time):

 Pack  | Size          |  Cold Revs  |  Warm Revs  | Cold Objects | Warm Objects
-------+---------------+-------------+-------------+--------------+--------------
  none |   3.46        | 1.61        | 1.38        | 20.46        | 18.72       
commit |   3.54  (+2%) | 1.42 (-11%) | 0.34 (-75%) | 20.42   (0%) | 17.57  (-6%)
  tree |   3.59  (+3%) | 1.61   (0%) | 1.39   (0%) | 16.01 (-21%) | 14.00 (-25%)
  both |   3.67  (+6%) | 1.45 (-10%) | 0.34 (-75%) | 15.94 (-22%) | 12.91 (-31%)

Pretty similar again (slightly better on the full object traversal).

And finally, for comparison, here are the numbers from a (much slower)
machine that has spinning disks (albeit in a mirrored raid, which should
improve read times) on git.git:

 Pack  | Size          |  Cold Revs  |  Warm Revs  | Cold Objects | Warm Objects
-------+---------------+-------------+-------------+--------------+--------------
  none |  41.35        | 1.85        | 0.64        |  5.58        |  3.91       
commit |  49.23 (+19%) | 1.94  (+4%) | 0.14 (-77%) |  5.51  (-1%) |  3.40 (-12%)
  tree |  45.27  (+9%) | 1.78  (-3%) | 0.64   (0%) |  5.13  (-8%) |  3.53  (-9%)
  both |  53.16 (+28%) | 1.83  (-1%) | 0.14 (-77%) |  4.96 (-11%) |  3.32 (-14%)

Surprisingly not all that different than the SSD times. Which may mean I
screwed something up. I'm happy to make my test harness available if
anybody else feels like timing on their repos or machines. But it does
point to potentially leaving commits uncompressed, and possibly trees.

I wonder if we could do even better, though. For a traversal, we only
need to look at the commit header. We could potentially do a progressive
inflate and stop before getting to the commit message (which is the bulk
of the data, and the part that is most likely to benefit from
compression).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html