Re: [RFC] pack-objects: compression level for non-blobs

Jeff King <peff@xxxxxxxx> · Sun, 30 Dec 2012 07:05:42 -0500

On Sat, Dec 29, 2012 at 12:27:47AM -0500, Jeff King wrote:

> > If reachability bitmap is implemented, we'll have per-pack cache
> > infrastructure ready, so less work there for commit cache.
> 
> True. I don't want to dissuade you from doing any commit cache work. I
> only wanted to point out that this alternative may have merit because of
> its simplicity (so we can use it until a caching solution exists, or
> even after, if managing the cache has downsides).

So I was thinking about this, which led to some coding, which led to
some benchmarking.

I want to clean up a few things in the code before I post it, but the
general idea is to have arbitrary per-pack cache files in the
objects/pack directory. Like this:

  $ cd objects/pack && ls
  pack-a3e262f40d95fc0cc97d92797ff9988551367b75.commits
  pack-a3e262f40d95fc0cc97d92797ff9988551367b75.idx
  pack-a3e262f40d95fc0cc97d92797ff9988551367b75.pack
  pack-a3e262f40d95fc0cc97d92797ff9988551367b75.parents
  pack-a3e262f40d95fc0cc97d92797ff9988551367b75.timestamps
  pack-a3e262f40d95fc0cc97d92797ff9988551367b75.trees

Each file describes the objects in the matching pack. If a new pack is
generated, you'd throw away the old cache files along with the old pack,
and generate new ones. Or not. These are totally optional, and an older
version of git will just ignore them. A newer version will use them if
they're available, and otherwise fallback to the existing code (i.e.,
reading the whole object from the pack). So you can generate them at
repack time, later on, or not at all. For now I have a separate command
that generates them based on the pack index; if this turns out to be a
good idea, it would probably get called as part of "repack".

Each file is a set of fixed-length records. The "commits" file contains
the sha1 of every commit in the pack (sorted). A binary search of the
mmap'd file gives the position of a particular commit within the list,
and that position is used to index the parents, timestamps, and trees
files (obviously if it is missing, then the other files are useless, but
we already have to be able to fallback to just reading the objects
anyway).

I split it out into multiple files because you can actually operate with
a subset (though in my initial attempt, I transparently plug in at the
parse_commit layer, which means we need all items to consider the commit
"parsed", whether the caller actually cares or not. But in theory a
reader could only want to ask for one item).  Making a "generation"
cache file is an obvious next step (and because we already have
"commits", it is only 4 bytes per commit on top of it). Reachability
bitmaps would be another one (though due to the compression, I am not
sure they will work with a fixed-size record design, so this may need
some modification).

Anyway, here are the numbers I came up with (appended to my earlier
compression numbers):

git.git:
 Pack  | Size          |  Cold Revs  |  Warm Revs  | Cold Objects | Warm Objects
-------+---------------+-------------+-------------+--------------+--------------
  none |  56.72        | 0.68        | 0.33        |  2.45        |  1.94       
commit |  64.61 (+13%) | 0.50 (-26%) | 0.09 (-74%) |  2.42  (-1%) |  1.69 (-13%)
  tree |  60.68  (+6%) | 0.79 (+16%) | 0.33   (0%) |  2.23  (-8%) |  1.75  (-9%)
  both |  68.54 (+20%) | 0.48 (-29%) | 0.08 (-75%) |  2.24  (-8%) |  1.48 (-23%)
 cache |  59.29  (+4%) | 0.57 (-16%) | 0.05 (-84%) |  2.23  (-8%) |  1.66 (-14%)

linux.git:
 Pack  | Size          |  Cold Revs  |  Warm Revs  | Cold Objects | Warm Objects
-------+---------------+-------------+-------------+--------------+--------------
  none | 864.61        | 8.66        | 4.07        | 42.76        | 36.32       
commit | 970.46 (+12%) | 8.87  (+2%) | 1.02 (-74%) | 42.94   (0%) | 33.43  (-7%)
  tree | 895.37  (+3%) | 9.08  (+4%) | 4.07   (0%) | 36.01 (-15%) | 29.62 (-18%)
  both |1001.25 (+15%) | 8.90  (+2%) | 1.03 (-74%) | 35.57 (-16%) | 26.25 (-27%)
 cache | 894.78  (+3%) | 4.88 (-43%) | 0.69 (-83%) | 38.80  (-9%) | 32.79  (-9%)

webkit.git:
 Pack  | Size          |  Cold Revs  |  Warm Revs  | Cold Objects | Warm Objects
-------+---------------+-------------+-------------+--------------+--------------
  none |   3.46        | 1.61        | 1.38        | 20.46        | 18.72       
commit |   3.54  (+2%) | 1.42 (-11%) | 0.34 (-75%) | 20.42   (0%) | 17.57  (-6%)
  tree |   3.59  (+3%) | 1.61   (0%) | 1.39   (0%) | 16.01 (-21%) | 14.00 (-25%)
  both |   3.67  (+6%) | 1.45 (-10%) | 0.34 (-75%) | 15.94 (-22%) | 12.91 (-31%)
 cache |   3.47   (0%) | 0.49 (-69%) | 0.14 (-90%) | 19.53  (-4%) | 17.86  (-4%)

So you can see that it performs even better than no-compression on the
warm-revs case. Which makes sense, since we do not even have to touch
the object data at all, and can do the whole traversal straight out of
the cache. So we do not even have to memcpy the bytes around. And it
takes up even less space (3-4% versus 12-13% on the first two repos).
Which makes sense, because even though we are duplicating some
information that is in the packfile, we are leaving all of the commit
message bodies compressed.

The other interesting thing is that the cold cache performance also
improves by a lot. Again, this makes sense; we are doing the traversal
completely out of cache, and our data is even more tightly packed in the
cache than it is in the packfile.

Of course, it does very little for the full --objects listing, where we
spend most of our time inflating trees. We could couple this with
uncompressed trees (which are not all that much bigger, since the sha1s
do not compress anyway). Or we could have an external tree cache, but
I'm not sure exactly what it would look like (this is basically
reinventing bits of packv4, but doing so in a way that is redundant with
the existing packfile, rather than replacing it). Or since the point of
--objects is usually reachability, it may make more sense to pursue the
bitmap, which should be even faster still.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html