On 7/23/2020 6:15 PM, Junio C Hamano wrote:
> It might be too late to ask this now, but how does the quality of
> the resulting combined pack ensured, wrt locality and deltification?

There are two questions here, really.

The first is: given the set of objects to pack, are we packing them as
efficiently as possible?

Since e11d86de139 (midx: teach "git multi-pack-index repack" honor
"git repack" configurations, 2020-05-10), the 'repack' subcommand
honors the configured recommendations for deltas. This includes:

(requires updating the arguments to pack-objects)
 * repack.useDeltaBaseOffset
 * repack.useDeltaIslands

(automatically respected by pack-objects)
 * repack.packKeptObjects
 * pack.threads
 * pack.depth
 * pack.window
 * pack.windowMemory
 * pack.deltaCacheSize
 * pack.deltaCacheLimit

All of these config settings allow the user to specify how hard to try
for delta compression. If they know something about their data or their
tolerance for extra CPU time during pack-objects, then they can get
better deltas by changing these values.

The second question is: how well do the deltas compress when we only
pack incrementally versus packing the entire repo?

One important factor here is how the pack-files are created. If we
expect most pack-files to come from 'git fetch' calls, then some
interesting patterns arise.

I started measuring by creating a local clone of the Linux kernel repo
starting at v5.0 and then fetching an increment of ten commits from the
first-parent history of later tags. Each fetch created a pack-file of
~300 MB relative to the base pack-file of ~1.6 GB. Collecting ten of
these in a row leads to almost 2 GB of "fetched" packs. However, keep
in mind that we didn't fetch 2 GB of data "across the wire"; instead,
the thin packs were expanded into full packs by copying the base
objects.

After running the incremental-repack step, that ~2 GB of data
re-compresses back down to one pack-file of size ~300 MB.

_Why_ did 10 pack-files, each around 300 MB, get repacked at once? It's
because there were duplicate objects across those pack-files! Recall
that the multi-pack-index repack selects packs for a batch by computing
an "expected pack size" for each one: it counts how many objects in
that pack-file are referenced by the multi-pack-index, then computes

  expected size = actual size * (num referenced objects) / (num objects)

In this case, the "base" objects that are copied between the fetches
are duplicated across these smaller pack-files, and the
multi-pack-index references each object from only one pack, so much of
each pack's content is unreferenced and its expected size falls far
below its actual ~300 MB. Thus, when the batch-size is ~300 MB, all 10
"small" packs still fit in one batch and are repacked into a new pack
that is still ~300 MB.

Now, this is still a little wasteful. That second pack has a
significant "extra space" cost. However, it comes with the bonus of
writing much less data.

Perhaps the Linux kernel repository is just too small to care about
this version of maintenance? In such a case, I can work to introduce a
'full-repack' task that is more aggressive about repacking all
pack-files. This could use the multi-pack-index repack with
--batch-size=0 to still benefit from the repack/expire trick for safe
concurrency.

Other ideas are to try repacking in other ways, such as by object type,
to maximize easy wins. For example, perhaps we repack all of the
commits and trees every time, but leave the blobs to be repacked when
we are ready to spend time on removing deltas?

I think the incremental-repack has value, even if that value is
isolated to super-huge repositories. That can be controlled by limiting
its use to the cases where an expert user configures Git to use it.
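For anyone who wants to try this by hand, here is a rough sketch of the
same strategy using the existing 'git multi-pack-index' subcommands.
The pack.* values and the 300m batch size are only illustrative numbers
I made up for this sketch, not recommendations, and the maintenance
task may sequence these steps differently:

  # Optionally tune how hard pack-objects works on deltas.
  git config pack.window 250
  git config pack.depth 50

  # Make sure a multi-pack-index covers the current pack-files.
  git multi-pack-index write

  # Combine packs whose "expected size" fits in the batch into one
  # new pack-file, rewriting the multi-pack-index to reference it.
  git multi-pack-index repack --batch-size=300m

  # Delete pack-files that no longer have any objects referenced by
  # the multi-pack-index. Concurrent readers stay safe because the
  # repack above did not delete anything itself.
  git multi-pack-index expire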
I remain open to recommendations from others with more experience in
delta compression who can suggest alternatives.

tl;dr: the incremental-repack isn't the most space-efficient thing we
can do, and that's by design.

Thanks,
-Stolee