On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
>
> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>
> In an environment where the multi-pack-index is useful, it is due
> to many pack-files and an inability to repack the object store
> into a single pack-file. However, it is likely that many of these
> pack-files are rather small, and could be repacked into a slightly
> larger pack-file without too much effort. It may also be important
> to ensure the object store is highly available and the repack
> operation does not interrupt concurrent git commands.
>
> Introduce a 'repack' verb to 'git multi-pack-index' that takes a
> '--batch-size' option. The verb will inspect the multi-pack-index
> for referenced pack-files whose size is smaller than the batch
> size, until collecting a list of pack-files whose sizes sum to
> larger than the batch size. Then, a new pack-file will be created
> containing the objects from those pack-files that are referenced
> by the multi-pack-index. The resulting pack is likely to actually
> be smaller than the batch size due to compression and the fact
> that there may be objects in the pack-files that have duplicate
> copies in other pack-files.

This highlights an optimization problem: how do we pick the batches
optimally? Ideally we would pick packs that share many of the same
objects, so they can be deduplicated completely; the next best choice
would be packs whose objects are very similar, such that they delta
very well. (This assumes that the sum of the resulting pack sizes is
a metric we would want to optimize for eventually.) For now it seems
we just take an arbitrary cut of "small" packs.

> The current change introduces the command-line arguments, and we
> add a test that ensures we parse these options properly. Since
> we specify a small batch size, we will guarantee that future
> implementations do not change the list of pack-files.

This is another clever trick that makes the test correct despite
there being no implementation yet. :-)

> +repack::
> +    When given as the verb, collect a batch of pack-files whose
> +    size are all at most the size given by --batch-size,

okay.

> but
> +    whose sizes sum to larger than --batch-size.

or less if there are not enough packs. Now it would be interesting
whether we can specify an upper bound (e.g. my laptop has 8GB of RAM;
can I use this incremental repacking optimally by telling git to make
batches of at most 7.5GB?). The answer seems to follow...

> The batch is
> +    selected by greedily adding small pack-files starting with
> +    the oldest pack-files that fit the size. Create a new pack-
> +    file containing the objects the multi-pack-index indexes
> +    into thos pack-files, and rewrite the multi-pack-index to

those

> +    contain that pack-file. A later run of 'git multi-pack-index
> +    expire' will delete the pack-files that were part of this
> +    batch.

... but the optimization seems to be about getting rid of the oldest
packs first rather than getting as close to the batch size as
possible. (Another way to look at it: find the subset of packs, each
smaller than the batch size, whose summed size is the smallest value
above the batch size; see the sketches at the end of this mail.)

I guess the strategy of picking the oldest packs is just the easiest
to implement and should be sufficient for now, but memory bounds
might be interesting to keep in mind, just as the optimal packing
from above.
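
For concreteness, here is a rough sketch in C of the greedy selection
the documentation above describes. To be clear, none of this is the
patch's actual code: struct pack_info, pick_batch(), and the
assumption that the array is already sorted oldest-first are all
invented for illustration.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for however the midx tracks its packs. */
struct pack_info {
	const char *name;
	uint64_t size;
};

/*
 * Greedy selection as documented: walk packs from oldest to
 * newest, take every pack whose size is at most batch_size, and
 * stop as soon as the selected sizes sum to more than batch_size.
 * Fills 'chosen' with the selected indices and returns how many
 * were chosen (possibly too few to exceed batch_size, if there
 * are not enough small packs).
 */
static size_t pick_batch(const struct pack_info *packs, size_t nr,
			 uint64_t batch_size, size_t *chosen)
{
	uint64_t total = 0;
	size_t out = 0;
	size_t i;

	for (i = 0; i < nr && total <= batch_size; i++) {
		if (packs[i].size > batch_size)
			continue; /* too big to be a batch member */
		chosen[out++] = i;
		total += packs[i].size;
	}
	return out;
}

The point being: this stops at the *first* sum above the threshold,
not the best one.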
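And a brute-force sketch of the "optimal" variant I was alluding to,
reusing the invented struct pack_info from above. Again, this is only
a thought experiment, not a suggestion for the patch: it is
exponential in the number of packs, and it assumes the caller has
already filtered out packs larger than the batch size.

/*
 * Among all subsets of at most 31 packs (each of size <=
 * batch_size), return the bitmask of the subset whose summed size
 * is the smallest value strictly above batch_size, or 0 if no
 * subset gets above it.
 */
static uint32_t best_batch(const struct pack_info *packs, size_t nr,
			   uint64_t batch_size)
{
	uint64_t best = UINT64_MAX;
	uint32_t best_mask = 0;
	uint32_t mask;

	for (mask = 1; mask < (1u << nr); mask++) {
		uint64_t total = 0;
		size_t i;

		for (i = 0; i < nr; i++)
			if (mask & (1u << i))
				total += packs[i].size;
		if (total > batch_size && total < best) {
			best = total;
			best_mask = mask;
		}
	}
	return best_mask;
}

Even this ignores the deduplication/delta angle from the start of
this mail: a truly optimal batch would have to look at object
overlap, not just file sizes.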