Re: [PATCH 4/5] multi-pack-index: prepare 'repack' verb

On 12/10/2018 8:54 PM, Stefan Beller wrote:
On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' verb to 'git multi-pack-index' that takes a
'--batch-size' option. The verb will inspect the multi-pack-index
for referenced pack-files whose size is smaller than the batch
size, until collecting a list of pack-files whose sizes sum to
larger than the batch size. Then, a new pack-file will be created
containing the objects from those pack-files that are referenced
by the multi-pack-index. The resulting pack is likely to actually
be smaller than the batch size due to compression and the fact
that there may be objects in the pack-files that have duplicate
copies in other pack-files.
This highlights an optimization problem: how do we pick the
batches optimally?
Ideally we'd pick packs that share many identical objects, so
they can be deduplicated completely; next best would be packs
whose objects are very similar, such that they delta very well.
(Assuming that the sum of the resulting pack sizes is the metric
we'd want to optimize for eventually.)

For now it seems we just take a random cut of "small" packs.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.
This is another clever trick that keeps the test correct even
though there is no implementation yet. :-)

+repack::
+       When given as the verb, collect a batch of pack-files whose
+       size are all at most the size given by --batch-size,
okay.

  but
+       whose sizes sum to larger than --batch-size.
or less if there are not enough packs.

If there are not enough packs to reach the batch size, then we do nothing.

Now it would be interesting if we could specify an upper bound
(e.g. my laptop has 8 GB of RAM; can I use this incremental
repacking optimally by telling git to make batches of at most
7.5 GB?). The answer seems to follow...

Well, this gets us to the "unsigned long" problem and how it pervades the OPT_MAGNITUDE() code and everything that uses that kind of parameter. It means that Windows users can specify a batch size of at most (1 << 32) - 1 bytes. The batch size is intended to be large enough to make a reasonable change in the pack organization without rewriting the entire repo (e.g. 1GB).

If there is a word other than "repack" that means "collect objects from a subset of pack-files and place them into a new pack-file, doing a small amount of redeltification in the process", then I am open to suggestions.

I tried (briefly) to fix this dependence, but it got too large for me to handle at the same time as this change. I'll consider revisiting it again later.
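
(As a small standalone illustration of that ceiling, not taken from the patch: OPT_MAGNITUDE() stores its result into an "unsigned long", and on Windows, which is LLP64, that type is only 32 bits wide.)

#include <stdio.h>
#include <limits.h>

int main(void)
{
	/* On LLP64 Windows this prints 4 and 4294967295, i.e. (1 << 32) - 1;
	 * on LP64 Linux/macOS it prints 8 and 18446744073709551615. */
	printf("sizeof(unsigned long) = %zu\n", sizeof(unsigned long));
	printf("ULONG_MAX = %lu\n", ULONG_MAX);
	return 0;
}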

   The batch is
+       selected by greedily adding small pack-files starting with
+       the oldest pack-files that fit the size. Create a new pack-
+       file containing the objects the multi-pack-index indexes
+       into thos pack-files, and rewrite the multi-pack-index to
those

+       contain that pack-file. A later run of 'git multi-pack-index
+       expire' will delete the pack-files that were part of this
+       batch.
... but the optimization seems to be more about getting rid
of the oldest packs first than about getting as close as
possible to the batch size. (E.g. another way to look at this
is to find the subset of packs, each smaller than the batch
size, whose sum is the smallest total above the batch size.)

You are describing the subset-sum problem, with an additional optimization component. While there are dynamic programming approaches that are usually effective (if the sum is small), this problem is NP-complete, and hence could lead to complications.
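
Just to make the trade-off concrete, here is a rough standalone sketch (an illustration only, not something proposed for git) of the pseudo-polynomial DP for "smallest subset sum at or above the target". It needs O(nr * total) time and O(total) memory, so it is only reasonable when pack sizes are counted in coarse units (say, megabytes), which is part of why a greedy pick is more attractive here:

#include <stddef.h>
#include <stdlib.h>

/*
 * Smallest achievable subset sum that is >= target, or 0 if even the
 * sum of all sizes falls short. Classic subset-sum reachability DP.
 */
size_t smallest_sum_at_least(const size_t *sizes, size_t nr, size_t target)
{
	size_t total = 0, best = 0, i, s;
	char *reachable;

	for (i = 0; i < nr; i++)
		total += sizes[i];
	if (total < target)
		return 0; /* not enough packs: do nothing */

	reachable = calloc(total + 1, 1);
	if (!reachable)
		return 0;
	reachable[0] = 1;

	for (i = 0; i < nr; i++) {
		if (!sizes[i])
			continue;
		for (s = total; s >= sizes[i]; s--)
			if (reachable[s - sizes[i]])
				reachable[s] = 1;
	}

	for (s = target; s <= total; s++) {
		if (reachable[s]) {
			best = s;
			break;
		}
	}

	free(reachable);
	return best;
}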

I guess that the strategy of picking the oldest packs is just
the easiest to implement and should be sufficient for now, but
memory bounds might be interesting to keep in mind, as is the
optimal packing question from above.

It is easy to implement, and fast. Further, we have the heuristic that a pack's modification time correlates with when its objects were introduced to the repository, and hence that those objects may compress well when placed together.
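
If it helps to see it spelled out, here is a rough sketch of that greedy selection (the struct and function names are made up for illustration and are not taken from the patch): walk the packs oldest first, skip anything larger than the batch size, and stop as soon as the running total exceeds it; if we never get there, select nothing.

#include <stddef.h>
#include <stdint.h>
#include <time.h>

struct pack_summary {
	uint64_t size;   /* on-disk size of the pack-file */
	time_t mtime;    /* stands in for the "age" of the pack */
	int selected;
};

/*
 * 'packs' is assumed to be sorted oldest-first by mtime. Returns the
 * total size of the selected batch, or 0 if the small packs do not
 * add up to more than batch_size (in which case nothing is selected).
 */
uint64_t select_repack_batch(struct pack_summary *packs, size_t nr,
			     uint64_t batch_size)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr && total <= batch_size; i++) {
		if (packs[i].size > batch_size)
			continue; /* only packs of at most batch_size qualify */
		packs[i].selected = 1;
		total += packs[i].size;
	}

	if (total <= batch_size) {
		/* not enough small packs: undo the selection and do nothing */
		for (i = 0; i < nr; i++)
			packs[i].selected = 0;
		return 0;
	}
	return total;
}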

Thanks,

-Stolee



