Re: [PATCH] pack-format: correct multi-pack-index description

Derrick Stolee <stolee@xxxxxxxxx> · Mon, 10 Feb 2020 10:02:01 -0500

On 2/10/2020 9:50 AM, Johannes Berg wrote:
> On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote:
> 
>> Part of my initial plan was to have this incremental file format.
>> The commit-graph uses a very similar mechanism. The difference may
>> be that you likely allow multiple .midx files found by scanning the
>> pack directory, 
> 
> Right, just scan and use any midx that exist, then compare the packs in
> there against all the packs found, and then remove any packs that
> actually *are* in an midx from the search list. That leaves you with all
> information, but optimised by midx where possible.
> 
>> but I would expect something like the
>> "commit-graph-chain" file that provides an ordered list of the
>> incremental files. This can be important for deciding when to merge
>> layers or delete old files, and would be critical to the possibility
>> of converting reachability bitmaps to rely on a stable object order
>> stored in the multi-pack-index instead of pack-order.
> 
> Right, if we delete then we have to also remove any midx covering the
> deleted pack, that's pretty rare in bup as a backup tool though.
> 
>> The reason the multi-pack-index has not become incremental is that
>> VFS for Git no longer needs to write it very often. We write the
>> entire multi-pack-index during a background job that triggers once
>> per day. If we needed to write it more frequently, then the incremental
>> format would be more important to us.
> 
> So, wait, what if a new pack is created? Does it just get used in
> addition to the multi-pack-index, if it's not covered by it, like I
> described above?
> 
> If so, I guess it wouldn't actually really matter here. I was afraid
> (but didn't check yet) that git would always use only the single multi-
> pack-index file, and not also search additional packs, so that it always
> has to be maintained in "perfect order" ...

Git loads the multi-pack-index file, which includes a sorted list of
the packs it covers. It then scans the "pack" directory for pack-indexes
and checks if they are covered by the multi-pack-index. If not, then
Git will add them to the packed_git struct and use them as normal.
The hope is that this list of "uncovered" packs is small compared to
the data covered by the multi-pack-index.
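
To make the lookup above concrete, here is a rough Python sketch (not
Git's actual C code). The function name and the idea that the pack names
have already been parsed out of the multi-pack-index are assumptions
made for illustration only:

```python
from pathlib import Path


def uncovered_pack_indexes(pack_dir, midx_pack_names):
    """Return pack-index file names in pack_dir that the MIDX does not cover.

    midx_pack_names is a hypothetical, already-parsed list of the index
    names recorded in the multi-pack-index (e.g. "pack-1234.idx").
    """
    covered = set(midx_pack_names)
    # Scan the pack directory once and keep only the indexes the MIDX
    # does not list; these would be loaded and used as normal packs.
    return sorted(
        path.name
        for path in Path(pack_dir).glob("pack-*.idx")
        if path.name not in covered
    )
```

Objects in the covered packs are found through the multi-pack-index;
only the (hopefully short) list returned here needs per-pack lookups.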

This allows Git to continue functioning after an action like "git fetch"
that adds a new pack but may not want to rewrite the multi-pack-index.

Our background maintenance essentially runs these commands:

 1. git multi-pack-index write
 2. git multi-pack-index expire
 3. git multi-pack-index repack

Step 1 ensures all packs are pulled into the multi-pack-index. Step 2
deletes any pack-files whose objects are contained in newer pack-files.
Step 3 creates a new pack-file containing all objects from a set of
small pack-files (using the --batch-size=X option). This process helps
incrementally reduce the size and number of packs. That may be helpful
for your backup tool, too.
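
If you wanted to drive the same three steps from a script, a minimal
Python sketch might look like the following. The helper names, the
default batch size, and the use of subprocess are assumptions of this
example, not how our background job is actually implemented:

```python
import subprocess


def maintenance_commands(batch_size):
    """Build the three multi-pack-index commands, in the order listed above."""
    base = ["git", "multi-pack-index"]
    return [
        base + ["write"],
        base + ["expire"],
        base + ["repack", f"--batch-size={batch_size}"],
    ]


def run_maintenance(repo_dir, batch_size=2 * 1024**3):
    """Run each step inside repo_dir, stopping at the first failure.

    The default batch size (in bytes) is an arbitrary placeholder.
    """
    for argv in maintenance_commands(batch_size):
        subprocess.run(argv, cwd=repo_dir, check=True)
```

Calling run_maintenance("/path/to/repo") from a daily job would
approximate the cadence described earlier.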

Perhaps after an incremental multi-pack-index is added, then Git could
(optionally) have a mode that only checks the multi-pack-index to
avoid scanning the packs directory. It would require inserting a
multi-pack-index write into the index-pack logic.

I'm not sure if that mode would be helpful, since the pack directory
scan is typically done once per command and is relatively fast.

>> That said: if someone wanted to contribute an incremental format,
>> then I would be happy to review it!
> 
> I might still get motivated to do so :-)

YOU CAN DO IT! (Did that help?)

-Stolee