Re: [PATCH 05/24] midx: implement `DISP` chunk

Taylor Blau <me@xxxxxxxxxxxx> · Wed, 13 Dec 2023 13:28:44 -0500

Hi Peff,

On Tue, Dec 12, 2023 at 03:03:32AM -0500, Jeff King wrote:
> [...]

Well this took me longer to respond to than I thought it would ;-).

> So it kind of seems to me that this would "just work" if
> try_partial_reuse() did something like for the midx case:
>
>   - convert offset of base object into an object id hash using the pack
>     revindex (similar to offset_to_pack_pos)
>
>   - look up the object id in the midx to get a pack/offset combo
>
>   - use the midx revindex to convert that into a bit position
>
>   - check if that bit is marked for verbatim reuse
>
> The assumption there is that in the second step, the midx has resolved
> duplicates by putting all of pack A before pack B, and so on, as above.
> It also assumes that we are trying verbatim reuse in the same order
> (though a different order would not produce wrong results, it would only
> result in less verbatim reuse).

After much thinking, I agree with your conclusion here. Which is great
news indeed, since it allows us to implement multi-pack reuse without
requiring that the candidate packs be disjoint with respect to their
objects.

Even though we have some protections in place for ensuring these packs'
disjointed-ness, I agree with Junio upthread that this is likely the
most fragile part of this series. That is, even though we check in
midx.c::get_sorted_entries() that the marked packs are disjoint, if we
miss something, the results would be rather bad. At best it would result
in sending a corrupt packfile to the client, and at worst it could
result in permanent data corruption if repacking a repository with
multi-pack reuse enabled.

> If we assume that any cross-pack deltas are a problem, we could always
> just skip them for verbatim reuse. That is, once we look up the object
> id in the midx, we can see if it's in the same pack we're currently
> processing. If not, we could punt and let the non-verbatim code paths
> handle it as usual.

Let me think aloud for a second, since it took me quite a lot of
thinking to fully wrap my head around this. Suppose we have two packs,
P1 and P2 where P1 has object A delta'd against B, and P2 has its own
copy of object B. If the MIDX chose the copy of B from P2, but also
decided to send P1 (which it chose from A), then if P1 is stored as an
OFS delta, we would be sending a corrupt packfile to the client.

I think there are a few things that we could reasonably do here:

  - Reject cross-pack deltas entirely (as you suggest). This is the
    easiest implementation choice in this already-complex series, and it
    doesn't paint us into a corner in the sense that we could relax
    these requirements in the future.

  - Turn any cross-pack deltas (which are stored as OFS_DELTAs) into
    REF_DELTAs. We already do this today when reusing an OFS_DELTA
    without `--delta-base-offset` enabled, so it's not a huge stretch to
    do the same for cross-pack deltas even when `--delta-base-offset` is
    enabled.

    This would work, but would obviously result in larger-than-necessary
    packs, as we in theory *could* represent these cross-pack deltas by
    patching an existing OFS_DELTA. But it's not clear how much that
    would matter in practice. I suspect it would have a lot to do with
    how you pack your repository in the first place.

  - Finally, we could patch OFS_DELTAs across packs in a similar fashion
    as we do today for OFS_DELTAs within a single pack on either side of
    a gap. This would result in the smallest packs of the three options
    here, but implementing this would be more involved.

    At minimum, you'd have to keep the reusable chunks list for all
    reused packs, not just the one we're currently processing. And you'd
    have to ensure that any bases which are a part of cross-pack deltas
    appear before the delta. I think this is possible to do, but would
    require assembling the reusable chunks list potentially in a
    different order than they appear in the source packs.

> And of course there's a bunch of hard work teaching all of those
> functions about midx's and multiple packs in the first place, but you
> already had to do all that in the latter part of your series, and it
> would still be applicable.

Yep, all of that is still a requirement here, and lives on in the
revised version of this series.

The naive implementation where we call try_partial_reuse() on every
object which is a candidate for reuse and check for cross-pack deltas
would work, but have poor performance. The reason is that we would be
doing a significant amount of cache-inefficient work to determine
whether or not the base for some delta/base-pair resides in the same
pack:

  - If you see a delta in some pack while processing a MIDX bitmap for
    reuse, you first have to figure out the location of its base in that
    same pack. (Note: this may or may not be the copy of the base object
    chosen by the MIDX).

  - To figure out whether or not the MIDX chose that copy, you would
    naively have to do something like:

      - Convert the base object's offset into a packfile position using
        the pack revindex.

      - Convert the base object's packfile position into an index
        position, which would then be used to obtain its OID.

      - Then binary search through the MIDX for that OID found in the
        previous step, filling out the MIDX entry for that object.

      - Toss out the cross-pack delta/base pair if the MIDX entry in the
        previous step indicates that the MIDX chose a copy of the base
        from a different pack than the one we're currently processing
        (i.e. where the delta resides).

That's rather inefficient. But, we can do better. Instead of going back
and forth through both the pack and MIDX's reverse index, we can simply
binary search in the MIDX's reverse index for the (pack_id, offset) pair
corresponding to the base.

If we find a match, then we know that the MIDX chose its copy of the
base object from the same pack, and we can reuse the delta/base-pair. If
we don't, then we know that the MIDX chose its copy of the base object
from a different pack, and we have to throw out the delta/base-pair.

This is a bit more involved than the naive implementation, but it's (a)
efficient, and (b) most of the code for it already exists in the form of
midx_to_pack_pos(), which implements a binary search over the MIDX
bitmap's pseudo-pack order.

With some light refactoring, we can repurpose this code to perform a
binary search for a given (pack, offset) pair instead of starting with a
MIDX lex position and converting it into the (pack, offset) pair. So
that works, and is what I've done in the revised version of this series.

There is one other snag, which is that we can no longer blindly reuse
whole-words from the reuse bitmap if we have non-disjoint packs. That
is, we can't do something like:

    while (pos < result->word_alloc && result->words[pos] == (eword_t)~0)
        pos++;

when processing anything but the first pack.

The reason is that we know the first pack has all duplicate object ties
broken in its favor, but we don't have the same guarantee for subsequent
packs. So we have to be more careful about which bits we reuse from
those subsequent packs, since we may inadvertently pick up a cross-pack
delta/base pair without inspecting it more closely.

As I mentioned, you can still perform this optimization over the first
pack, and I think that will be sufficient for most repositories. It's
not clear to me exactly how much this optimization is helping us in
contrast to all of the other work that pack-objects is doing, but that
is probably something worth measuring.

Thanks for the terrific suggestion. I'll clean up the results of trying
to implement it, and share it with the list soon (ideally before the end
of this week, after which I'm on vacation until the new year).

Thanks,
Taylor