Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking

Jeff King <peff@xxxxxxxx> · Tue, 20 Sep 2022 16:06:26 -0400

On Tue, Sep 20, 2022 at 03:49:07PM -0400, Taylor Blau wrote:

> > Is that true also of "multi-pack-index repack"? I guess it would depend
> > on how you invoke it. I admit I don't think I've ever used it myself,
> > since the new "repack --geometric --write-midx" approach matches my
> > mental model. I'm not sure when you'd actually run the "multi-pack-index
> > repack" command. But if you did it with --batch-size=0 (the default), I
> > think we'd end up traversing every object in history.
> 
> We could probably benefit from it, but only if there is a MIDX bitmap
> around to begin with. For instance, you could first try and lookup each
> object you're missing a namehash for and then read its value out of the
> hashcache extension in the MIDX bitmap (assuming the MIDX bitmap exists,
> and has a hashcache).
> 
> But if you don't have a MIDX bitmap, or it has a poor selection of
> commits, then you're out of luck.

You could also use a pack bitmap if there is one (and it's one of the
included packs). But yes, if you have neither, it's no help.

Mostly I'm just concerned that this could have an outsized negative
performance effect if you have a setup like:

  - you have a gigantic repository, say that takes 15 minutes to do a
    full "rev-list --objects" on (like linux.git with all its forks)

  - most of that is in one big pack, but you acquire new packs
    occasionally via pushes, etc

  - doing "git repack --geometric" rolls up the new packs, nicely
    traversing just the new objects

  - doing "git multi-pack-index repack" before your patch is fast-ish.
    It stuffs all the objects into a new pack. But after your patch, it
    does that 15-minute traversal.

But I don't know if that's even realistic, because I'm still wondering
why somebody would run "git multi-pack-index repack" in the first place.
And if they'd always do so with --batch-size anyway, which would
mitigate this (because it gives a geometric-ish approach where we leave
the huge pack untouched).

If it is, then one thing to consider is tying the "do the extra
traversal" feature to the presence / size of excluded packs. And
possibly considering the presence of a bitmap to indicate that it's
worth doing (assuming the optimization there is implemented).

But that sounds like a lot of work to get right, and again, I'm not
really sure of the benefit.

> > The old code went in object order within the midx. Is this sorted by
> > sha1, or the pack pseudo-order? If the former, then that will yield a
> > different order of objects inside pack-objects (since it is seeing the
> > packs in order of our m->pack_names array). I don't _think_ it matters,
> > but I just wanted to double check.
> 
> Good point. This ends up ordering the packs based on their SHA-1
> checksum, and probably should stick to the pack mtimes instead.
> 
> Unfortunately, we discard that information by the time we get to this
> point in midx_repack(). We don't even have it written durably in the
> MIDX, either, so we reconstruct it on-the-fly in
> fill_included_packs_batch() (see the `QSORT()` call there with
> `compare_by_mtime()`).
> 
> I agree that it probably doesn't matter in practice, but it's worth
> trying to match the existing behavior, at least.

Yeah, sorting the packs by mtime might be sensible. I know in the final
midx, we use object order to find the "preferred" pack. And you could
iterate the objects here, passing along their de-duped pack name. But I
don't think we have the objects here in that useful order; that is
really the order of the midx's .rev file, IIRC, and this is probably the
actual sha1 order.

-Peff