Re: [PATCH v3 08/16] midx: allow marking a pack as preferred

Jeff King <peff@xxxxxxxx> · Tue, 30 Mar 2021 03:11:48 -0400

On Mon, Mar 29, 2021 at 05:15:12PM -0400, Taylor Blau wrote:

> There are two solutions to the problem:
> 
>   - You could write the mtimes in the MIDX itself. This would give you a
>     single point of reference, and resolve the TOCTOU race I just
>     described.
> 
>   - Or, you could forget about mtimes entirely and let the MIDX dictate
>     the pack ordering itself. That resolves the race in a
>     similar-but-different way.
> 
> Of the two, I prefer the latter, but I think it introduces functionality
> that we don't necessarily need yet.

Yeah, I'd strongly favor the latter over the former. The reason to go
with the solution you have in this series is that it doesn't require
changing anything in the on-disk midx format, and we think it is good
enough. But once we are going to change the on-disk format, we might as
well give the writing side as much flexibility as possible.

Of course the mtimes themselves are really just numbers, so in a sense
the two are really equivalent. ;)
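
To make the "just numbers" point concrete, here is a toy sketch in Python. It is not Git code and says nothing about the real on-disk midx layout: ToyMidx, order_key, write_with_mtimes() and write_with_explicit_order() are all hypothetical names invented for illustration. It only shows that recording mtimes at write time and recording an explicit rank both amount to storing one number per pack that readers sort by.

```python
from dataclasses import dataclass


@dataclass
class ToyMidx:
    """Hypothetical stand-in for a midx; NOT the real on-disk format."""
    pack_names: tuple  # packs covered by this index
    order_key: dict    # pack name -> number recorded at write time

    def pack_order(self):
        # Readers sort only by the recorded numbers (name breaks ties),
        # so later changes on the filesystem cannot affect the result.
        return sorted(self.pack_names,
                      key=lambda name: (self.order_key[name], name))


def write_with_mtimes(mtimes):
    # Option 1: snapshot the mtimes into the index at write time.
    return ToyMidx(tuple(mtimes), dict(mtimes))


def write_with_explicit_order(ordered_names):
    # Option 2: the writer picks any order it likes and records the rank.
    ranks = {name: i for i, name in enumerate(ordered_names)}
    return ToyMidx(tuple(ordered_names), ranks)


# Made-up pack names and timestamps, purely for illustration.
mtimes = {"pack-a": 300, "pack-b": 100, "pack-c": 200}
by_mtime = write_with_mtimes(mtimes)
explicit = write_with_explicit_order(["pack-b", "pack-c", "pack-a"])

# Someone touches pack-b after the index was written.
mtimes["pack-b"] = 400
```

Both toy indexes give the same order, and touching pack-b afterwards changes neither, since readers never consult the live mtimes. The explicit variant just lets the writer choose numbers that are not tied to the filesystem at all.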

> That's because the objects within
> the packs are still ordered as such, and so the compression we get in
> the packs is just as good as it is for single-pack bitmaps. It's only at
> the objects between pack boundaries that any runs of 1s or 0s might be
> interrupted, but there are far fewer pack boundaries than objects, so it
> doesn't seem to matter in practice.

Right. The absolute worst case is a large number of single-object packs,
in which case the bitmap order becomes essentially random with respect
to history (because it would be sorted by sha1 of the packs).

The effect _might_ be measurable in more real-world cases, like say one
big pack and 100 pushes each with a handful of commits. The big pack
would be in good shape, but you have a lot of extra pack boundaries that
hurt the bitmap compression.

But in practice, generating bitmaps is expensive enough that you'd
probably want to roll up some of the packs anyway (and that is certainly
what we are doing at GitHub, using your "repack --geometric"). So you'd
usually end up with one big pack representing most of history, and then
a handful of roll-up packs.

So I'm a little curious whether one could even measure the impact of,
say, 100 little packs. But not curious enough to actually run the
experiment, because even that case is not really that interesting.
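
For anyone who does want to poke at the intuition, here is a toy model in Python. It is not Git code and it is not a measurement. It assumes a bitmap is cheap when it has few runs of identical bits, models reachability as "everything older than some point in history", and orders packs by a SHA-1 of their contents as a stand-in for sorting by pack checksum. The helper names (count_runs, bitmap_order, runs_for) are made up for this sketch.

```python
import hashlib


def count_runs(bits):
    """Count maximal runs of equal values, a rough proxy for RLE size."""
    return sum(1 for i, b in enumerate(bits) if i == 0 or b != bits[i - 1])


def bitmap_order(packs):
    """Concatenate packs sorted by a hash of their contents.

    Objects keep their original (history) order inside each pack.
    The hash is only a stand-in for "sorted by sha1 of the packs".
    """
    def pack_key(pack):
        return hashlib.sha1(repr(pack).encode()).hexdigest()

    return [obj for pack in sorted(packs, key=pack_key) for obj in pack]


def runs_for(packs, reachable):
    """Runs in the reachability bitmap under the multi-pack ordering."""
    return count_runs([obj in reachable for obj in bitmap_order(packs)])


N = 1000
history = list(range(N))           # object ids, newest first
reachable = set(range(250, N))     # toy bitmap: all but the newest 250

# One pack holding all of history.
one_pack = [history]

# One big pack (older half) plus 100 small "pushes" of 5 objects each.
big_plus_pushes = [history[500:]] + [history[i:i + 5] for i in range(0, 500, 5)]

# Worst case: every object in its own pack.
single_object_packs = [[obj] for obj in history]

for label, packs in [("one pack", one_pack),
                     ("big pack + 100 pushes", big_plus_pushes),
                     ("single-object packs", single_object_packs)]:
    print(f"{label}: {len(packs)} packs, {runs_for(packs, reachable)} runs")
```

In this model each pack can add at most two runs, so the run count is bounded by roughly twice the number of packs no matter how many objects they hold. That is the "far fewer pack boundaries than objects" argument in miniature. Real reachability bitmaps are not contiguous ranges, so treat the numbers as illustrative only.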

> Anyway, I think that you know all of that already (mostly because we
> thought aloud together when I originally brought this up), but I figure
> that this detail may be interesting for other readers, too.

Indeed. And I know that you know everything I just wrote, but I agree
it's nice to get a record of these discussions onto the list. :)

-Peff