On Mon, Oct 21, 2024 at 04:23:57PM -0400, Taylor Blau wrote:

> On Sat, Oct 19, 2024 at 09:24:55PM -0400, Jeff King wrote:
> > There are a few things I don't like there, though:
> >
> >   - obviously reversing the tempfile name back to the idx is hacky. We
> >     could probably store the original idx that led us to this pack and
> >     use that.
>
> TBH, I don't think that this is all that hacky, but...

Mostly I was worried that there was some case where the filename would
not be the usual pack-$hash.idx that we expected. But having now looked
at the code quite a bit to produce the follow-on patch, I think these
names are always generated locally from sha1_pack_name(), etc.

Another variant of this approach is to actually teach index-pack a flag
saying "it's OK to overwrite an existing .idx file, even if the bytes
don't match". That's the simplest fix _conceptually_, in that it takes
us back to the pre-2.47 behavior. But plumbing that option through
would require a lot of ugly and confusing boilerplate.

> > So IMHO a cleaner fix is that we should keep the stuff we download from
> > the remote marked as temporary files, and then throw them away when we
> > complete the fetch (rather than just expecting index-pack to overwrite
> > them).
>
> ...this seems like a much better idea to me. We shouldn't be keeping the
> provided *.idx file given that we plan on overwriting it later on in the
> process.

I dug into this a little more, and... I don't love the design of the
dumb-http protocol.

A big question here is: what should the life cycle of those downloaded
.idx files be? And how much should we try to avoid re-downloading them
(keeping in mind that the .idx for linux.git is almost 300MB)?

In a normal dumb-http clone, something like this happens with the
current code:

  1. We see we need some object X.

  2. We download _all_ of the .idx files offered by the remote, since
     we have no idea in which one we'll find X. (There's an obvious
     low-hanging optimization to stop when we find X, but in the worst
     case we need all of them anyway.)

  3. For each one, we download the matching pack if needed. In most
     cases you need all of them, but there might be some pack P for
     which we only have the .idx. We leave that bare .idx sitting in
     the objects directory.

So now we have all of the other side's .idx files locally, but not
necessarily all of their packs. And here are some interesting follow-on
cases:

  a. You do a follow-up fetch, and now you need object Y. The saved idx
     of P is useful, because you can avoid downloading it again. Yay.
     You should likewise avoid re-downloading the idx of any pack you
     did fetch, because you'll have pack-$hash.idx locally (just your
     generated copy, not the one you got from the other side).

  b. The other side repacks, and then you fetch. Now all their pack
     names have changed, so your saved idx of P is useless. But you
     also end up downloading all of the new .idx files anyway (even
     though you probably already have 99% of the objects).

  c. You repack, then fetch. Now the other side has the same packs, but
     your local copies all have new names. You re-download every idx
     from the other side, not realizing that you already saw them last
     time.

With my patch, case (a) doesn't save P (because now we treat it like a
tempfile), so we re-download it unnecessarily. That's a little worse. I
think the other two cases are largely the same.

But there's a flip side here. What if we kept the .idx files we got
from the other side in objects/pack/remote-idx/, basically as a cache?
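Concretely, I'm imagining something like the sketch below. To be clear,
none of this exists today: remote_idx_cache_path() and have_cached_idx()
are names I just made up, and a real patch would use our strbuf/xstrfmt
helpers rather than raw libc. The walker would just consult the cache
before hitting the network:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>

  /*
   * Hypothetical: map a pack hash to its cached idx, e.g.,
   * objects/pack/remote-idx/pack-1234abcd.idx.
   */
  static char *remote_idx_cache_path(const char *objdir, const char *hex)
  {
          size_t len = strlen(objdir) + strlen(hex) + 32;
          char *path = malloc(len);

          if (path)
                  snprintf(path, len, "%s/pack/remote-idx/pack-%s.idx",
                           objdir, hex);
          return path;
  }

  /* Skip the download entirely if we already have that idx cached. */
  static int have_cached_idx(const char *objdir, const char *hex)
  {
          struct stat st;
          char *path = remote_idx_cache_path(objdir, hex);
          int ret = path && !stat(path, &st);

          free(path);
          return ret;
  }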
Now case (c) avoids a lot of pointless downloading, though it comes at
the cost of storing those extra .idx files (which, before we repack,
are basically duplicates of the .idx files we generated ourselves; if
we really wanted to get clever, we could probably hard-link them or
something).

We didn't improve (b). And in fact, we've created a new issue: we're
still holding on to the old cached .idx files, so we'd need some scheme
for getting rid of them. I suppose that happens when we fetch the other
side's objects/info/packs file and see that they're no longer mentioned
(though if you want to be pedantic, this really should be a per-remote
cache, since in theory two remotes could hold the same pack).

So... what does it all mean for this fix?

I think the patch I posted earlier, to keep the .idx files as separate
tempfiles, is only slightly worse than the status quo (it doesn't keep
the P.idx that we didn't strictly need). And given my general feelings
towards the dumb-http protocol, maybe that is the right place to stop.

It just feels like we _could_ be improving things with a better-managed
.idx cache, and that would fix the regression at the same time. But I'm
not sure it's worth sinking time and complexity into this terribly
inefficient protocol.

(Part of why I laid all this out is that until I wrote all of the
above, I wasn't completely convinced that case (a) was the only one
that suffered, and that the damage there was so mild.)

-Peff
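P.S. If anybody does want to pursue the cache idea, the expiry half
could stay pretty small, too. Again purely a sketch with made-up names
(and ignoring the per-remote pedantry from above for simplicity);
"names" would come from parsing the freshly fetched objects/info/packs:

  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Hypothetical: drop cached idx files the remote no longer lists. */
  static void prune_remote_idx_cache(const char *cachedir,
                                     const char **names, int nr)
  {
          DIR *dir = opendir(cachedir);
          struct dirent *de;

          if (!dir)
                  return;
          while ((de = readdir(dir))) {
                  char path[4096];
                  int i, found = 0;

                  for (i = 0; i < nr; i++)
                          if (!strcmp(de->d_name, names[i]))
                                  found = 1;
                  if (found || !strstr(de->d_name, ".idx"))
                          continue;
                  snprintf(path, sizeof(path), "%s/%s",
                           cachedir, de->d_name);
                  unlink(path);
          }
          closedir(dir);
  }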