On Mon, Oct 21, 2024 at 04:23:57PM -0400, Taylor Blau wrote:

> On Sat, Oct 19, 2024 at 09:24:55PM -0400, Jeff King wrote:
> > There are a few things I don't like there, though:
> >
> >   - obviously reversing the tempfile name back to the idx is hacky. We
> >     could probably store the original idx that led us to this pack and
> >     use that.
>
> TBH, I don't think that this is all that hacky, but...

Mostly I was worried that there was some case where the filename would
not be the usual pack-$hash.idx that we expected. But having now looked
at the code quite a bit to produce the follow-on patch, I think these
names are always generated locally from sha1_pack_name(), etc.

Another variant of this approach is to actually teach index-pack a flag
saying "it's OK to overwrite an existing .idx file, even if the bytes
don't match". That's the simplest fix _conceptually_, in that it takes
us back to the pre-2.47 behavior. But plumbing that option through
would require a lot of ugly and confusing boilerplate.

> > So IMHO a cleaner fix is that we should keep the stuff we download from
> > the remote marked as temporary files, and then throw them away when we
> > complete the fetch (rather than just expecting index-pack to overwrite
> > them).
>
> ...this seems like a much better idea to me. We shouldn't be keeping the
> provided *.idx file given that we plan on overwriting it later on in the
> process.

I dug into this a little more, and... I don't love the design of the
dumb-http protocol.

A big question here is: what should the life cycle of those downloaded
.idx files be? And how much should we try to avoid re-downloading them
(keeping in mind that the .idx for linux.git is almost 300MB)?

In a normal dumb-http clone, something like this happens with the
current code:

  1. We see we need some object X.

  2. We download _all_ of the .idx files offered by the remote, since
     we have no idea in which one we'll find X. (There's an obvious
     low-hanging optimization to stop when we find X, but in the worst
     case we need all of them anyway.)

  3. For each one, we download the matching pack if needed. In most
     cases you need all of them, but there might be some pack P for
     which we only have the .idx. We leave that bare .idx sitting in
     the objects directory.

So now we have all of the other side's .idx files locally, but not
necessarily all of their packs. And here are some interesting follow-on
cases:

  a. You do a follow-up fetch, and now you need object Y. The saved idx
     of P is useful, because you can avoid downloading it again. Yay.
     You should likewise avoid re-downloading the idx of any pack you
     did fetch, because you'll have pack-$hash.idx locally (just your
     generated copy, not the one you got from the other side).

  b. The other side repacks, and then you fetch. Now all their pack
     names have changed, so your saved idx of P is useless. But you
     also end up downloading all of the new .idx files anyway (even
     though you probably already have 99% of the objects).

  c. You repack, then fetch. Now the other side has the same packs, but
     your local copies all have new names. You re-download every idx
     from the other side, not realizing that you already saw them last
     time.

With my patch, case (a) doesn't save P (because now we treat it like a
tempfile), so we re-download it unnecessarily. That's a little worse. I
think the other two cases are largely the same.

But there's a flip side here. What if we kept the .idx files we got
from the other side in objects/pack/remote-idx/, basically as a cache?
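Concretely, I'm imagining something like the sketch below. To be clear,
none of this exists today: remote_idx_cache_path() and have_cached_idx()
are names I just made up, and a real patch would use our strbuf/xstrfmt
helpers rather than raw libc. The walker would just consult the cache
before hitting the network:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>

  /*
   * Hypothetical: map a pack hash to its cached idx, e.g.,
   * objects/pack/remote-idx/pack-1234abcd.idx.
   */
  static char *remote_idx_cache_path(const char *objdir, const char *hex)
  {
          size_t len = strlen(objdir) + strlen(hex) + 32;
          char *path = malloc(len);

          if (path)
                  snprintf(path, len, "%s/pack/remote-idx/pack-%s.idx",
                           objdir, hex);
          return path;
  }

  /* Skip the download entirely if we already have that idx cached. */
  static int have_cached_idx(const char *objdir, const char *hex)
  {
          struct stat st;
          char *path = remote_idx_cache_path(objdir, hex);
          int ret = path && !stat(path, &st);

          free(path);
          return ret;
  }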
Now case (c) avoids a lot of pointless downloading, though it comes at
the cost of storing those extra .idx files (which, before we repack,
are basically duplicates of the .idx files we generated ourselves; if
we really wanted to get clever, we could probably hard-link them or
something).

We didn't improve (b). And in fact, we've created a new issue: we're
still holding on to the old cached .idx files, so we'd need some scheme
for getting rid of them. I suppose that happens when we fetch the other
side's objects/info/packs file and see that they're no longer mentioned
(though if you want to be pedantic, this really should be a per-remote
cache, since in theory two remotes could hold the same pack).

So... what does it all mean for this fix?

I think the patch I posted earlier, to keep the .idx files as separate
tempfiles, is only slightly worse than the status quo (it doesn't keep
the P.idx that we didn't strictly need). And given my general feelings
towards the dumb-http protocol, maybe that is the right place to stop.

It just feels like we _could_ be improving things with a better-managed
.idx cache, and that would fix the regression at the same time. But I'm
not sure it's worth sinking time and complexity into this terribly
inefficient protocol.

(Part of why I laid all this out is that until I wrote all of the
above, I wasn't completely convinced that case (a) was the only one
that suffered, and that the damage there was so mild.)

-Peff
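P.S. If anybody does want to pursue the cache idea, the expiry half
could stay pretty small, too. Again purely a sketch with made-up names
(and ignoring the per-remote pedantry from above for simplicity);
"names" would come from parsing the freshly fetched objects/info/packs:

  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Hypothetical: drop cached idx files the remote no longer lists. */
  static void prune_remote_idx_cache(const char *cachedir,
                                     const char **names, int nr)
  {
          DIR *dir = opendir(cachedir);
          struct dirent *de;

          if (!dir)
                  return;
          while ((de = readdir(dir))) {
                  char path[4096];
                  int i, found = 0;

                  for (i = 0; i < nr; i++)
                          if (!strcmp(de->d_name, names[i]))
                                  found = 1;
                  if (found || !strstr(de->d_name, ".idx"))
                          continue;
                  snprintf(path, sizeof(path), "%s/%s",
                           cachedir, de->d_name);
                  unlink(path);
          }
          closedir(dir);
  }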