Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped

Jeff King <peff@xxxxxxxx> · Wed, 13 Nov 2024 19:56:52 -0500

On Wed, Nov 13, 2024 at 10:26:56AM -0800, Jonathan Tan wrote:

> > The reason these window tests are broken is that the earlier "build pack
> > index for an existing pack" is now finding and storing deltas in a new
> > pack when it does this:
> > 
> >   git index-pack --promisor=message test-3.pack &&
> > 
> > But that command is indexing a pack that is not even in the repository's
> > object store at all! Yet it triggers a call to pack-objects that repacks
> > within that object store.
> 
> As far as I know, index-pack, when run as part of fetch, indexes a pack
> that's not in the repository's object store; it indexes a packfile in a
> temp directory. (So I don't think this is a strange thing to do.)

When fetching (or receiving a push), we use "index-pack --stdin" and do
write the resulting pack into the repository (and the command will
complain if there is no repository).

So I think what the test is doing above (using --promisor on a random
packfile) is questionable. It goes back to 1f52cdfacb (index-pack:
document and test the --promisor option, 2022-03-09). I wonder if that
test is actually valuable now that the --promisor option is actually
triggered via fetch.

If we restricted --promisor to work only with --stdin, that would deal
with both of the issues I saw. And I suspect (but didn't dig deeply or
test) that would be sufficient for how it is called within git. (I
wondered briefly if bundles might index in-place, but they seem to use
--stdin also).

Where it gets weirder to me is with quarantine directories (and maybe
this is what you meant above). On receiving a push, we "index --stdin"
into a temporary quarantine directory. If that kicks off a pack-objects
run, where does that pack-objects put its new pack? Within the
quarantined index-pack we set GIT_OBJECT_DIRECTORY to the quarantine and
add the original repo as an alternate. So I _think_ both the pushed-up
pack and the repacked promisor pack would go into the quarantine dir,
and then we'd migrate both (or neither) when we commit to the push.

Which is OK, but I don't know that I thought that far ahead when writing
the quarantine stuff long ago.

It's probably somewhat academic right now, as I'm not sure if you can
even push reliably into a promisor repo (and it doesn't look like
receive-pack knows about passing --promisor anyway). We don't quarantine
on fetch right now, though we have discussed it in the past (and I think
we should consider doing it).

So this may become more real in the future. I wonder if there is a way
to add a test to future-proof against changes to how the quarantine
system works. The theoretical problem case is if we did quarantine
fetches, but accidentally wrote the new promisor pack into the main
repo instead of the quarantine, and then a fetch rejected the incoming
pack (because of a hook, failed connectivity check, etc). Then we'd end
up with the new promisor pack when we shouldn't, which I guess could
move objects from that incoming pack that we rejected into the main
repo, despite the quarantine?

I can't think of a way to test that now, without the quarantine-on-fetch
feature existing.

> > I'd guess in the real world, we'd only pass that option when indexing
> > packs that we just fetched. But as a bystander to this feature, it feels
> > quite odd to me that index-pack, which I generally consider a "read
> > only" operation except for the index it was asked to write, would be
> > creating a new pack like this. I didn't follow the topic closely enough
> > to comment more intelligently, but would it be possible for the caller
> > of index-pack to trigger the repack as an independent step?
> 
> I thought of that, but as far as I know, during a fetch, index-pack is
> the only time in which the objects in the fetched pack are uncompressed
> in memory. There have been concerns about the performance of various
> ways of solving the promisor-object-and-GC bug, so I took an approach
> that minimizes the performance hit as much as possible, by avoiding yet
> another uncompression (we need to uncompress the objects to find their
> outgoing links, so that we know what to repack).

Hmm, yeah. I can see the appeal of doing the processing there. Kicking
off a pack-objects can be similarly expensive in the worst case, but in
practice it should only be dealing with a few new objects (and we want
to cheaply find out what those objects are, if there are even any at
all).

> We definitely should prevent the segfault, but I think that's better
> done by making --promisor only work if we run index-pack from within a
> repo. I don't think we can restrict the repacking to run only if we're
> indexing a pack within the repo, because in our fetch case, we're
> indexing a new pack - not one within the repo.

I think the "--stdin" thing above neatly solves this.

> Maybe we could conceptualize "index-pack --promisor" as the pack
> giving "testimony" about objects that its objects link to, so we can
> update our own records.

Yeah, I guess the fundamental thing here is that anybody who isn't
passing "--promisor" is not going to be affected, so that at least
limits the opportunity for surprise.

The quarantine discussion above is an example of how there could be
unexpected consequences. I _think_ it's OK based on what I wrote, but
hopefully that explains my general feeling of surprise. I dunno. It
still may be the least bad thing.

-Peff