(I missed ccing reviewers from last round; adding now, although I know Junio is vacationing. Sorry about that.) On Tue, Oct 22, 2024 at 5:28 PM Emily Shaffer <emilyshaffer@xxxxxxxxxx> wrote: > > When fetching, we decide which objects to skip asking for marking > certain commit IDs with the COMPLETE flag. This flag is set in > fetch-pack.c:mark_complete(), which is called from a few different > functions who decide what to mark or not mark. mark_complete() is > insulated against null pointer deref and repeatedly writing the same > commit to the list of objects to skip; because it's the central function > which decides that an object is COMPLETE and doesn't need to be fetched, > let's also insulate it against corruption where the object is not > present (even though we think it is). > > Without this check, it's possible to reach a corrupted state where > fetches can infinitely loop because we decide to skip fetching, but we > don't actually have the skipped commit in our object store. > > This manifested at $DAYJOB in a repo with the following features: > * blob-filtered partial clone enabled > * commit graph enabled > * ref Foo pointing to commit object 6aaaca > * object 6aaaca missing[1] > > With these prerequisites, we noticed that `git fetch` in the repo > produced an infinite loop: > 1. `git fetch` tries to fetch, but thinks it has all objects, so it > noops. > 2. At the end of cmd_fetch(), we try to write_commit_graph_reachable(). > 3. write_commit_graph_reachable() does a reachability walk, including > starting from Foo > 4. The reachability walk tries to peel Foo, and notices it's missing > 6aaaca. > 5. The partial clone machinery asks for a per-object JIT fetch of > 6aaaca. > 6. `git fetch` (child process) is asked to fetch 6aaaca. > 7. We put together the_repository->parsed_objects, adding all commit IDs > reachable from local refs to it > (fetch-pack.c:mark_complete_and_common_refs(), trace region > mark_complete_local_refs). We see Foo, so we add 6aaaca to > the_repository->parsed_objects and mark it as COMPLETE. > 8. cmd_fetch notices that the object ID it was asked for is already > known, so does not fetch anything new. > 9. GOTO 2. > > The culprit is that we're assuming all local refs already must have > objects in place. Let's not assume that, and explicitly check > has_object() before marking objects as COMPLETE. > > NEEDSWORK: It could be valuable to emit a trace event when we try to > mark_complete() an object where we don't actually has_object() (to > understand how often we're having to heal from corruption). > > Signed-off-by: Emily Shaffer <emilyshaffer@xxxxxxxxxx> > Helped-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx> > > 1: That commit object went missing as a byproduct of this partial clone > gc issue: > https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@xxxxxxxxxx/ > > --- > > > On the list, Junio and Robert Coup suggested that we should notice cases > when the list of object IDs we are trying to fetch gets filtered to > nothing during a lazy fetch. However, this turned out to be quite tricky > to implement - fetch-by-OID stores the OID in a ref object, so it's > challenging to guess that the ref we care about is an OID instead of a > normal ref. It also required some heuristic approach to catch the exact > moment after we removed COMPLETE objects from the list we wanted to > fetch, and to notice that the wrong objects were missing from that list. > > (From the list of alternatives I included with v1, v1 was approach iii; > Junio and Robert suggested iv; this patch holds something close to ii.) > > Jonathan Tan suggested that instead, we could be more careful about what > we mark COMPLETE or not; it seems like this is pretty straightforward to > do. > > I do wonder if it's a layering violation to do the has_object() check in > mark_complete(), which is otherwise a pretty stupid function; I included > it there because all of the other flavors of mark_complete_* eventually > boil down to mark_complete(), so we wouldn't need to remember to check > for object existence any other time we're trying to do this COMPLETE > marking thing. > > Note that I added a test to guarantee this behavior works, but without > commit graph enabled, the test passes both before and after; I guess > maybe it's better to add an explicit regression test for the > combination? But, this test fails with reftable - because we don't have > a way (that I know of) to force-create a ref with a bad ID, to force > this error condition. In the test as written I'm writing to > .git/refs/heads/bar directly; that doesn't work for reftable. But `git > update-ref` is too smart to let me set a ref to garbage. Any tips there > are welcome. I keep thinking about this test, and the more I think, the less valuable I believe it is. I think we aren't super in the habit of writing regression tests, but would it be that valuable to write a regression test in this case instead? On the other hand, I think the code diff is quite obviously a good idea, and we can see from the test suite that there isn't really a performance hit from it. Is it necessary to add a test at all? Or, I guess that we could try to inspect how many fetch attempts were needed to do the JIT fetch in this test. I suspect the number will be too high without this patch - I know it recurses at least more than once. I dunno. Anybody have stronger opinions than me? > > The CI run is at > https://github.com/nasamuffin/git/actions/runs/11470039702 - it seems > the reftable tests are the only things failing. > > Also, I am sending this version, but if there are any additional > comments or it requires more changes, please expect Jonathan Tan to take > over driving this patch the rest of the way for me. As previously > stated[2], I'll be OOO after this Friday for most of the rest of the > year; the rest of this week I'm trying to get the rest of my > non-upstream loose ends tied up, so I won't have time to do another > iteration. See folks around Christmastime :) > > - Emily > > 2: https://lore.kernel.org/git/CAJoAoZnovapqMcu72DGR40jRRqRn57uJVTJg82kZ_rohtGDSfQ@xxxxxxxxxxxxxx/ > > Cover letter from v1 follows: > > There are a few alternative approaches for this issue that I talked > about with some folks at $DAYJOB: > > i. Just disabling the commit graph rewrite allows this to fall > eventually into a path where the fetch actually succeeds. I didn't like > this solution - it's just whack-a-mole - so I didn't look too hard into > why it succeeds that way. It *could* make sense to disable commit graph > rewrite when we do a JIT fetch with blob or tree filter provided - but > if later we want to implement commit filter (something we've talked > about at Google) then I'd worry about this situation coming up again. > > ii. We could decide not to mark local refs (and commits reachable from > them) as COMPLETE in the_repository->parsed_objects. I didn't try this > solution out, and I'm not sure what the performance implications are, > but Jonathan Tan likes this solution, so I may try it out and see what > breaks shortly. > > iii. We could do all the JIT fetches with --refetch. In my opinion, this > is the safest/most self-healing solution; the JIT fetch only happens > when we really know we're missing the object, so it doesn't make sense > for that fetch to be canceled by any cache. It doesn't have performance > implications as far as I can guess (except that I think we still build > the parsed_objects hash even though we are going to ignore it, but we > already were doing that anyway). Of course, that's what this patch does. > > iv. We could do nothing; when cmd_fetch gets a fetch-by-object-id but > decides there is nothing more to do, it could terminate with an error. > That should stop the infinite recursion, and the error could suggest the > user to run `git fsck` and discover what the problem is. Depending on > the remediation we suggest, though, I think a direct fetch to fix this > particular loop would not work. > > I'm curious to hear thoughts from people who are more expert than me on > partial clone and fetching in general, though. > > This change is also still in RFC, for two reasons: > > First, it's intermittently failing tests for me locally, in weirdly > flaky ways: > > - t0410-partial-clone.sh fails when I run it from prove, but passes when > I run it manually, every time. > - t5601-clone.sh and t5505-remote.sh fail nonsensically on `rm -rf` that > should succeed (and does succeed if I stop the test with test_pause), > which makes me think there's something else borked in my setup, but > I'm not sure what. > - t5616-partial-clone.sh actually does fail in a way that I could see > having to do with this change (since I guess we might download more > packs than usual), but I was so confused by the other two errors I > haven't looked closely yet. > > And secondly, I didn't write tests verifying the breakage and that this > change fixes it yet, either. > > I'm going to work on both those things in the background, but I wanted > to get the description and RFC out early so that folks could take a look > and we could decide which approach is best. > > Thanks, > - Emily > --- > fetch-pack.c | 4 +++- > t/t0410-partial-clone.sh | 30 ++++++++++++++++++++++++++++++ > 2 files changed, 33 insertions(+), 1 deletion(-) > > diff --git a/fetch-pack.c b/fetch-pack.c > index f752da93a8..8cb2ce4c54 100644 > --- a/fetch-pack.c > +++ b/fetch-pack.c > @@ -603,7 +603,9 @@ static int mark_complete(const struct object_id *oid) > { > struct commit *commit = deref_without_lazy_fetch(oid, 1); > > - if (commit && !(commit->object.flags & COMPLETE)) { > + if (commit && > + !(commit->object.flags & COMPLETE) && > + has_object(the_repository, oid, 0)) { > commit->object.flags |= COMPLETE; > commit_list_insert(commit, &complete); > } > diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh > index 818700fbec..95de18ec40 100755 > --- a/t/t0410-partial-clone.sh > +++ b/t/t0410-partial-clone.sh > @@ -241,6 +241,36 @@ test_expect_success 'fetching of missing objects works with ref-in-want enabled' > grep "fetch< fetch=.*ref-in-want" trace > ' > > +test_expect_success 'fetching missing objects pointed to by a local ref' ' > + rm -rf reliable-server unreliable-client && > + test_when_finished rm -rf reliable-server unreliable-client && > + test_create_repo reliable-server && > + git -C reliable-server config uploadpack.allowanysha1inwant 1 && > + git -C reliable-server config uploadpack.allowfilter 1 && > + test_commit -C reliable-server foo && > + > + git clone --filter=blob:none "file://$(pwd)/reliable-server" unreliable-client && > + > + # to simulate the unreliable client losing a referenced object by > + # corruption, create the object on the server side, then create only a > + # reference to that object on the client side (without providing the > + # object itself). > + test_commit -C reliable-server bar && > + HASH=$(git -C reliable-server rev-parse HEAD) && > + echo "$HASH" >unreliable-client/.git/refs/heads/bar && > + > + # the object is really missing > + # check if we can rev-parse a partial SHA. partial so we do not fetch it, > + # but barely partial (trim only the last char) so that we do not collide > + test_must_fail git -C unreliable-client rev-parse ${HASH%%?} && > + > + # trigger a remote fetch by checking out `bar` > + git -C unreliable-client switch bar && > + > + # and now we have the missing object > + git -C unreliable-client rev-parse ${HASH%%?} > +' > + > test_expect_success 'fetching of missing objects from another promisor remote' ' > git clone "file://$(pwd)/server" server2 && > test_commit -C server2 bar && > > Range-diff against v1: > 1: 092be0a655 ! 1: 4db6bbb4cd promisor-remote: always JIT fetch with --refetch > - ## promisor-remote.c ## > -@@ promisor-remote.c: static int fetch_objects(struct repository *repo, > - strvec_pushl(&child.args, "-c", "fetch.negotiationAlgorithm=noop", > - "fetch", remote_name, "--no-tags", > - "--no-write-fetch-head", "--recurse-submodules=no", > -- "--filter=blob:none", "--stdin", NULL); > -+ "--filter=blob:none", "--refetch", "--stdin", NULL); > - if (!git_config_get_bool("promisor.quiet", &quiet) && quiet) > - strvec_push(&child.args, "--quiet"); > - if (start_command(&child)) > + ## fetch-pack.c ## > +@@ fetch-pack.c: static int mark_complete(const struct object_id *oid) > + { > + struct commit *commit = deref_without_lazy_fetch(oid, 1); > + > +- if (commit && !(commit->object.flags & COMPLETE)) { > ++ if (commit && > ++ !(commit->object.flags & COMPLETE) && > ++ has_object(the_repository, oid, 0)) { > + commit->object.flags |= COMPLETE; > + commit_list_insert(commit, &complete); > + } > + > + ## t/t0410-partial-clone.sh ## > +@@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects works with ref-in-want enabled' > + grep "fetch< fetch=.*ref-in-want" trace > + ' > + > ++test_expect_success 'fetching missing objects pointed to by a local ref' ' > ++ rm -rf reliable-server unreliable-client && > ++ test_when_finished rm -rf reliable-server unreliable-client && > ++ test_create_repo reliable-server && > ++ git -C reliable-server config uploadpack.allowanysha1inwant 1 && > ++ git -C reliable-server config uploadpack.allowfilter 1 && > ++ test_commit -C reliable-server foo && > ++ > ++ git clone --filter=blob:none "file://$(pwd)/reliable-server" unreliable-client && > ++ > ++ # to simulate the unreliable client losing a referenced object by > ++ # corruption, create the object on the server side, then create only a > ++ # reference to that object on the client side (without providing the > ++ # object itself). > ++ test_commit -C reliable-server bar && > ++ HASH=$(git -C reliable-server rev-parse HEAD) && > ++ echo "$HASH" >unreliable-client/.git/refs/heads/bar && > ++ > ++ # the object is really missing > ++ # check if we can rev-parse a partial SHA. partial so we do not fetch it, > ++ # but barely partial (trim only the last char) so that we do not collide > ++ test_must_fail git -C unreliable-client rev-parse ${HASH%%?} && > ++ > ++ # trigger a remote fetch by checking out `bar` > ++ git -C unreliable-client switch bar && > ++ > ++ # and now we have the missing object > ++ git -C unreliable-client rev-parse ${HASH%%?} > ++' > ++ > + test_expect_success 'fetching of missing objects from another promisor remote' ' > + git clone "file://$(pwd)/server" server2 && > + test_commit -C server2 bar && > -- > 2.47.0.105.g07ac214952-goog >