Taylor Blau <me@xxxxxxxxxxxx> writes: > > I'm going to work on both those things in the background, but I wanted > > to get the description and RFC out early so that folks could take a look > > and we could decide which approach is best. > > I am a little confused. Here you say that this patch is still in RFC, > but the subject line dropped the RFC present in the first round. What is > the state of this patch's readiness? > > Thanks, > Taylor As Emily said [1], I'll be taking over driving this patch. The tl;dr is this patch is not ready, so I think you (the interim maintainer) can drop it. This patch strives to avoid marking missing objects as COMPLETE by doing a check in mark_complete(), but deref_without_lazy_fetch() already makes such a check (note how it can return NULL; also its name implies that it knows something about missing objects, namely that it says it won't lazy fetch them) so the question should be: why doesn't it return NULL when an object is missing? It turns out that it first checks the commit graph file and if it's there, then it's considered to be present, so it does not fetch at all. However there are code paths during commit graph writing (executed after every fetch, in builtin/fetch.c) that access the object store directly without going through the commit graph file (search for "object_info" in commit-graph.c), and those functions perform lazy fetches when the object is missing. So there's an infinite loop of "commit graph writer reads X" -> "fetch X" (nothing gets fetched) -> "write new commit graph, as we always do after a fetch" -> "commit graph writer reads X" -> ... So my initial proposal to not mark objects as COMPLETE if they are missing does not work, because they are already not marked as COMPLETE if they are missing. One could say that we should check both the commit graph file and the object store (or, perhaps even better, only the object store) before stating that an object is present, but I think that Git already assumes in some places that a commit is present merely by its presence in the commit graph file, and it's not worth changing this design. One solution to fix this is to make the commit graph writer never lazy-fetch. This closes us off to being able to have missing commits in a partial clone (at least, if we also want to use the commit graph file). This might be a reasonable thing to do - at least, partial clone has been around for a few years and we've not made many concrete steps towards that - but I feel that we shouldn't close ourselves off if there's an alternative. The alternative I'm currently thinking of is to detect if we didn't fetch any packfiles, and if we didn't, don't write a commit graph (and don't GC), much like we do when we have --negotiate-only. (A packfile-less fetch still can cause refs to be rewritten and thus reduce the number of reachable objects, thus enabling a smaller commit graph to be written and some objects to be GC-ed, but I think that this situation still doesn't warrant commit graph writing and/or GC - we can just do those next time.) The main issue is that we don't always know whether a pack is written or not - in particular, if we use something other than "connect" or "stateless-connect" on a remote helper, we won't know if a packfile was sent. We can solve this by (1) only write the commit graph and GC if we know for sure that a packfile was sent, or (2) write the commit graph and GC unless we know for sure that a packfile was not sent. I'm leaning towards (1) because it seems more conceptually coherent even though it is a change of behavior (from auto commit graph and GC to no), because I think that the repos that need scalability the most already use protocol v2 during fetching (which does require "connect" or "stateless-connect" from remote helpers, so we're covered here), but am OK with (2) as well. Feel free to let me know if you have any ideas. In the meantime I'll look at (1). [1] https://lore.kernel.org/git/20241023002806.367082-1-emilyshaffer@xxxxxxxxxx/