Re: [PATCH] commit-graph: retain commit slab when closing NULL commit_graph

Taylor Blau <me@xxxxxxxxxxxx> · Fri, 5 Jan 2024 12:07:44 -0500

On Fri, Jan 05, 2024 at 12:41:42AM -0500, Jeff King wrote:
> This fixes a regression introduced in ac6d45d11f (commit-graph: move
> slab-clearing to close_commit_graph(), 2023-10-03), in which running:
>
>   git -c fetch.writeCommitGraph=true fetch --recurse-submodules
>
> multiple times in a freshly cloned repository causes a segfault. What
> happens in the second (and subsequent) runs is this:
>
>   1. We make a "struct commit" for any ref tips which we're storing
>      (even if we already have them, they still go into FETCH_HEAD).
>
>      Because the first run will have created a commit graph, we'll find
>      those commits in the graph.
>
>      The commit struct is therefore created with a NULL "maybe_tree"
>      entry, because we can load its oid from the graph later. But to do
>      that we need to remember that we got the commit from the graph,
>      which is recorded in a global commit_graph_data_slab object.
>
>   2. Because we're using --recurse-submodules, we'll try to fetch each
>      of the possible submodules. That implies creating a separate
>      "struct repository" in-process for each submodule, which will
>      require a later call to repo_clear().
>
>      The call to repo_clear() calls raw_object_store_clear(), which in
>      turn calls close_object_store(), which in turn calls
>      close_commit_graph(). And the latter frees the commit graph data
>      slab.
>
>   3. Later, when trying to write out a new commit graph, we'll ask for
>      their tree oid via get_commit_tree_oid(), which will see that the
>      object is parsed but with a NULL maybe_tree field. We'd then
>      usually pull it from the graph file, but because the slab was
>      cleared, we don't realize that we can do so! We end up returning
>      NULL and segfaulting.
>
>      (It seems questionable that we'd write a graph entry for such a
>      commit anyway, since we know we already have one. I didn't
>      double-check, but that may simply be another side effect of having
>      cleared the slab).

Yeah, I agree that we should not be re-writing a commit-graph entry that
already exists in an earlier (non-removed) layer of the commit-graph.
I haven't thought through all of the details here, but that sounds like
a potentially dangerous thing to be doing.

> So that fixes the regression in the least-risky way possible.

Thanks for the detailed explanation. I think that the fix presented here
is a reasonable approach, and is worthwhile to pick up.

> I do think there's some fragility here that we might want to follow up
> on. We have a global commit_graph_data_slab that contains graph
> positions, and our global commit structs depend on the that slab
> remaining valid. But close_commit_graph() is just about closing _one_
> object store's graph. So it's dangerous to call that function and clear
> the slab without also throwing away any "struct commit" we might have
> parsed that depends on it.

I definitely agree that there seems like a high risk of another
potentially-dangerous bug happening as a result of this area.

One thing that sticks out to me is that we have a scope mismatch
between the commit structs, the raw_object_store, and the commit slabs.
Slabs are tied to the lifetime of the raw_object_store, but the commit
objects are tied to the global lifetime of the process. I wonder if it
would make sense to have a slab per object-store, and require that you
pass both the commit *and* the object-store you're looking it up in as
arguments to any slab-related functions.

I dunno. I have not put a ton of thought into that ^^ approach either,
so let me know if it seems obviously bogus to you or if this is worth
looking into further.

Thanks,
Taylor