On Fri, Jan 05, 2024 at 12:41:42AM -0500, Jeff King wrote: > This fixes a regression introduced in ac6d45d11f (commit-graph: move > slab-clearing to close_commit_graph(), 2023-10-03), in which running: > > git -c fetch.writeCommitGraph=true fetch --recurse-submodules > > multiple times in a freshly cloned repository causes a segfault. What > happens in the second (and subsequent) runs is this: > > 1. We make a "struct commit" for any ref tips which we're storing > (even if we already have them, they still go into FETCH_HEAD). > > Because the first run will have created a commit graph, we'll find > those commits in the graph. > > The commit struct is therefore created with a NULL "maybe_tree" > entry, because we can load its oid from the graph later. But to do > that we need to remember that we got the commit from the graph, > which is recorded in a global commit_graph_data_slab object. > > 2. Because we're using --recurse-submodules, we'll try to fetch each > of the possible submodules. That implies creating a separate > "struct repository" in-process for each submodule, which will > require a later call to repo_clear(). > > The call to repo_clear() calls raw_object_store_clear(), which in > turn calls close_object_store(), which in turn calls > close_commit_graph(). And the latter frees the commit graph data > slab. > > 3. Later, when trying to write out a new commit graph, we'll ask for > their tree oid via get_commit_tree_oid(), which will see that the > object is parsed but with a NULL maybe_tree field. We'd then > usually pull it from the graph file, but because the slab was > cleared, we don't realize that we can do so! We end up returning > NULL and segfaulting. > > (It seems questionable that we'd write a graph entry for such a > commit anyway, since we know we already have one. I didn't > double-check, but that may simply be another side effect of having > cleared the slab). Yeah, I agree that we should not be re-writing a commit-graph entry that already exists in an earlier (non-removed) layer of the commit-graph. I haven't thought through all of the details here, but that sounds like a potentially dangerous thing to be doing. > So that fixes the regression in the least-risky way possible. Thanks for the detailed explanation. I think that the fix presented here is a reasonable approach, and is worthwhile to pick up. > I do think there's some fragility here that we might want to follow up > on. We have a global commit_graph_data_slab that contains graph > positions, and our global commit structs depend on the that slab > remaining valid. But close_commit_graph() is just about closing _one_ > object store's graph. So it's dangerous to call that function and clear > the slab without also throwing away any "struct commit" we might have > parsed that depends on it. I definitely agree that there seems like a high risk of another potentially-dangerous bug happening as a result of this area. One thing that sticks out to me is that we have a scope mismatch between the commit structs, the raw_object_store, and the commit slabs. Slabs are tied to the lifetime of the raw_object_store, but the commit objects are tied to the global lifetime of the process. I wonder if it would make sense to have a slab per object-store, and require that you pass both the commit *and* the object-store you're looking it up in as arguments to any slab-related functions. I dunno. I have not put a ton of thought into that ^^ approach either, so let me know if it seems obviously bogus to you or if this is worth looking into further. Thanks, Taylor