On Mon, Oct 23, 2023 at 01:27:16PM +0200, Patrick Steinhardt wrote: > Our `lookup_commit_in_graph()` helper tries to look up commits from the > commit graph and, if it doesn't exist there, falls back to parsing it > from the object database instead. This is intended to speed up the > lookup of any such commit that exists in the database. There is an edge > case though where the commit exists in the graph, but not in the object > database. To avoid returning such stale commits the helper function thus > double checks that any such commit parsed from the graph also exists in > the object database. This makes the function safe to use even when > commit graphs aren't updated regularly. > > We're about to introduce the same pattern into other parts of our code > base though, namely `repo_parse_commit_internal()`. Here the extra > sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was > a newly introduced helper, and as such there was no performance hit by > adding this sanity check. If we added `repo_parse_commit_internal()` > with that sanity check right from the beginning as well, this would > probably never have been an issue to begin with. But by retrofitting it > with this sanity check now we do add a performance regression to > preexisting code, and thus there is a desire to avoid this or at least > give an escape hatch. > > In practice, there is no inherent reason why either of those functions > should have the sanity check whereas the other one does not: either both > of them are able to detect this issue or none of them should be. This > also means that the default of whether we do the check should likely be > the same for both. To err on the side of caution, we thus rather want to > make `repo_parse_commit_internal()` stricter than to loosen the checks > that we already have in `lookup_commit_in_graph()`. All well reasoned. I think the most compelling reason is that we're already doing this extra check in lookup_commit_in_graph(), and having that be somewhat inconsistent with repo_parse_commit_internal() feels error-prone to me. > The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA > environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is > the default, we will double check that commits looked up in the commit > graph via `lookup_commit_in_graph()` also exist in the object database. > This same check will also be added in `repo_parse_commit_internal()`. Sounds good. > Signed-off-by: Patrick Steinhardt <ps@xxxxxx> > --- > Documentation/git.txt | 9 +++++++++ > commit-graph.c | 6 +++++- > commit-graph.h | 6 ++++++ > t/t5318-commit-graph.sh | 21 +++++++++++++++++++++ > 4 files changed, 41 insertions(+), 1 deletion(-) > > diff --git a/Documentation/git.txt b/Documentation/git.txt > index 11228956cd..22c2b537aa 100644 > --- a/Documentation/git.txt > +++ b/Documentation/git.txt > @@ -911,6 +911,15 @@ for full details. > should not normally need to set this to `0`, but it may be > useful when trying to salvage data from a corrupted repository. > > +`GIT_COMMIT_GRAPH_PARANOIA`:: > + If this Boolean environment variable is set to false, ignore the > + case where commits exist in the commit graph but not in the > + object database. Normally, Git will check whether commits loaded > + from the commit graph exist in the object database to avoid > + issues with stale commit graphs, but this check comes with a > + performance penalty. The default is `1` (i.e., be paranoid about > + stale commits in the commit graph). > + The first two sentences seem to be flipped. Perhaps: When loading a commit object from the commit-graph, Git will perform an existence check on the object in the ODB before parsing it out of the commit-graph. The default is "true", which enables the aforementioned behavior. Setting this to "false" disables the existential check when parsing commits from a commit-graph. > `GIT_ALLOW_PROTOCOL`:: > If set to a colon-separated list of protocols, behave as if > `protocol.allow` is set to `never`, and each of the listed > diff --git a/commit-graph.c b/commit-graph.c > index fd2f700b2e..12ec31902e 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -939,14 +939,18 @@ int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c, > > struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id) > { > + static int object_paranoia = -1; > struct commit *commit; > uint32_t pos; > > + if (object_paranoia == -1) > + object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1); > + I don't think that this is a reroll-able issue, but calling this variable object_paranoia to store a setting for *graph* paranoia feels like a good itch to scratch. But obviously not a big deal ;-). > @@ -821,4 +821,25 @@ test_expect_success 'overflow during generation version upgrade' ' > ) > ' > > +test_expect_success 'stale commit cannot be parsed when given directly' ' > + test_when_finished "rm -rf repo" && > + git init repo && > + ( > + cd repo && > + test_commit A && > + test_commit B && > + git commit-graph write --reachable && > + > + oid=$(git rev-parse B) && > + rm .git/objects/"$(test_oid_to_path "$oid")" && > + > + # Verify that it is possible to read the commit from the > + # commit graph when not being paranoid, ... > + GIT_COMMIT_GRAPH_PARANOIA=false git rev-list B && > + # ... but parsing the commit when double checking that > + # it actually exists in the object database should fail. > + test_must_fail git rev-list -1 B Would "cat-file -p" be more direct here than "rev-list -1"? Thanks, Taylor