On Mon, Oct 30, 2023 at 05:29:33PM -0400, Taylor Blau wrote: > On Mon, Oct 23, 2023 at 01:27:16PM +0200, Patrick Steinhardt wrote: > > Our `lookup_commit_in_graph()` helper tries to look up commits from the > > commit graph and, if it doesn't exist there, falls back to parsing it > > from the object database instead. This is intended to speed up the > > lookup of any such commit that exists in the database. There is an edge > > case though where the commit exists in the graph, but not in the object > > database. To avoid returning such stale commits the helper function thus > > double checks that any such commit parsed from the graph also exists in > > the object database. This makes the function safe to use even when > > commit graphs aren't updated regularly. > > > > We're about to introduce the same pattern into other parts of our code > > base though, namely `repo_parse_commit_internal()`. Here the extra > > sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was > > a newly introduced helper, and as such there was no performance hit by > > adding this sanity check. If we added `repo_parse_commit_internal()` > > with that sanity check right from the beginning as well, this would > > probably never have been an issue to begin with. But by retrofitting it > > with this sanity check now we do add a performance regression to > > preexisting code, and thus there is a desire to avoid this or at least > > give an escape hatch. > > > > In practice, there is no inherent reason why either of those functions > > should have the sanity check whereas the other one does not: either both > > of them are able to detect this issue or none of them should be. This > > also means that the default of whether we do the check should likely be > > the same for both. To err on the side of caution, we thus rather want to > > make `repo_parse_commit_internal()` stricter than to loosen the checks > > that we already have in `lookup_commit_in_graph()`. > > All well reasoned. I think the most compelling reason is that we're > already doing this extra check in lookup_commit_in_graph(), and having > that be somewhat inconsistent with repo_parse_commit_internal() feels > error-prone to me. > > > The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA > > environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is > > the default, we will double check that commits looked up in the commit > > graph via `lookup_commit_in_graph()` also exist in the object database. > > This same check will also be added in `repo_parse_commit_internal()`. > > Sounds good. > > > Signed-off-by: Patrick Steinhardt <ps@xxxxxx> > > --- > > Documentation/git.txt | 9 +++++++++ > > commit-graph.c | 6 +++++- > > commit-graph.h | 6 ++++++ > > t/t5318-commit-graph.sh | 21 +++++++++++++++++++++ > > 4 files changed, 41 insertions(+), 1 deletion(-) > > > > diff --git a/Documentation/git.txt b/Documentation/git.txt > > index 11228956cd..22c2b537aa 100644 > > --- a/Documentation/git.txt > > +++ b/Documentation/git.txt > > @@ -911,6 +911,15 @@ for full details. > > should not normally need to set this to `0`, but it may be > > useful when trying to salvage data from a corrupted repository. > > > > +`GIT_COMMIT_GRAPH_PARANOIA`:: > > + If this Boolean environment variable is set to false, ignore the > > + case where commits exist in the commit graph but not in the > > + object database. Normally, Git will check whether commits loaded > > + from the commit graph exist in the object database to avoid > > + issues with stale commit graphs, but this check comes with a > > + performance penalty. The default is `1` (i.e., be paranoid about > > + stale commits in the commit graph). > > + > > The first two sentences seem to be flipped. Perhaps: > > When loading a commit object from the commit-graph, Git will perform > an existence check on the object in the ODB before parsing it out of > the commit-graph. The default is "true", which enables the > aforementioned behavior. Setting this to "false" disables the > existential check when parsing commits from a commit-graph. I was modelling this after the text we had in `GIT_REF_PARANOIA`, but I like your version more indeed. I'll massage it a bit to mention _why_ one would want to disable this. > > `GIT_ALLOW_PROTOCOL`:: > > If set to a colon-separated list of protocols, behave as if > > `protocol.allow` is set to `never`, and each of the listed > > diff --git a/commit-graph.c b/commit-graph.c > > index fd2f700b2e..12ec31902e 100644 > > --- a/commit-graph.c > > +++ b/commit-graph.c > > @@ -939,14 +939,18 @@ int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c, > > > > struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id) > > { > > + static int object_paranoia = -1; > > struct commit *commit; > > uint32_t pos; > > > > + if (object_paranoia == -1) > > + object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1); > > + > > I don't think that this is a reroll-able issue, but calling this > variable object_paranoia to store a setting for *graph* paranoia feels > like a good itch to scratch. But obviously not a big deal ;-). Ugh, yeah. I first had the envvar as "GIT_OBJECT_PARANOIA", but discarded that name because I feared that it might become overloaded with semi-related checks. Will fix. > > @@ -821,4 +821,25 @@ test_expect_success 'overflow during generation version upgrade' ' > > ) > > ' > > > > +test_expect_success 'stale commit cannot be parsed when given directly' ' > > + test_when_finished "rm -rf repo" && > > + git init repo && > > + ( > > + cd repo && > > + test_commit A && > > + test_commit B && > > + git commit-graph write --reachable && > > + > > + oid=$(git rev-parse B) && > > + rm .git/objects/"$(test_oid_to_path "$oid")" && > > + > > + # Verify that it is possible to read the commit from the > > + # commit graph when not being paranoid, ... > > + GIT_COMMIT_GRAPH_PARANOIA=false git rev-list B && > > + # ... but parsing the commit when double checking that > > + # it actually exists in the object database should fail. > > + test_must_fail git rev-list -1 B > > Would "cat-file -p" be more direct here than "rev-list -1"? No, because it doesn't use `lookup_commit_in_graph()`. I had to go searching a bit to find something that exposes this inconsistency, and git-rev-list(1) was the easiest one I found. Patrick
Attachment:
signature.asc
Description: PGP signature