Re: [PATCH 0/3] rev-list: add support for commits in `--missing`

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 13 Oct 2023 10:07:29 -0700

Patrick Steinhardt <ps@xxxxxx> writes:

> actually exists in the object database. It's the only callsite of that
> function outside of "commit-graph.c", as all other external callers
> would call `lookup_commit_in_graph()` which _does_ perform the object
> existence check.
>
> So I think that the proper way to address the regression would be a
> patch similar to the following:
>
> diff --git a/commit.c b/commit.c
> index b3223478bc..109e9217e3 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		if (!has_object(r, &item->object.oid, 0))
> +			return quiet_on_missing ? -1 :
> +				error(_("commit %s exists in commit-graph but not in the object database"),
> +				      oid_to_hex(&item->object.oid));
>  		return 0;
> +	}

It is (overly) conservative, which I generally should find pleasing,
but as I said, for secondary information like commit-graph that is
derived from the primary source only to precompute for optimization,
our general attitude should be to trust it and let the optimization
kick in, unless the operation being performed primarily cares about
the case where the result from using and not using the secondary
source differs.  An obvious example of such an operation is "fsck",
where we do care and want to notice when the underlying object graph
and what commit-graph records contradict with each other.  And my
suggestion to disable commit-graph while running the "rev-list"
command with the "--missing" option is because that usage would fall
into the same category (please correct me if that is not the case) [*].

So for the purpose of "rev-list --missing", the above change does
take us closer to the answer we would get from the primary source,
but isn't it pessimising other users unnecessarily?  Do we have a
rough idea (if we have a benchmark that would be even better) how
much this lack of has_object() check is contributing to the
performance benefit of using commit-graph?  Majority of callers of
this code should not have to pay the additional overhead here, so
unless it is small enough, this would be pessimising the generic
code for everybody to please "rev-list --missing", which is why I am
worried about going in this direction.