Re: [PATCH v2 0/8] ref-filter: ahead/behind counting, faster --merged option

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 10 Mar 2023 11:16:17 -0800

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> I was
> initially concerned about the overhead of 'git for-each-ref' and its
> generality and sorting, but I was not able to measure any important
> difference between this implementation and our internal 'git ahead-behind'
> implementation.

That certainly is nice to know.

> However, for our specific uses, we like to batch a list of exact references
> that could be very long. We introduce a new --stdin option here.
>
> To keep things close to the v1 outline, I replaced the existing patches with
> closely-related ones, when possible.
>
> Patch 1 adds the --stdin option to 'git for-each-ref'. (This is similar to
> the boilerplate patch from v1.)
>
> Patch 2 adds a test to explicitly check that 'git for-each-ref' will still
> succeed when all input refs are missing. (This is similar to the
> --ignore-missing patch from v1.)

Sensible.

> Patches 3-5 introduce a new method: ensure_generations_valid(). Patch 3 does
> some refactoring of the existing generation number computations to make it
> more generic, and patch 4 updates the definition of
> commit_graph_generation() slightly, making way for patch 5 to implement the
> method. With an existing commit-graph file, the commits that are not present
> in the file are considered as having generation number "infinity". This is
> useful for most of our reachability queries to this point, since those
> commits are "above" the ones tracked by the commit-graph. When these commits
> are low in number, then there is very little performance cost and zero
> correctness cost. (These patches match v1 exactly.)
>
> However, we will see that the ahead/behind computation requires accurate
> generation numbers to avoid overcounting. Thus, ensure_generations_valid()
> is a way to specify a list of commits that need generation numbers computed
> before continuing. It's a no-op if all of those commits are in the
> commit-graph file. It's expensive if the commit-graph doesn't exist.

Reasonable.

> However, '%(ahead-behind:)' computations are likely to be slow no matter
> what without a commit-graph, so assuming an existing commit-graph file is
> reasonable. If we find sufficient desire to have an implementation that does
> not have this requirement, we could create a second implementation and
> toggle to it when generation_numbers_enabled() returns false.

At that point, it might make sense to find a way to make the work
ensure_generations_valid() had to spend cycles on not to go to
waste.  Something like "ah, you do not have commit-graph at all, so
let's try to create one if you can write into the repository" at the
beginning of the function, or something?  Just thinking aloud.

> Patch 6 implements the ahead-behind algorithm, but it is not connected to a
> builtin. It's a long commit message, so hopefully it explains the algorithm
> sufficiently. (The difference from v1 is that it no longer integrates with a
> builtin and there are no new tests. It also uses 'unsigned int' and is
> correctly co-authored by Taylor.)

Nice.

> Patch 7 integrates the ahead-behind algorithm with the ref-filter code,
> including parsing the "ahead-behind" token. This finally adds tests that
> check both ahead_behind() and ensure_generations_valid() via
> t6600-test-reach.sh. (This patch is essentially completely new in v2.)
>
> Patch 8 implements the tips_reachable_from_base() method, and uses it within
> the ref-filter code to speed up 'git for-each-ref --merged' and 'git branch
> --merged'. (The interface is slightly different than v1, due to the needs of
> the new caller.)

Very nice.

Having read all the patches, I am very impressed and pleased, but
are we losing anything by having the feature inside for-each-ref
compared to a new command ahead-behind?  As far as I can tell, the
new "for-each-ref --stdin" would still want to match refs and work
only on refs, but there shouldn't be any reason for ahead-behind
computation to limit to tips that are at the tip of a ref, so that
may be one downside in this updated design.  For the intended use
case of "let's find which branches are stale", that downside does
not matter in practice, but for other use cases people will think
of in the future, the limitation might matter (at which time we can
easily resurrect the other subcommand, using the internal machinery
we have here, so it is not a huge deal, I presume).

Thanks.