At $DAYJOB, we have used a custom 'ahead-behind' builtin in our fork of Git for lots of reasons. The main goal of the builtin is to compare multiple references against a common base reference. The comparison is number of commits that are in each side of the symmtric difference of their reachable sets. A commit C is "ahead" of a commit B by the number of commits in B..C (reachable from C but not reachable from B). Similarly, the commit C is "behind" the commit B by the number of commits in C..B (reachable from B but not reachable from C). These numbers can be computed by 'git rev-list --count B..C' and 'git rev-list --count C..B', but there are common needs that benefit from having the checks being done in the same process: 1. Our "branches" page lists ahead/behind counts for each listed branch as compared to the repo's default branch. This can be done with a single 'git ahead-behind' process. 2. When a branch is updated, a background job checks if any pull requests that target that branch should be closed because their branches were merged implicitly by that update. These queries can e batched into 'git ahead-behind' calls. In that second example, we don't need the full ahead/behind counts (although it is sufficient to look for branches that are "zero commits ahead", meaning they are reachable from the base), and instead reachability is the critical piece. This series contributes the custom algorithms we used for our 'git ahead-behind' builtin, but as extensions to 'git for-each-ref': * Add a new "%(ahead-behind:)" format token to for-each-ref which allows outputting the ahead/behind values in the format string for a matching ref. * Add a new algorithm that speeds up the 'git for-each-ref --merged=' option. This also applies to the 'git branch --merged=' option. The idea to use 'git for-each-ref' instead of creating a new builtin is from Junio, and simplifies this series significantly compared to v1. I was initially concerned about the overhead of 'git for-each-ref' and its generality and sorting, but I was not able to measure any important difference between this implementation and our internal 'git ahead-behind' implementation. In particular, when a pattern is given to 'git for-each-ref' that looks like an exact ref, it navigates directly to the ref instead of scanning all references for matches. However, for our specific uses, we like to batch a list of exact references that could be very long. We introduce a new --stdin option here. To keep things close to the v1 outline, I replaced the existing patches with closely-related ones, when possible. Patch 1 adds the --stdin option to 'git for-each-ref'. (This is similar to the boilerplate patch from v1.) Patch 2 adds a test to explicitly check that 'git for-each-ref' will still succeed when all input refs are missing. (This is similar to the --ignore-missing patch from v1.) Patches 3-5 introduce a new method: ensure_generations_valid(). Patch 3 does some refactoring of the existing generation number computations to make it more generic, and patch 4 updates the definition of commit_graph_generation() slightly, making way for patch 5 to implement the method. With an existing commit-graph file, the commits that are not present in the file are considered as having generation number "infinity". This is useful for most of our reachability queries to this point, since those commits are "above" the ones tracked by the commit-graph. When these commits are low in number, then there is very little performance cost and zero correctness cost. (These patches match v1 exactly.) However, we will see that the ahead/behind computation requires accurate generation numbers to avoid overcounting. Thus, ensure_generations_valid() is a way to specify a list of commits that need generation numbers computed before continuing. It's a no-op if all of those commits are in the commit-graph file. It's expensive if the commit-graph doesn't exist. However, '%(ahead-behind:)' computations are likely to be slow no matter what without a commit-graph, so assuming an existing commit-graph file is reasonable. If we find sufficient desire to have an implementation that does not have this requirement, we could create a second implementation and toggle to it when generation_numbers_enabled() returns false. Patch 6 implements the ahead-behind algorithm, but it is not connected to a builtin. It's a long commit message, so hopefully it explains the algorithm sufficiently. (The difference from v1 is that it no longer integrates with a builtin and there are no new tests. It also uses 'unsigned int' and is correctly co-authored by Taylor.) Patch 7 integrates the ahead-behind algorithm with the ref-filter code, including parsing the "ahead-behind" token. This finally adds tests that check both ahead_behind() and ensure_generations_valid() via t6600-test-reach.sh. (This patch is essentially completely new in v2.) Patch 8 implements the tips_reachable_from_base() method, and uses it within the ref-filter code to speed up 'git for-each-ref --merged' and 'git branch --merged'. (The interface is slightly different than v1, due to the needs of the new caller.) Updates in v3 ============= * The APIs are modified to take a 'struct repository *' and use them appropriately. * The --stdin option in 'git for-each-ref' now uses strvec instead of an ad-hoc array. Thanks, -Stolee Derrick Stolee (7): for-each-ref: add --stdin option for-each-ref: explicitly test no matches commit-graph: combine generation computations commit-graph: return generation from memory commit-reach: implement ahead_behind() logic for-each-ref: add ahead-behind format atom commit-reach: add tips_reachable_from_bases() Taylor Blau (1): commit-graph: introduce `ensure_generations_valid()` Documentation/git-for-each-ref.txt | 12 +- builtin/branch.c | 1 + builtin/for-each-ref.c | 24 +++- builtin/tag.c | 1 + commit-graph.c | 208 ++++++++++++++++++---------- commit-graph.h | 8 ++ commit-reach.c | 209 +++++++++++++++++++++++++++++ commit-reach.h | 40 ++++++ ref-filter.c | 91 ++++++++++--- ref-filter.h | 26 +++- t/perf/p1500-graph-walks.sh | 50 +++++++ t/t3203-branch-output.sh | 14 ++ t/t5318-commit-graph.sh | 2 +- t/t6300-for-each-ref.sh | 50 +++++++ t/t6301-for-each-ref-errors.sh | 14 ++ t/t6600-test-reach.sh | 169 +++++++++++++++++++++++ t/t7004-tag.sh | 28 ++++ 17 files changed, 858 insertions(+), 89 deletions(-) create mode 100755 t/perf/p1500-graph-walks.sh base-commit: 725f57037d81e24eacfda6e59a19c60c0b4c8062 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1489%2Fderrickstolee%2Fstolee%2Fupstream-ahead-behind-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1489/derrickstolee/stolee/upstream-ahead-behind-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/1489 Range-diff vs v2: 1: a1d9e0f6ff6 ! 1: f9e80e233f1 for-each-ref: add --stdin option @@ Commit message list is interpreted as the complete ref set. When reading from stdin, we populate the filter.name_patterns array - dynamically as opposed to pointing to the 'argv' array directly. This - requires a careful cast while freeing the individual strings, - conditioned on the --stdin option. + dynamically as opposed to pointing to the 'argv' array directly. This is + simple when using a strvec, as it is NULL-terminated in the same way. We + then free the memory directly from the strvec. + Helped-by: Phillip Wood <phillip.wood123@xxxxxxxxx> Signed-off-by: Derrick Stolee <derrickstolee@xxxxxxxxxx> ## Documentation/git-for-each-ref.txt ## @@ Documentation/git-for-each-ref.txt: OPTIONS `<pattern>`. This option makes it stop after showing ## builtin/for-each-ref.c ## +@@ + #include "object.h" + #include "parse-options.h" + #include "ref-filter.h" ++#include "strvec.h" + + static char const * const for_each_ref_usage[] = { + N_("git for-each-ref [<options>] [<pattern>]"), @@ builtin/for-each-ref.c: int cmd_for_each_ref(int argc, const char **argv, const char *prefix) struct ref_format format = REF_FORMAT_INIT; struct strbuf output = STRBUF_INIT; struct strbuf err = STRBUF_INIT; + int from_stdin = 0; ++ struct strvec vec = STRVEC_INIT; struct option opts[] = { OPT_BIT('s', "shell", &format.quote_style, @@ builtin/for-each-ref.c: int cmd_for_each_ref(int argc, const char **argv, const - filter.name_patterns = argv; + if (from_stdin) { + struct strbuf line = STRBUF_INIT; -+ size_t nr = 0, alloc = 16; + + if (argv[0]) + die(_("unknown arguments supplied with --stdin")); + -+ CALLOC_ARRAY(filter.name_patterns, alloc); -+ -+ while (strbuf_getline(&line, stdin) != EOF) { -+ ALLOC_GROW(filter.name_patterns, nr + 1, alloc); -+ filter.name_patterns[nr++] = strbuf_detach(&line, NULL); -+ } ++ while (strbuf_getline(&line, stdin) != EOF) ++ strvec_push(&vec, line.buf); + -+ /* Add a terminating NULL string. */ -+ ALLOC_GROW(filter.name_patterns, nr + 1, alloc); -+ filter.name_patterns[nr + 1] = NULL; ++ /* vec.v is NULL-terminated, just like 'argv'. */ ++ filter.name_patterns = vec.v; + } else { + filter.name_patterns = argv; + } @@ builtin/for-each-ref.c: int cmd_for_each_ref(int argc, const char **argv, const free_commit_list(filter.with_commit); free_commit_list(filter.no_commit); ref_sorting_release(sorting); -+ if (from_stdin) { -+ for (size_t i = 0; filter.name_patterns[i]; i++) -+ free((char *)filter.name_patterns[i]); -+ free(filter.name_patterns); -+ } ++ strvec_clear(&vec); return 0; } 2: 2f162a2f39f = 2: f56d6a64d24 for-each-ref: explicitly test no matches 3: db28e82d2a6 = 3: 3b15e9df770 commit-graph: combine generation computations 4: 3cf33801443 = 4: abd3e7a67be commit-graph: return generation from memory 5: 34dffd836b1 ! 5: e197bddcace commit-graph: introduce `ensure_generations_valid()` @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph + * After this method, all commits reachable from those in the given + * list will have non-zero, non-infinite generation numbers. + */ -+void ensure_generations_valid(struct commit **commits, size_t nr) ++void ensure_generations_valid(struct repository *r, ++ struct commit **commits, size_t nr) +{ -+ struct repository *r = the_repository; + int generation_version = get_configured_generation_version(r); + struct packed_commit_list list = { + .list = commits, @@ commit-graph.h: struct commit_graph_data { + * After this method, all commits reachable from those in the given + * list will have non-zero, non-infinite generation numbers. + */ -+void ensure_generations_valid(struct commit **commits, size_t nr); ++void ensure_generations_valid(struct repository *r, ++ struct commit **commits, size_t nr); + #endif 6: 9831c23eadb ! 6: 0fb3913810b commit-reach: implement ahead_behind() logic @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i + *bitmap = NULL; +} + -+void ahead_behind(struct commit **commits, size_t commits_nr, ++void ahead_behind(struct repository *r, ++ struct commit **commits, size_t commits_nr, + struct ahead_behind_count *counts, size_t counts_nr) +{ -+ struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; ++ struct prio_queue queue = { .compare = compare_commits_by_gen_then_commit_date }; + size_t width = (commits_nr + BITS_IN_EWORD - 1) / BITS_IN_EWORD; -+ size_t i; + + if (!commits_nr || !counts_nr) + return; + -+ for (i = 0; i < counts_nr; i++) { ++ for (size_t i = 0; i < counts_nr; i++) { + counts[i].ahead = 0; + counts[i].behind = 0; + } + -+ ensure_generations_valid(commits, commits_nr); ++ ensure_generations_valid(r, commits, commits_nr); + + init_bit_arrays(&bit_arrays); + -+ for (i = 0; i < commits_nr; i++) { ++ for (size_t i = 0; i < commits_nr; i++) { + struct commit *c = commits[i]; + struct bitmap *bitmap = init_bit_array(c, width); + @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i + struct commit_list *p; + struct bitmap *bitmap_c = init_bit_array(c, width); + -+ for (i = 0; i < counts_nr; i++) { ++ for (size_t i = 0; i < counts_nr; i++) { + int reach_from_tip = !!bitmap_get(bitmap_c, counts[i].tip_index); + int reach_from_base = !!bitmap_get(bitmap_c, counts[i].base_index); + @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i + for (p = c->parents; p; p = p->next) { + struct bitmap *bitmap_p; + -+ parse_commit(p->item); ++ repo_parse_commit(r, p->item); + + bitmap_p = init_bit_array(p->item, width); + bitmap_or(bitmap_p, bitmap_c); @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i + } + + /* STALE is used here, PARENT2 is used by insert_no_dup(). */ -+ repo_clear_commit_marks(the_repository, PARENT2 | STALE); ++ repo_clear_commit_marks(r, PARENT2 | STALE); + clear_bit_arrays(&bit_arrays); + clear_prio_queue(&queue); +} @@ commit-reach.h: struct commit_list *get_reachable_subset(struct commit **from, i + * Given an array of commits and an array of ahead_behind_count pairs, + * compute the ahead/behind counts for each pair. + */ -+void ahead_behind(struct commit **commits, size_t commits_nr, ++void ahead_behind(struct repository *r, ++ struct commit **commits, size_t commits_nr, + struct ahead_behind_count *counts, size_t counts_nr); + #endif 7: 82dd6f44a33 ! 7: 59cf6759e60 for-each-ref: add ahead-behind format atom @@ Commit message around commits_nr in the second loop of filter_ahead_behind(). Also, the test in t7004 is carefully located to avoid being dependent on the GPG prereq. It also avoids using the test_commit helper, as that will add - ticks to the time and disrupt the expected timestampes in later tag + ticks to the time and disrupt the expected timestamps in later tag tests. Also add performance tests in a new p1300-graph-walks.sh script. This @@ Documentation/git-for-each-ref.txt: worktreepath:: out, if it is checked out in any linked worktree. Empty string otherwise. -+ahead-behind:<ref>:: ++ahead-behind:<committish>:: + Two integers, separated by a space, demonstrating the number of + commits ahead and behind, respectively, when comparing the output -+ ref to the `<ref>` specified in the format. ++ ref to the `<committish>` specified in the format. + In addition to the above, for commit and tag objects, the header field names (`tree`, `parent`, `object`, `type`, and `tag`) can @@ builtin/branch.c: static void print_ref_list(struct ref_filter *filter, struct r if (verify_ref_format(format)) die(_("unable to parse format string")); -+ filter_ahead_behind(format, &array); ++ filter_ahead_behind(the_repository, format, &array); ref_array_sort(sorting, &array); for (i = 0; i < array.nr; i++) { ## builtin/for-each-ref.c ## @@ - #include "object.h" #include "parse-options.h" #include "ref-filter.h" + #include "strvec.h" +#include "commit-reach.h" static char const * const for_each_ref_usage[] = { @@ builtin/for-each-ref.c: int cmd_for_each_ref(int argc, const char **argv, const filter.match_as_path = 1; filter_refs(&array, &filter, FILTER_REFS_ALL); -+ filter_ahead_behind(&format, &array); ++ filter_ahead_behind(the_repository, &format, &array); + ref_array_sort(sorting, &array); @@ builtin/tag.c: static int list_tags(struct ref_filter *filter, struct ref_sortin die(_("unable to parse format string")); filter->with_commit_tag_algo = 1; filter_refs(&array, filter, FILTER_REFS_TAGS); -+ filter_ahead_behind(format, &array); ++ filter_ahead_behind(the_repository, format, &array); ref_array_sort(sorting, &array); for (i = 0; i < array.nr; i++) { @@ ref-filter.c: static void reach_filter(struct ref_array *array, free(to_clear); } -+void filter_ahead_behind(struct ref_format *format, ++void filter_ahead_behind(struct repository *r, ++ struct ref_format *format, + struct ref_array *array) +{ + struct commit **commits; @@ ref-filter.c: static void reach_filter(struct ref_array *array, + commits_nr++; + } + -+ ahead_behind(commits, commits_nr, array->counts, array->counts_nr); ++ ahead_behind(r, commits, commits_nr, array->counts, array->counts_nr); + free(commits); +} + @@ ref-filter.h: struct ref_array_item *ref_array_push(struct ref_array *array, + * + * If this is not called, then any ahead-behind atoms will be blank. + */ -+void filter_ahead_behind(struct ref_format *format, ++void filter_ahead_behind(struct repository *r, ++ struct ref_format *format, + struct ref_array *array); + #endif /* REF_FILTER_H */ @@ t/t6301-for-each-ref-errors.sh: test_expect_success 'Missing objects are reporte +test_expect_success 'ahead-behind requires an argument' ' + test_must_fail git for-each-ref \ + --format="%(ahead-behind)" 2>err && -+ grep "expected format: %(ahead-behind:<ref>)" err ++ echo "fatal: expected format: %(ahead-behind:<ref>)" >expect && ++ test_cmp expect err +' + +test_expect_success 'missing ahead-behind base' ' + test_must_fail git for-each-ref \ + --format="%(ahead-behind:refs/heads/missing)" 2>err && -+ grep "failed to find '\''refs/heads/missing'\''" err ++ echo "fatal: failed to find '\''refs/heads/missing'\''" >expect && ++ test_cmp expect err +' + test_done @@ t/t7004-tag.sh: test_expect_success 'annotations for blobs are empty' ' + git commit --allow-empty -m "right" && + git tag -a -m left tag-right && + -+ # Use " !" at the end to demonstrate whitepsace ++ # Use " !" at the end to demonstrate whitespace + # around empty ahead-behind token for tag-blob. + cat >expect <<-EOF && + refs/tags/tag-blob ! 8: f3fb6833bd7 ! 8: 7476a39331e commit-reach: add tips_reachable_from_bases() @@ Commit message Signed-off-by: Derrick Stolee <derrickstolee@xxxxxxxxxx> ## commit-reach.c ## -@@ commit-reach.c: void ahead_behind(struct commit **commits, size_t commits_nr, +@@ commit-reach.c: void ahead_behind(struct repository *r, clear_bit_arrays(&bit_arrays); clear_prio_queue(&queue); } @@ commit-reach.c: void ahead_behind(struct commit **commits, size_t commits_nr, + return 0; +} + -+void tips_reachable_from_bases(struct commit_list *bases, ++void tips_reachable_from_bases(struct repository *r, ++ struct commit_list *bases, + struct commit **tips, size_t tips_nr, + int mark) +{ -+ size_t i; + struct commit_and_index *commits; -+ unsigned int min_generation_index = 0; ++ size_t min_generation_index = 0; + timestamp_t min_generation; + struct commit_list *stack = NULL; + @@ commit-reach.c: void ahead_behind(struct commit **commits, size_t commits_nr, + + CALLOC_ARRAY(commits, tips_nr); + -+ for (i = 0; i < tips_nr; i++) { ++ for (size_t i = 0; i < tips_nr; i++) { + commits[i].commit = tips[i]; + commits[i].index = i; + commits[i].generation = commit_graph_generation(tips[i]); @@ commit-reach.c: void ahead_behind(struct commit **commits, size_t commits_nr, + min_generation = commits[0].generation; + + while (bases) { -+ parse_commit(bases->item); ++ repo_parse_commit(r, bases->item); + commit_list_insert(bases->item, &stack); + bases = bases->next; + } + + while (stack) { -+ unsigned int j; + int explored_all_parents = 1; + struct commit_list *p; + struct commit *c = stack->item; + timestamp_t c_gen = commit_graph_generation(c); + + /* Does it match any of our tips? */ -+ for (j = min_generation_index; j < tips_nr; j++) { ++ for (size_t j = min_generation_index; j < tips_nr; j++) { + if (c_gen < commits[j].generation) + break; + @@ commit-reach.c: void ahead_behind(struct commit **commits, size_t commits_nr, + } + + for (p = c->parents; p; p = p->next) { -+ parse_commit(p->item); ++ repo_parse_commit(r, p->item); + + /* Have we already explored this parent? */ + if (p->item->object.flags & SEEN) @@ commit-reach.c: void ahead_behind(struct commit **commits, size_t commits_nr, + +done: + free(commits); -+ repo_clear_commit_marks(the_repository, SEEN); ++ repo_clear_commit_marks(r, SEEN); +} ## commit-reach.h ## -@@ commit-reach.h: struct ahead_behind_count { - void ahead_behind(struct commit **commits, size_t commits_nr, +@@ commit-reach.h: void ahead_behind(struct repository *r, + struct commit **commits, size_t commits_nr, struct ahead_behind_count *counts, size_t counts_nr); +/* + * For all tip commits, add 'mark' to their flags if and only if they + * are reachable from one of the commits in 'bases'. + */ -+void tips_reachable_from_bases(struct commit_list *bases, ++void tips_reachable_from_bases(struct repository *r, ++ struct commit_list *bases, + struct commit **tips, size_t tips_nr, + int mark); + @@ ref-filter.c: static void reach_filter(struct ref_array *array, - revs.limited = 1; - if (prepare_revision_walk(&revs)) - die(_("revision walk setup failed")); -+ tips_reachable_from_bases(check_reachable, ++ tips_reachable_from_bases(the_repository, ++ check_reachable, + to_clear, array->nr, + UNINTERESTING); -- gitgitgadget