Re: [PATCH] Create '--merges-only' option for 'git bisect'

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 12 Apr 2018 13:33:00 +0200 (DST)

Hi Harald,

Thank you for working on this.

On Thu, 12 Apr 2018, Harald Nordgren wrote:

> When ran with '--merges-only', git bisect will only look at merge
> commits -- commits with 2 or more parents or the initial commit.

This is an excellent idea!

>     Proof of concept of a feature that I have wanted in Git for a while.
>     In my daily work my company uses GitHub, which creates lots of merge
>     commits. In general, tests are only ran on the tip of a branch
>     before merging, so the different commits within a merge commit are
>     allowed not to be buildable. Therefore 'git bisect' often doesn't
>     work since it will give lots of false positives for anything that is
>     not a merge commit. If we could have a feature to only bisect merge
>     commits then it would be easier to pinpoint which merge causes any
>     particular issue. After that, a bisect could be done within the
>     merge to pinpoint futher. As a follow-up to this patch -- we could
>     potentially create a feature that automatically continues into
>     regular bisect within the bad merge commit after completed
>     '--merges-only' bisection.

It also helps when bisecting breakages in `pu` (mainly because the
branches in `pu` use branch points that are often insanely far in the
past).

> diff --git a/bisect.c b/bisect.c
> index a579b5088..e8a2f77c7 100644
> --- a/bisect.c
> +++ b/bisect.c
> @@ -254,7 +254,7 @@ static struct commit_list *best_bisection_sorted(struct commit_list *list, int n
>   */
>  static struct commit_list *do_find_bisection(struct commit_list *list,
>  					     int nr, int *weights,
> -					     int find_all)
> +					     int find_all, int only_merge_commits)

How about `int flags`, and defining FIND_ALL and ONLY_MERGE_COMMITS?

That way, it will be easy to add ONLY_FIRST_PARENTS without changing the
signature of do_find_bisection().

Ideally, a preparatory commit would change find_all -> flags (adding
FIND_ALL), and the second commit would then add support for
ONLY_MERGE_COMMITS.

> @@ -266,6 +266,13 @@ static struct commit_list *do_find_bisection(struct commit_list *list,
>  		unsigned flags = commit->object.flags;
>  
>  		p->item->util = &weights[n++];
> +
> +		if (only_merge_commits) {
> +			weight_set(p, -2);
> +			counted++;
> +			continue;
> +		}
> +

This hunk is a little hard to understand when you come from elsewhere, as
I do. Could you maybe explain a little bit what the `weight_set(p, -2)`
does? This is probably good material for enhancing the commit message
(seeing as we understand the commit message to be kind of the "convince me
that this is a good change, and explain things that are not immediately
obvious from the diff" document).

Maybe it would be enough to describe the role of the weight, and what the
typical values look like.

> @@ -305,11 +312,17 @@ static struct commit_list *do_find_bisection(struct commit_list *list,
>  	 * way, and then fill the blanks using cheaper algorithm.
>  	 */
>  	for (p = list; p; p = p->next) {
> +		int distance;
>  		if (p->item->object.flags & UNINTERESTING)
>  			continue;
>  		if (weight(p) != -2)
>  			continue;
> -		weight_set(p, count_distance(p));
> +
> +		if (only_merge_commits)
> +			distance = count_distance(p) - 1;
> +		else
> +			distance = count_distance(p);
> +		weight_set(p, distance);

A shorter way:

		weight_set(p, count_distance(p) - !!only_merge_commits);

Could you add a code comment above this line to explain why this is
needed? (I have to admit that I have no clue what the weights or the
distances are in this context, but I think that a 3-line explanation could
probably give me enough of an idea that I do not have to study an hour or
three to learn enough to understand this change.)

>  		clear_distance(list);
>  
>  		/* Does it happen to be at exactly half-way? */
> @@ -330,11 +343,17 @@ static struct commit_list *do_find_bisection(struct commit_list *list,
>  			for (q = p->item->parents; q; q = q->next) {
>  				if (q->item->object.flags & UNINTERESTING)
>  					continue;
> +				if (!q->item->util)
> +					break;

So we use the `util` field now to do things, probably to flag some
property of the commit.

Maybe the commit message could prepare the reader for this, with a
paragraph starting with "We use the commits' `util` field to store the
information that [...]"?

Seeing as this loop iterates over the commit's parents, I guess we are
looking at merge commits, and drop out early if we find a parent that was
not marked as interesting earlier?

If this is the case, why do we not have to look further, at later parents,
whether they *do* have a non-NULL `util` attribute?

>  				if (0 <= weight(q))
>  					break;
>  			}
>  			if (!q)
>  				continue;
> +			if (!q->item->util) {
> +				counted++;
> +				continue;
> +			}

Okay, so here we skip the current commit if we found a parent commit that
has the `util` field set to NULL...

> @@ -364,8 +383,43 @@ static struct commit_list *do_find_bisection(struct commit_list *list,
>  		return best_bisection_sorted(list, nr);
>  }
>  
> +int merge_commit_or_root(const struct commit c)

By value? Let's use `*c` instead of `c` here.

> +{
> +	if (!c.parents)
> +		return 1;
> +
> +	return !!c.parents->next;

Unless you need the return value to be 0 or 1 (instead of 0 or non-0),
this is conciser:

	return !c.parents || c.parents->next;

However, we should not detect root commits here. We are only interested in
merges, according to the new option! So we should filter out root commits,
too. So this is what you really want:

	int is_merge(const struct commit *c)
	{
		return c->parents && c->parents->next;
	}

> +void filter_non_merge_commits(struct commit_list **commit_list)

So does this keep non_merge_commits (similar to Makefile's $(filter ...)
function?) Or does it remove them?

Judging by the code, maybe "filter_out_non_merge_commits()" or
"drop_regular_commits()"? Or "keep_only_root_and_merge_commits()"?

> +{
> +	struct commit_list *list1 = *commit_list;
> +	struct commit_list *list2 = NULL;
> +	*commit_list = NULL;
> +
> +	for ( ; list1; list1 = list1->next) {
> +		if (merge_commit_or_root(*list1->item)) {
> +			list2 = list1;
> +			list1 = list1->next;
> +			list2->next = NULL;

Since commit_lists are malloc()ed, they also need to be free()d when
filtering something out.

> +			list2->item->parents = NULL;

Why do we touch the parents here? This is a bit dangerous, as the item
refers to the commit that `lookup_commit()` would produce again. And by
setting parents = NULL, we pretend that those commits are root commits
from now on.

> +			*commit_list = list2;
> +			break;

Also, using the non-descriptive names `list1` and `list2` makes code a bit
hard to understand. (And hence it makes it easy for bugs to creep in.)

> +		}
> +	}
> +
> +	for ( ; list1; list1 = list1->next) {

The previous for loop has the condition `list1` already. So at this point
we know that `list1 == NULL`, and therefore this for loop is dead code.

Did you mean `for (list1 = *commit_list; list1; list1 = list1->next)`?

> +		list2->next = NULL;

Where is list2 set? Only in the previous loop, if it found a merge or root
commit. Otherwise, list2 would still be NULL and dereferenced here!

> +		if (merge_commit_or_root(*list1->item)) {
> +			list2->next = list1;
> +			list2 = list2->next;
> +			list2->item->parents = list1;

And here, we set the parents to list1... Again, this is messing with the
commits that come back when `lookup_commit()` is called later.

Changing the parents of a commit is something I imagine breaks too many
assumption to be a safe thing.

I'll read on for a little bit, maybe I get a better understanding what we
need to do in this here function.

> +		}
> +	}
> +}
> +
>  void find_bisection(struct commit_list **commit_list, int *reaches,
> -		    int *all, int find_all)
> +		    int *all, int find_all, int only_merge_commits)

Let's use a flag here, just as above.

>  {
>  	int nr, on_list;
>  	struct commit_list *list, *p, *best, *next, *last;
> @@ -373,6 +427,10 @@ void find_bisection(struct commit_list **commit_list, int *reaches,
>  
>  	show_list("bisection 2 entry", 0, 0, *commit_list);
>  
> +	if (only_merge_commits) {
> +		filter_non_merge_commits(commit_list);
> +	}
> +
>  	/*
>  	 * Count the number of total and tree-changing items on the
>  	 * list, while reversing the list.
> @@ -400,7 +458,7 @@ void find_bisection(struct commit_list **commit_list, int *reaches,
>  	weights = xcalloc(on_list, sizeof(*weights));
>  
>  	/* Do the real work of finding bisection commit. */
> -	best = do_find_bisection(list, nr, weights, find_all);
> +	best = do_find_bisection(list, nr, weights, find_all, only_merge_commits);
>  	if (best) {
>  		if (!find_all) {
>  			list->item = best->item;
> @@ -878,7 +936,7 @@ static void check_good_are_ancestors_of_bad(const char *prefix, int no_checkout)
>  /*
>   * This does "git diff-tree --pretty COMMIT" without one fork+exec.
>   */
> -static void show_diff_tree(const char *prefix, struct commit *commit)
> +static void show_diff_tree(const char *prefix, struct commit *commit, int only_merge_commits)
>  {
>  	struct rev_info opt;
>  
> @@ -893,6 +951,11 @@ static void show_diff_tree(const char *prefix, struct commit *commit)
>  	opt.use_terminator = 0;
>  	opt.commit_format = CMIT_FMT_DEFAULT;
>  
> +	if (only_merge_commits) {
> +		opt.ignore_merges = 0;
> +		opt.combine_merges = 1;
> +	}

This is most likely what we want, whether --only-merge-commits was passed
or not. Let's make it the default (in a separate commit)?

> @@ -938,7 +1001,7 @@ void read_bisect_terms(const char **read_bad, const char **read_good)
>   * If no_checkout is non-zero, the bisection process does not
>   * checkout the trial commit but instead simply updates BISECT_HEAD.
>   */
> -int bisect_next_all(const char *prefix, int no_checkout)
> +int bisect_next_all(const char *prefix, int no_checkout, int only_merge_commits)

Let's use `flags` here, too.

>  {
>  	struct rev_info revs;
>  	struct commit_list *tried;
> @@ -957,7 +1020,7 @@ int bisect_next_all(const char *prefix, int no_checkout)
>  
>  	bisect_common(&revs);
>  
> -	find_bisection(&revs.commits, &reaches, &all, !!skipped_revs.nr);
> +	find_bisection(&revs.commits, &reaches, &all, !!skipped_revs.nr, only_merge_commits);

With the `flags` change, we would probably just introduce a local variable
`int flags = 0;` and set `flags |= !!skipped_revs.nr;` here at first, and
only change the `flags` variable to a parameter when introducing the
--only-merge-commits flag.

> @@ -983,10 +1046,14 @@ int bisect_next_all(const char *prefix, int no_checkout)
>  	bisect_rev = &revs.commits->item->object.oid;
>  
>  	if (!oidcmp(bisect_rev, current_bad_oid)) {
> +		char *format_string = NULL;
>  		exit_if_skipped_commits(tried, current_bad_oid);
> -		printf("%s is the first %s commit\n", oid_to_hex(bisect_rev),
> -			term_bad);
> -		show_diff_tree(prefix, revs.commits->item);
> +		if (only_merge_commits)
> +			format_string = "%s is the first %s merge\n";
> +		else
> +			format_string = "%s is the first %s commit\n";

... why? We can reference merge commits as "commits", too... Why not leave
the message as-is?

> diff --git a/bisect.h b/bisect.h
> index a5d9248a4..664ada180 100644
> --- a/bisect.h
> +++ b/bisect.h
> @@ -9,7 +9,7 @@
>   * best commit, as chosen by `find_all`.
>   */
>  extern void find_bisection(struct commit_list **list, int *reaches, int *all,
> -			   int find_all);
> +			   int find_all, int only_merge_commits);

Oh, I missed that this is a global function. So the flags should not be
FIND_ALL and ONLY_MERGE_COMMITS but BISECT_FIND_ALL and
BISECT_ONLY_MERGE_COMMITS, and `#define`d above this function.

> diff --git a/builtin/bisect--helper.c b/builtin/bisect--helper.c
> index 4b5fadcbe..9d7a1dd23 100644
> --- a/builtin/bisect--helper.c
> +++ b/builtin/bisect--helper.c
> @@ -106,6 +106,32 @@ static void check_expected_revs(const char **revs, int rev_nr)
>  	}
>  }
>  
> +static GIT_PATH_FUNC(git_path_bisect_merges_only, "MERGES_ONLY_BISECT")

The other bisect-specific files start with BISECT_, let's do that here,
too (i.e. "BISECT_ONLY_MERGES")?

> +static int merges_only(void)
> +{
> +	const char *filename = git_path_bisect_merges_only();
> +	struct stat st;
> +	struct strbuf str = STRBUF_INIT;
> +	FILE *fp;
> +	int res = 0;
> +
> +	if (stat(filename, &st) || !S_ISREG(st.st_mode))
> +		return 0;
> +
> +	fp = fopen_or_warn(filename, "r");
> +	if (!fp)
> +		return 0;
> +
> +	if (strbuf_getline_lf(&str, fp) != EOF)
> +		res = atoi(str.buf);

Why not use the presence of the file as flag, rather than its contents?
Then the test becomes as simple as

	int merges_only = file_exists(git_path_bisect_merges_only());

> +
> +	strbuf_release(&str);
> +	fclose(fp);
> +
> +	return res;
> +}
> +
>  int cmd_bisect__helper(int argc, const char **argv, const char *prefix)
>  {
>  	enum {
> @@ -137,7 +163,7 @@ int cmd_bisect__helper(int argc, const char **argv, const char *prefix)
>  
>  	switch (cmdmode) {
>  	case NEXT_ALL:
> -		return bisect_next_all(prefix, no_checkout);
> +		return bisect_next_all(prefix, no_checkout, merges_only());
>  	case WRITE_TERMS:
>  		if (argc != 2)
>  			return error(_("--write-terms requires two arguments"));
> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
> index fadd3ec14..cacffe6c6 100644
> --- a/builtin/rev-list.c
> +++ b/builtin/rev-list.c
> @@ -538,7 +538,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
>  	if (bisect_list) {
>  		int reaches, all;
>  
> -		find_bisection(&revs.commits, &reaches, &all, bisect_find_all);
> +		find_bisection(&revs.commits, &reaches, &all, bisect_find_all, 0);

This would become bisect_flags in the preparatory patch.

>  
>  		if (bisect_show_vars)
>  			return show_bisect_vars(&info, reaches, all);
> diff --git a/git-bisect.sh b/git-bisect.sh
> index 54cbfecc5..730c983c5 100755
> --- a/git-bisect.sh
> +++ b/git-bisect.sh
> @@ -82,6 +82,7 @@ bisect_start() {
>  	bad_seen=0
>  	eval=''
>  	must_write_terms=0
> +	merges_only=0
>  	revs=''
>  	if test "z$(git rev-parse --is-bare-repository)" != zfalse
>  	then
> @@ -96,6 +97,9 @@ bisect_start() {
>  			shift
>  			break
>  		;;
> +		--merges-only)
> +			merges_only=1
> +			shift ;;
>  		--no-checkout)
>  			mode=--no-checkout
>  			shift ;;
> @@ -212,6 +216,7 @@ bisect_start() {
>  		git bisect--helper --write-terms "$TERM_BAD" "$TERM_GOOD" || exit
>  	fi &&
>  	echo "git bisect start$orig_args" >>"$GIT_DIR/BISECT_LOG" || exit
> +	echo "$merges_only" >"$GIT_DIR/MERGES_ONLY_BISECT" || exit

The git-bisect.sh script deviates from our coding convention, it seems.
Typically, we would write

	merges_only=
	[...]
		--merges-only)
			merges_only=t
			;;
	[...]
	test -z "$merges_only" || : >"$GIT_DIR/BISECT_ONLY_MERGES" || exit

I would prefer that here, too.

> diff --git a/t/t6070-bisect-merge-commits.sh b/t/t6070-bisect-merge-commits.sh
> new file mode 100755
> index 000000000..edfae0cfd
> --- /dev/null
> +++ b/t/t6070-bisect-merge-commits.sh
> @@ -0,0 +1,116 @@
> +#!/bin/sh
> +#
> +# Copyright (c) 2018 Harald Nordgren
> +#
> +test_description='Tests git bisect merge commit functionality'
> +
> +exec </dev/null

This is unnecessary, as our test-lib.sh already executes `exec 6<&0`.
Further, it is not only unnecessary, but also breaks debugging with the
`debug` shell function defined in test-lib-function.sh. Let's just drop
this statement.

> +
> +. ./test-lib.sh
> +
> +add_line_into_file()
> +{
> +    _line=$1
> +    _file=$2
> +
> +    if [ -f "$_file" ]; then
> +        echo "$_line" >> $_file || return $?
> +        MSG="Add <$_line> into <$_file>."
> +    else
> +        echo "$_line" > $_file || return $?
> +        git add $_file || return $?
> +        MSG="Create file <$_file> with <$_line> inside."
> +    fi
> +
> +    test_tick
> +    git commit --quiet -m "$MSG" $_file
> +}

It would appear that the contents of the files written using this function
are never tested.

Let's just use `test_commit` instead.

> +
> +create_merge_commit()
> +{
> +    _branch=$1
> +    _file=$2
> +
> +     git checkout -b $_branch &&
> +     add_line_into_file "hello" $_file &&
> +     git checkout master &&
> +     git merge $_branch --no-edit --no-ff
> +}

This not only creates a merge commit but also sets up a branch and adds a
commit to it. I'd much rather have a function that performs the merge
after calling `test_tick`, and then tags the result just like
`test_commit` would do.

> +HASH1=
> +HASH2=
> +HASH3=
> +HASH4=
> +HASH5=
> +HASH6=

These are too non-descriptive names to be useful. Remember that the
purpose of a regression test is to help debugging regressions. If I
encounter a regression while developing a patch series that breaks this
test, I really need to understand the regression test as quickly as
possible.

So let's use HEAD~4 and/or the tags creates by `test_commit` instead.

> +test_expect_success 'set up basic repo with 3 files and 3 merge commits' '

If we say 'setup' here, it is not only conciser, but also less prone to
become incorrect in the future.

> +    add_line_into_file "hello" hello &&
> +    HASH1=$(git rev-parse --verify HEAD) &&
> +
> +    add_line_into_file "hello" hello &&
> +    git checkout -b branch1 &&
> +    add_line_into_file "hello" hello1 &&
> +    git checkout master &&
> +    git checkout -b branch2 &&
> +    add_line_into_file "hello" hello2 &&
> +    git checkout master &&
> +    git checkout -b branch3 &&
> +    add_line_into_file "hello" hello3 &&
> +    git checkout master &&
> +    git merge branch1 --no-edit --no-ff &&
> +    HASH2=$(git rev-parse --verify HEAD) &&
> +
> +    add_line_into_file "hello" hello &&
> +    add_line_into_file "hello" hello &&
> +    git merge branch2 --no-edit --no-ff &&
> +    git merge branch3 --no-edit --no-ff &&
> +    HASH3=$(git rev-parse --verify HEAD) &&
> +
> +    add_line_into_file "hello" hello &&
> +    HASH4=$(git rev-parse --verify HEAD) &&
> +
> +    create_merge_commit branch4 hello4 &&
> +    create_merge_commit branch5 hello5 &&
> +    create_merge_commit branch6 hello6 &&
> +    create_merge_commit branch7 hello7 &&
> +    create_merge_commit branch8 hello8 &&
> +    create_merge_commit branch9 hello9 &&
> +    create_merge_commit branch10 hello10 &&
> +    create_merge_commit branch11 hello11 &&
> +    create_merge_commit branch12 hello12 &&
> +    create_merge_commit branch13 hello13 &&
> +    create_merge_commit branch14 hello14 &&
> +    create_merge_commit branch15 hello15 &&
> +    create_merge_commit branch16 hello16 &&
> +    HASH5=$(git rev-parse --verify HEAD) &&
> +
> +    create_merge_commit branch17 hello17 &&
> +    create_merge_commit branch18 hello18 &&
> +    create_merge_commit branch19 hello19 &&
> +    create_merge_commit branch20 hello20 &&
> +    HASH6=$(git rev-parse --verify HEAD)
> +'

It is next to impossible to understand this function in an 80x25 terminal
(which many of us old-timers still use). And it is not even necessary in
order to test the functionality you introduced...

So how about this strategy: first, paint a nice little diagram of the
commit graph you want to achieve, and make it not as insanely complex.
Then, set that up using `test_commit`, `git checkout -b` and `test_tick &&
git merge`. Something like this:

# We generate the following commit graph:
#
# A - B - C - F
#   \   \   /   \
#     D - E - G - H

test_merge () {
	test_tick &&
	git merge "$@" &&
	git tag "$1"
}

test_expect_success 'setup' '
	test_commit A &&
	test_commit B &&
	test_commit C &&
	git reset --hard A &&
	test_commit D &&
	test_merge -m E B &&
	git reset --hard C &&
	git merge -m F E &&
	git reset --hard G &&
	git merge -m H F
'

> +test_expect_success 'bisect skip: successful result' '

I do not actually see that `bisect skip` is tested here.

> +	test_when_finished git bisect reset &&
> +	git bisect reset &&
> +	git bisect start $HASH4 $HASH1 --merges-only &&
> +	git bisect good &&
> +	git bisect good &&
> +	git bisect bad > my_bisect_log.txt &&
> +	grep "$HASH3 is the first bad merge" my_bisect_log.txt
> +'

What we want to verify here is that we only encounter merge commits,
right? So that is what we should test:

test_expect_success '--merges-only' '
	write_script find-c.sh <<-\EOF &&
	case "$(git show --format=%p -s)" in
	*\ *) ;; # merge commit: okay
	*) git rev-parse HEAD >non-merge-found;;
	esac
	# detect whether we are "before" or "after" C
	test ! -f C.t
	EOF

	git bisect start HEAD A &&
	git bisect run ./find-c.sh >actual &&
	test_path_is_missing non-merge-found &&
	grep "$(git rev-parse F) is the first bad commit" actual
'

> +
> +test_expect_success 'bisect skip: successful result' '
> +	test_when_finished git bisect reset &&
> +	git bisect reset &&
> +	git bisect start $HASH6 $HASH2 --merges-only &&
> +	git bisect good &&
> +	git bisect good &&
> +	git bisect bad &&
> +	git bisect bad > my_bisect_log.txt &&
> +	grep "$HASH5 is the first bad merge" my_bisect_log.txt
> +'
> +
> +test_done

So after reading this, I am quite a bit more puzzled about that
commit->util field we seem to use. Could you clarify that in the commit
message, to give readers a head start?

Okay, now let's come back to that function accepting a commit_list and
dropping all non-merge commits from it.

First of all, I do not think that we actually make use of the parents
information after calling this function, right? If we did, I think it
would be horribly broken because all non-merge commits would now be
looking like root commits, and all the merge commits would pretend to have
*all* of the previous merge commits as parents (the first merge commit
would therefore look like a root commit, the second merge commit would
look like a non-merge, and so on).

So let's just drop the parent rewriting? If we do need to collapse the
commit graph to the merge commit nodes, then a much better idea would be
to use an oidset with commit_lists representing the collapsed vertices in
that graph (i.e. for every commit, it would replace the non-merge parents
with the respective closest ancestor that *is* a merge, and this list
would be constructed in reverse topological order to give linear
complexity).

Under the assumption that we can leave the commits' parents alone, this is
my proposed function to keep only merge commits in a commit_list:

static void keep_only_merge_commits(struct commit_list **p)
{
	struct commit_list *parents, *next;

	while (*p) {
		parents = (*p)->item->parents;
		if (parents && parents->next) {
			/* merge commit: keep */
			p = &(*p)->next;
			continue;
		}

		/* non-merge commit: free() and skip */
		next = (*p)->next;
		free(*p);
		*p = next;
	}
}

This makes use of the fact that both the original list of commits, as well
as the `next` field in the commit_list items, is a pointer to a
commit_list. So once we processed the first item of the list, we either
free it and replace the list by the remainder of the list, or we advance
to the "sub list" starting with the second item of the list (if there is
any left).

Ciao,
Johannes