Re: [PATCH v7 11/15] merge-octopus: rewrite in C

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Wed, 24 Mar 2021 00:58:38 +0100 (CET)

Hi Alban,

On Wed, 17 Mar 2021, Alban Gruin wrote:

> This rewrites `git merge-octopus' from shell to C.  As for the two last
> conversions, this port removes calls to external processes to avoid
> reading and writing the index over and over again.
>
>  - Calls to `read-tree -u -m (--aggressive)?' are replaced by calls to
>    unpack_trees().
>
>  - The call to `write-tree' is replaced by a call to
>    write_index_as_tree().
>
>  - The call to `diff-index ...' is replaced by a call to
>    repo_index_has_changes().
>
>  - The call to `merge-index', needed to invoke `git merge-one-file', is
>    replaced by a call to merge_all_index().
>
> The index is read in cmd_merge_octopus(), and is wrote back by

s/wrote/written/

> merge_strategies_octopus().

I wonder why, though. Maybe the commit message could clarify that?

> Here to, merge_strategies_octopus() takes two commit lists and a string

s/to,/too,/

> to reduce frictions when try_merge_strategies() will be modified to call

s/frictions/friction/

> it directly.
>
> Signed-off-by: Alban Gruin <alban.gruin@xxxxxxxxx>
> ---
>
> [...]
> diff --git a/builtin/merge-octopus.c b/builtin/merge-octopus.c
> new file mode 100644
> index 0000000000..9b9939b6b2
> --- /dev/null
> +++ b/builtin/merge-octopus.c
> @@ -0,0 +1,70 @@
> +/*
> + * Builtin "git merge-octopus"
> + *
> + * Copyright (c) 2020 Alban Gruin
> + *
> + * Based on git-merge-octopus.sh, written by Junio C Hamano.
> + *
> + * Resolve two or more trees.
> + */
> +
> +#include "cache.h"
> +#include "builtin.h"
> +#include "commit.h"
> +#include "merge-strategies.h"
> +
> +static const char builtin_merge_octopus_usage[] =
> +	"git merge-octopus [<bases>...] -- <head> <remote1> <remote2> [<remotes>...]";
> +
> +int cmd_merge_octopus(int argc, const char **argv, const char *prefix)
> +{
> +	int i, sep_seen = 0;
> +	struct commit_list *bases = NULL, *remotes = NULL;
> +	struct commit_list **next_base = &bases, **next_remote = &remotes;
> +	const char *head_arg = NULL;
> +	struct repository *r = the_repository;
> +
> +	if (argc < 5)
> +		usage(builtin_merge_octopus_usage);
> +
> +	setup_work_tree();
> +	if (repo_read_index(r) < 0)
> +		die("invalid index");
> +
> +	/*
> +	 * The first parameters up to -- are merge bases; the rest are
> +	 * heads.
> +	 */
> +	for (i = 1; i < argc; i++) {
> +		if (strcmp(argv[i], "--") == 0)
> +			sep_seen = 1;
> +		else if (strcmp(argv[i], "-h") == 0)
> +			usage(builtin_merge_octopus_usage);
> +		else if (sep_seen && !head_arg)
> +			head_arg = argv[i];
> +		else {
> +			struct object_id oid;
> +			struct commit *commit;
> +
> +			if (get_oid(argv[i], &oid))
> +				die("object %s not found.", argv[i]);
> +
> +			commit = oideq(&oid, r->hash_algo->empty_tree) ?
> +				NULL : lookup_commit_or_die(&oid, argv[i]);
> +
> +			if (sep_seen)
> +				next_remote = commit_list_append(commit, next_remote);
> +			else
> +				next_base = commit_list_append(commit, next_base);
> +		}
> +	}
> +
> +	/*
> +	 * Reject if this is not an octopus -- resolve should be used
> +	 * instead.
> +	 */
> +	if (commit_list_count(remotes) < 2)
> +		return 2;

As with `merge-resolve`, I would suggest to:

- move this input validation down to `merge_strategies_octopus()`, and
- change that function's signature to return an `enum`, and then
- make sure that that `enum` uses easy-to-understand labels.

> +
> +	return merge_strategies_octopus(r, bases, head_arg, remotes);
> +}
>
> [...]
>
> diff --git a/merge-strategies.c b/merge-strategies.c
> index a51700dae5..ebc0d0b1e2 100644
> --- a/merge-strategies.c
> +++ b/merge-strategies.c
> @@ -367,3 +368,177 @@ int merge_strategies_resolve(struct repository *r,
>
>  	return 0;
>  }
> +
> +static int write_tree(struct repository *r, struct tree **reference_tree)
> +{
> +	struct object_id oid;
> +	int ret;
> +
> +	if (!(ret = write_index_as_tree(&oid, r->index, r->index_file,
> +					WRITE_TREE_SILENT, NULL)))
> +		*reference_tree = lookup_tree(r, &oid);
> +
> +	return ret;
> +}
> +
> +static int octopus_fast_forward(struct repository *r, const char *branch_name,
> +				struct tree *tree_head, struct tree *current_tree,
> +				struct tree **reference_tree)

While I objected to the name of the `fast_forward()` function, I think the
`octopus_fast_forward()` function is named aptly.

> +{
> +	/*
> +	 * The first head being merged was a fast-forward.  Advance the
> +	 * reference commit to the head being merged, and use that tree
> +	 * as the intermediate result of the merge.  We still need to
> +	 * count this as part of the parent set.
> +	 */
> +	struct tree_desc t[2];
> +
> +	printf(_("Fast-forwarding to: %s\n"), branch_name);
> +
> +	init_tree_desc(t, tree_head->buffer, tree_head->size);
> +	if (add_tree(current_tree, t + 1))
> +		return -1;
> +	if (fast_forward(r, t, 2, 0))
> +		return -1;
> +	if (write_tree(r, reference_tree))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static int octopus_do_merge(struct repository *r, const char *branch_name,
> +			    struct commit_list *common, struct tree *current_tree,
> +			    struct tree **reference_tree)
> +{
> +	struct tree_desc t[MAX_UNPACK_TREES];
> +	struct commit_list *i;
> +	int nr = 0, ret = 0;
> +
> +	printf(_("Trying simple merge with %s\n"), branch_name);
> +
> +	for (i = common; i; i = i->next) {
> +		struct tree *tree = repo_get_commit_tree(r, i->item);
> +		if (add_tree(tree, t + (nr++)))
> +			return -1;
> +	}
> +
> +	if (add_tree(*reference_tree, t + (nr++)))
> +		return -1;
> +	if (add_tree(current_tree, t + (nr++)))
> +		return -1;
> +	if (fast_forward(r, t, nr, 1))
> +		return 2;
> +
> +	if (write_tree(r, reference_tree)) {
> +		struct lock_file lock = LOCK_INIT;
> +
> +		puts(_("Simple merge did not work, trying automatic merge."));
> +		repo_hold_locked_index(r, &lock, LOCK_DIE_ON_ERROR);

It is a bit funny to see this as the only time in this patch where the
index is locked, and it is immediately released thereafter.

I would have expected the lock to be taken first thing in
`merge_strategies_octopus()` and then being committed only on success, or
on failure to merge.

> +		ret = !!merge_all_index(r->index, 0, 0, merge_one_file_func, NULL);
> +		write_locked_index(r->index, &lock, COMMIT_LOCK);
> +
> +		write_tree(r, reference_tree);
> +	}
> +
> +	return ret;
> +}
> +
> +int merge_strategies_octopus(struct repository *r,
> +			     struct commit_list *bases, const char *head_arg,
> +			     struct commit_list *remotes)
> +{
> +	int ff_merge = 1, ret = 0, nr_references = 1;
> +	struct commit **reference_commits, *head_commit;
> +	struct tree *reference_tree, *head_tree;
> +	struct commit_list *i;
> +	struct object_id head;
> +	struct strbuf sb = STRBUF_INIT;
> +
> +	get_oid(head_arg, &head);
> +	head_commit = lookup_commit_reference(r, &head);
> +	head_tree = repo_get_commit_tree(r, head_commit);
> +
> +	if (parse_tree(head_tree))
> +		return 2;
> +
> +	if (repo_index_has_changes(r, head_tree, &sb)) {
> +		error(_("Your local changes to the following files "
> +			"would be overwritten by merge:\n  %s"),
> +		      sb.buf);
> +		strbuf_release(&sb);
> +		return 2;
> +	}
> +
> +	CALLOC_ARRAY(reference_commits, commit_list_count(remotes) + 1);
> +	reference_commits[0] = head_commit;
> +	reference_tree = head_tree;
> +
> +	for (i = remotes; i && i->item; i = i->next) {
> +		struct commit *c = i->item;
> +		struct object_id *oid = &c->object.oid;
> +		struct tree *current_tree = repo_get_commit_tree(r, c);
> +		struct commit_list *common, *j;
> +		char *branch_name = merge_get_better_branch_name(oid_to_hex(oid));
> +		int up_to_date = 0;
> +
> +		common = repo_get_merge_bases_many(r, c, nr_references, reference_commits);
> +		if (!common) {
> +			error(_("Unable to find common commit with %s"), branch_name);
> +
> +			free(branch_name);
> +			free_commit_list(common);
> +			free(reference_commits);
> +
> +			return 2;
> +		}
> +
> +		for (j = common; j && !up_to_date && ff_merge; j = j->next) {
> +			up_to_date |= oideq(&j->item->object.oid, oid);

Semantically, I would argue that this is an `||=`, not `|=`: we want a
Boolean "or", not a bit-wise one.

> +
> +			if (!j->next &&
> +			    !oideq(&j->item->object.oid,
> +				   &reference_commits[nr_references - 1]->object.oid))
> +				ff_merge = 0;
> +		}

Hmm. This is combining two things into the same loop, with a combined loop
condition. The two things are:

	case "$LF$common$LF" in
        *"$LF$SHA1$LF"*)
                eval_gettextln "Already up to date with \$pretty_name"
                continue
                ;;
        esac

        if test "$common,$NON_FF_MERGE" = "$MRC,0"
        then
                # The first head being merged was a fast-forward.
                # Advance MRC to the head being merged, and use that
                # tree as the intermediate result of the merge.
                # We still need to count this as part of the parent set.

                eval_gettextln "Fast-forwarding to: \$pretty_name"
                git read-tree -u -m $head $SHA1 || exit
                MRC=$SHA1 MRT=$(git write-tree)
                continue
        fi

        NON_FF_MERGE=1

The first one tries to verify that the `common` list contains `oid`. The C
code does this, too, using the intuitive variable name `up_to_date`, which
is good.

Now, big question: is there a way for the loop to exit before we had a
chance to see the common commit that is identical to `oid`? And I think
there is: `ff_merge` is not reset between the outer loop (the one
iterating over `remotes`). If that is the case, then we would miss that
we're already up to date.

Next thing is that `if test "$common,$NON_FF_MERGE" = "$MRC,0"` thing.
This is turned into that `if (!j->next && ...)` thing, and I _think_ that
it does the wrong thing. Rather than verifying that the `common` list
is identical to "MRC" (= the merge reference list), it would only ever
compare the last entries of `common` and MRC.

I have a hard time convincing myself that this is idempotent to the shell
script version.

Instead, I think it should read somewhat like this:

		for (j = common, k = 0; j && (!up_to_date || ff_merge); j = j->next) {
			up_to_date ||= oideq(&j->item->object.oid, oid);

			if (ff_merge &&
			    (k >= nr_references ||
			     !oideq(&j->item->object.oid,
				    &reference_commits[k++]->object.oid))
				ff_merge = 0;
		}

But quite honestly, this still looks "too clever" and too fragile to me.
For something as rare as an octopus merge, I'd _much_ rather have simpler
code that is easy to reason about and does the job reliably (if somewhat
slower than a hyper-optimized version):

		/*
		 * If `oid` is reachable from `HEAD`, we're already up to
		 * date.
		 */
		for (j = common; j; j = j->next)
			if (oideq(&j->item->object.oid, oid)) {
				up_to_date = 1;
				break;
			}

		if (up_to_date) {
			printf(_("Already up to date with %s\n"), branch_name);

			free(branch_name);
			free_commit_list(common);
			continue;
		}

		for (j = common, k = 0; ff_merge && j; j = j->next)
			if (k >= nr_references ||
			    !oideq(&j->item->object.oid,
				   &reference_commits[k++]->object.oid))
				ff_merge = 0;
		if (k != nr_references)
			ff_merge = 0;

But the more I stare at the shell script code, the more I start to believe
that this `MRC` business is just a very convoluted way to essentially
verify that the `HEAD` is the _single_ merge base.

I say that because I cannot fail to notice that `$common` separates the
merge bases by newlines, while `$MRC` separates its entries by spaces.
Therefore,

		test "$common,$NON_FF_MERGE" = "$MRC,0"

can only ever evaluate to `true` if both `$common` and `$MRC` contains
exactly one and the same oid, namely the one of the revision to which we
just fast-forwarded in the previous iteration.

Therefore, the logic does not even need a loop. It would be as trivial as:

		/*
		 * If we could fast-forward so far and `HEAD` is the
		 * single merge base with the current `remote` revision,
		 * keep fast-forwarding.
		 */
		if (ff_merge && common && !common->next && nr_references == 1 &&
		    oideq(common->item->object.oid,
			  reference_commit[0]->object.oid)) {
			ret = octopus_fast_forward(r, branch_name, head_tree,
						   current_tree, &reference_tree);
			nr_references = 0;
		} else {
			ff_merge = 0;
			ret = octopus_do_merge(r, branch_name, common,
					       current_tree, &reference_tree);
		}

> +
> +		if (up_to_date) {
> +			printf(_("Already up to date with %s\n"), branch_name);
> +
> +			free(branch_name);
> +			free_commit_list(common);
> +			continue;
> +		}
> +
> +		if (ff_merge) {
> +			ret = octopus_fast_forward(r, branch_name, head_tree,
> +						   current_tree, &reference_tree);
> +			nr_references = 0;
> +		} else {
> +			ret = octopus_do_merge(r, branch_name, common,
> +					       current_tree, &reference_tree);
> +		}
> +
> +		free(branch_name);
> +		free_commit_list(common);
> +
> +		if (ret == -1 || ret == 2)
> +			break;
> +		else if (ret && i->next) {
> +			/*
> +			 * We allow only last one to have a
> +			 * hand-resolvable conflicts.  Last round failed
> +			 * and we still had a head to merge.
> +			 */
> +			puts(_("Automated merge did not work."));
> +			puts(_("Should not be doing an octopus."));
> +
> +			free(reference_commits);
> +			return 2;

I see that you moved this block from the beginning of the loop to the end
(in the script, it was at the start of the loop). This is a good change.

I wonder, though, whether it wouldn't make more sense to replace the last
two lines with this:

			ret = 2;
			break;

That way, we need not worry about releasing resources in multiple places
in the future: it will all be done at the end of the function.

Phew. What a lot to unpack.

Please let me express my gratitude for working on this. My many comments
may seem as if I am unhappy with the progress, but nothing could be
further from the truth. I am impressed by your tenacity, and I hope that I
could do my little bit to make this patch series as good as we can.

Thanks,
Dscho

> +		}
> +
> +		reference_commits[nr_references++] = c;
> +	}
> +
> +	free(reference_commits);
> +	return ret;
> +}
> diff --git a/merge-strategies.h b/merge-strategies.h
> index bba4bf999c..8de2249ee6 100644
> --- a/merge-strategies.h
> +++ b/merge-strategies.h
> @@ -32,5 +32,8 @@ int merge_all_index(struct index_state *istate, int oneshot, int quiet,
>  int merge_strategies_resolve(struct repository *r,
>  			     struct commit_list *bases, const char *head_arg,
>  			     struct commit_list *remote);
> +int merge_strategies_octopus(struct repository *r,
> +			     struct commit_list *bases, const char *head_arg,
> +			     struct commit_list *remote);
>
>  #endif /* MERGE_STRATEGIES_H */
> --
> 2.31.0
>
>