Re: [PATCH v3 02/11] merge-one-file: rewrite in C

Alban Gruin <alban.gruin@xxxxxxxxx> · Wed, 21 Oct 2020 21:47:39 +0200

Hi Junio,

On 07/10/2020 00:01, Junio C Hamano wrote:
> Alban Gruin <alban.gruin@xxxxxxxxx> writes:
> 
>> This rewrites `git merge-one-file' from shell to C.  This port is not
>> completely straightforward: to save precious cycles by avoiding reading
>> and flushing the index repeatedly, write temporary files when an
>> operation can be performed in-memory, or allow other function to use the
>> rewrite without forking nor worrying about the index,...
> 
> So, the in-core index is still used, but when the contents of the in-core
> index does not have to be written out disk, we just don't?  Makes sense.
> 

>> diff --git a/builtin/merge-one-file.c b/builtin/merge-one-file.c
>> new file mode 100644
>> index 0000000000..598338ba16
>> --- /dev/null
>> +++ b/builtin/merge-one-file.c
>> @@ -0,0 +1,92 @@
>> +/*
>> + * Builtin "git merge-one-file"
>> + *
>> + * Copyright (c) 2020 Alban Gruin
>> + *
>> + * Based on git-merge-one-file.sh, written by Linus Torvalds.
>> + *
>> + * This is the git per-file merge utility, called with
>> + *
>> + *   argv[1] - original file SHA1 (or empty)
>> + *   argv[2] - file in branch1 SHA1 (or empty)
>> + *   argv[3] - file in branch2 SHA1 (or empty)
> 
> Let's modernize this comment while we are at it.
> 
>     SHA1 -> "object name" (or "blob object name")
> 
>> + *   argv[4] - pathname in repository
>> + *   argv[5] - original file mode (or empty)
>> + *   argv[6] - file in branch1 mode (or empty)
>> + *   argv[7] - file in branch2 mode (or empty)
>> + *
>> + * Handle some trivial cases. The _really_ trivial cases have been
>> + * handled already by git read-tree, but that one doesn't do any merges
>> + * that might change the tree layout.
>> + */
>> +
>> +#define USE_THE_INDEX_COMPATIBILITY_MACROS
>> +#include "cache.h"
>> +#include "builtin.h"
>> +#include "lockfile.h"
>> +#include "merge-strategies.h"
>> +
>> +static const char builtin_merge_one_file_usage[] =
>> +	"git merge-one-file <orig blob> <our blob> <their blob> <path> "
>> +	"<orig mode> <our mode> <their mode>\n\n"
>> +	"Blob ids and modes should be empty for missing files.";
>> +
>> +static int read_mode(const char *name, const char *arg, unsigned int *mode)
>> +{
>> +	char *last;
>> +	int ret = 0;
>> +
>> +	*mode = strtol(arg, &last, 8);
>> +
>> +	if (*last)
>> +		ret = error(_("invalid '%s' mode: expected nothing, got '%c'"), name, *last);
>> +	else if (!(S_ISREG(*mode) || S_ISDIR(*mode) || S_ISLNK(*mode)))
>> +		ret = error(_("invalid '%s' mode: %o"), name, *mode);
>> +
>> +	return ret;
>> +}
>> +
>> +int cmd_merge_one_file(int argc, const char **argv, const char *prefix)
>> +{
>> +	struct object_id orig_blob, our_blob, their_blob,
>> +		*p_orig_blob = NULL, *p_our_blob = NULL, *p_their_blob = NULL;
>> +	unsigned int orig_mode = 0, our_mode = 0, their_mode = 0, ret = 0;
>> +	struct lock_file lock = LOCK_INIT;
>> +
>> +	if (argc != 8)
>> +		usage(builtin_merge_one_file_usage);
>> +
>> +	if (read_cache() < 0)
>> +		die("invalid index");
>> +
>> +	hold_locked_index(&lock, LOCK_DIE_ON_ERROR);
>> +
>> +	if (!get_oid(argv[1], &orig_blob)) {
>> +		p_orig_blob = &orig_blob;
>> +		ret = read_mode("orig", argv[5], &orig_mode);
>> +	}
> 
> argv[1] is defined as "either the object name of the blob in the
> common ancestor, or an empty string".  So you need to distinguish
> three cases here, but you are only catching two.
> 
>  - argv[1] is an empty string; p_orig_blob can legitimately be left
>    NULL.
> 
>  - argv[1] is a valid blob object name.  orig_blob should be
>    populated and p_orig_blob should point at it.
> 
>  - argv[1] is garbage, names a non-blob object, or there is no such
>    object with that name.  Don't we want to catch it as a mistake?
> 
> Also, when argv[1] is an empty string, argv[5] must also be an empty
> string, or we got a wrong input---don't we want to catch it as a
> mistake?
> 
> The third case needs a bit of thought.  For example, if $1 and $2
> are the same and points at a non-existent object, we know we won't
> care because we only care about $3.  In a lazily-cloned repository,
> that may matter---we would not want to fail even if we not have blob
> $1 and $2, as long as they are reasonably spelled a full hexadecimal
> object name.  But we would want to fail if blob object named by $3
> is missing.
> 
> One way to achieve semantics closer to the above than the posted
> patch may be to tighten the parsing.  Instead of using "anything
> goes" get_oid(), use get_oid_hex(), perhaps.
> 
>> +	if (!get_oid(argv[2], &our_blob)) {
>> +		p_our_blob = &our_blob;
>> +		ret = read_mode("our", argv[6], &our_mode);
>> +	}
>> +
>> +	if (!get_oid(argv[3], &their_blob)) {
>> +		p_their_blob = &their_blob;
>> +		ret = read_mode("their", argv[7], &their_mode);
>> +	}
>> +
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = merge_strategies_one_file(the_repository,
>> +					p_orig_blob, p_our_blob, p_their_blob, argv[4],
>> +					orig_mode, our_mode, their_mode);
> 
> That's a funny function name.  It's not like the function will be
> taught different strategy to handle the three-way merge, no?  It
> probably makes sense to name it after what it does, which is "three
> way merge".
> 

Okay.  There's already a function called threeway_merge() in
unpack_trees() that does something different.
merge_strategies_threeway() should be good?

>> +	if (ret) {
>> +		rollback_lock_file(&lock);
>> +		return !!ret;
>> +	}
>> +
>> +	return write_locked_index(&the_index, &lock, COMMIT_LOCK);
>> +}
> 
>> diff --git a/merge-strategies.c b/merge-strategies.c
>> new file mode 100644
>> index 0000000000..bbe6f48698
>> --- /dev/null
>> +++ b/merge-strategies.c
>> @@ -0,0 +1,214 @@
>> +#include "cache.h"
>> +#include "dir.h"
>> +#include "ll-merge.h"
>> +#include "merge-strategies.h"
>> +#include "xdiff-interface.h"
>> +
> 
>> +static int add_to_index_cacheinfo(struct index_state *istate,
>> +				  unsigned int mode,
>> +				  const struct object_id *oid, const char *path)
>> +{
>> +	struct cache_entry *ce;
>> +	int len, option;
>> +
>> +	if (!verify_path(path, mode))
>> +		return error(_("Invalid path '%s'"), path);
>> +
>> +	len = strlen(path);
>> +	ce = make_empty_cache_entry(istate, len);
>> +
>> +	oidcpy(&ce->oid, oid);
>> +	memcpy(ce->name, path, len);
>> +	ce->ce_flags = create_ce_flags(0);
>> +	ce->ce_namelen = len;
>> +	ce->ce_mode = create_ce_mode(mode);
>> +	if (assume_unchanged)
>> +		ce->ce_flags |= CE_VALID;
>> +	option = ADD_CACHE_OK_TO_ADD | ADD_CACHE_OK_TO_REPLACE;
>> +	if (add_index_entry(istate, ce, option))
>> +		return error(_("%s: cannot add to the index"), path);
>> +
>> +	return 0;
>> +}
> 
> The above correctly does 'git update-index --add --cacheinfo "$6"
> "$2" "$4"' but don't copy-and-paste existing code to do so.  Add one
> preliminary patch before everything else in the series to massage
> and extract add_cacheinfo() function out of builtin/update-index.c,
> move it to somewhere common like read-cache.c and so that we can
> call it from here.
> 

Hmm, I’d really like to do this, but I have one remark/question about
it.  In builtin/update-index.c, when add_cache_entry() fails, this
message is printed:

	cannot add to the index - missing --add option?

Obviously, this is not what we want to show in git-merge when
add_index_entry() fails.  But then, verify_path() can also fail, and
will show a sensible message for any situation:

	Invalid path '%s'

Should I return error when verify_path() fails, but eg. -2 in the case
of add_index_entry(), and if this new add_cacheinfo() returns -2 but not
-1, print the correct message?  Or let the caller verify the path so it
cannot fail because of this?

>> +static int checkout_from_index(struct index_state *istate, const char *path)
>> +{
>> +	struct checkout state = CHECKOUT_INIT;
>> +	struct cache_entry *ce;
>> +
>> +	state.istate = istate;
>> +	state.force = 1;
>> +	state.base_dir = "";
>> +	state.base_dir_len = 0;
>> +
>> +	ce = index_file_exists(istate, path, strlen(path), 0);
> 
> This call is unfortunate for the reasons I mention later.
> 
> But if you must have this call, then you need to sanity check what
> you get from index_file_exists().  ce must be a merged cache entry,
> so
> 
> 	if (!ce || ce_stage(ce))
> 		BUG(...);
> 

That’s ok, I managed to remove it following your advice.

>> +	if (checkout_entry(ce, &state, NULL, NULL) < 0)
>> +		return error(_("%s: cannot checkout file"), path);
>> +	return 0;
>> +}
>> +
>> +static int merge_one_file_deleted(struct index_state *istate,
>> +				  const struct object_id *orig_blob,
>> +				  const struct object_id *our_blob,
>> +				  const struct object_id *their_blob, const char *path,
>> +				  unsigned int orig_mode, unsigned int our_mode, unsigned int their_mode)
>> +{
>> +	if ((our_blob && orig_mode != our_mode) ||
>> +	    (their_blob && orig_mode != their_mode))
>> +		return error(_("File %s deleted on one branch but had its "
>> +			       "permissions changed on the other."), path);
>> +
>> +	if (our_blob) {
>> +		printf(_("Removing %s\n"), path);
>> +
>> +		if (file_exists(path))
>> +			remove_path(path);
>> +	}
>> +
>> +	if (remove_file_from_index(istate, path))
>> +		return error("%s: cannot remove from the index", path);
>> +	return 0;
> 
> If the side that did not remove changed the mode, we don't silently
> remove but fail and give a chance to inspect the situation to the
> end user.  If we had the blob and it is removed by them, we give a
> message and only in that case we remove the file from the working
> tree, together with any leading directory that has become empty.
> 
> And after that we make sure that the path is no longer in the
> index.  The function removes entries for the path at all the stages,
> which is exactly what we want.
> 
> OK.
> 
>> +}
>> +
>> +static int do_merge_one_file(struct index_state *istate,
>> +			     const struct object_id *orig_blob,
>> +			     const struct object_id *our_blob,
>> +			     const struct object_id *their_blob, const char *path,
>> +			     unsigned int orig_mode, unsigned int our_mode, unsigned int their_mode)
>> +{
>> +	int ret, i, dest;
>> +	ssize_t written;
>> +	mmbuffer_t result = {NULL, 0};
>> +	mmfile_t mmfs[3];
>> +	struct ll_merge_options merge_opts = {0};
>> +	struct cache_entry *ce;
>> +
>> +	if (our_mode == S_IFLNK || their_mode == S_IFLNK)
>> +		return error(_("%s: Not merging symbolic link changes."), path);
>> +	else if (our_mode == S_IFGITLINK || their_mode == S_IFGITLINK)
>> +		return error(_("%s: Not merging conflicting submodule changes."), path);
>> +
>> +	read_mmblob(mmfs + 1, our_blob);
>> +	read_mmblob(mmfs + 2, their_blob);
>> +
>> +	if (orig_blob) {
>> +		printf(_("Auto-merging %s\n"), path);
>> +		read_mmblob(mmfs + 0, orig_blob);
>> +	} else {
>> +		printf(_("Added %s in both, but differently.\n"), path);
>> +		read_mmblob(mmfs + 0, &null_oid);
>> +	}
>> +
>> +	merge_opts.xdl_opts = XDL_MERGE_ZEALOUS_ALNUM;
>> +	ret = ll_merge(&result, path,
>> +		       mmfs + 0, "orig",
>> +		       mmfs + 1, "our",
>> +		       mmfs + 2, "their",
>> +		       istate, &merge_opts);
> 
> Is it correct to call into ll_merge() here?  The original used to
> call "git merge-file" which called into xdl_merge().  Calling into
> ll_merge() means the path is used to look up the attributes and use
> the custom merge driver, which I am not offhand sure is what we want
> to see at this low level (and if it turns out to be a good idea, we
> definitely should explain the change of semantics in the proposed
> log message for this commit).
> 
>> +	for (i = 0; i < 3; i++)
>> +		free(mmfs[i].ptr);
>> +
>> +	if (ret < 0) {
>> +		free(result.ptr);
>> +		return error(_("Failed to execute internal merge"));
>> +	}
>> +
>> +	/*
>> +	 * Create the working tree file, using "our tree" version from
>> +	 * the index, and then store the result of the merge.
>> +	 */
> 
> The above is copied from the original, to explain what it did after
> the comment, but it does not seem to match what the new code does.
> 
>> +	ce = index_file_exists(istate, path, strlen(path), 0);
>> +	if (!ce)
>> +		BUG("file is not present in the cache?");
>> +
>> +	unlink(path);
>> +	if ((dest = open(path, O_WRONLY | O_CREAT, ce->ce_mode)) < 0) {
>> +		free(result.ptr);
>> +		return error_errno(_("failed to open file '%s'"), path);
>> +	}
>> +
>> +	written = write_in_full(dest, result.ptr, result.size);
>> +	close(dest);
>> +
>> +	free(result.ptr);
>> +
>> +	if (written < 0)
>> +		return error_errno(_("failed to write to '%s'"), path);
>> +
> 
> This open(..., ce->ce_mode) call is way insufficient.
> 
> The comment we have above this part of the code talks about the
> difficulty of doing this correctly in scripted version.  Creating a
> file by 'git checkout-index -f --stage=2 -- "$4"' and reusing it to
> store the merged contents was the cleanest and easiest way without
> having direct access to adjust_shared_perm() to create a working
> tree file with the correct permission bits.
> 
> We are writing in C, so we should be able to do much better than the
> scripted version, as we can later call adjust_shared_perm().
> 

I'm not sure I understand the issue correctly.

Is this because I fetch an entry from the index to have the mode of the
file, instead of using `our_mode'?  So I should move the error handling
of ll_merge()/xdl_merge() and the detection of the permission conflict
before writing in the file, and call open(…, our_mode)?

I'm also not sure why we need adjust_shared_perm() here.

>> +	if (ret != 0 || !orig_blob)
>> +		ret = error(_("content conflict in %s"), path);
>> +	if (our_mode != their_mode)
>> +		return error(_("permission conflict: %o->%o,%o in %s"),
>> +			     orig_mode, our_mode, their_mode, path);
>> +	if (ret)
>> +		return -1;
>> +
>> +	return add_file_to_index(istate, path, 0);
>> +}
>> +
>> +int merge_strategies_one_file(struct repository *r,
>> +			      const struct object_id *orig_blob,
>> +			      const struct object_id *our_blob,
>> +			      const struct object_id *their_blob, const char *path,
>> +			      unsigned int orig_mode, unsigned int our_mode,
>> +			      unsigned int their_mode)
>> +{
> 
> In a long if/else if/else if/.../else cascade, enclose all bodies in
> braces, if any one of them has a multi-statement body, to avoid
> being distracting.
> 
>> +	if (orig_blob &&
>> +	    ((!their_blob && our_blob && oideq(orig_blob, our_blob)) ||
>> +	     (!our_blob && their_blob && oideq(orig_blob, their_blob))))
>> +		/* Deleted in both or deleted in one and unchanged in the other. */
>> +		return merge_one_file_deleted(r->index,
>> +					      orig_blob, our_blob, their_blob, path,
>> +					      orig_mode, our_mode, their_mode);
> 
> OK, we've already reviewed that function.
> 
>> +	else if (!orig_blob && our_blob && !their_blob) {
>> +		/*
>> +		 * Added in one.  The other side did not add and we
>> +		 * added so there is nothing to be done, except making
>> +		 * the path merged.
>> +		 */
>> +		return add_to_index_cacheinfo(r->index, our_mode, our_blob, path);
> 
> OK, we've already reviewed that function.
> 
>> +	} else if (!orig_blob && !our_blob && their_blob) {
>> +		printf(_("Adding %s\n"), path);
>> +
>> +		if (file_exists(path))
>> +			return error(_("untracked %s is overwritten by the merge."), path);
>> +
>> +		if (add_to_index_cacheinfo(r->index, their_mode, their_blob, path))
>> +			return -1;
>> +		return checkout_from_index(r->index, path);
> 
> You did "add_to_index_cacheinfo()", so you MUST know which ce is to
> be checked out.
> 
> Consider if it is worth to teach add_to_index_cacheinfo() to give
> you ce back and pass it to checkout_from_index(); that way, you do
> not have to call index_file_exists() based on path in the function.
> 

OK, this is doable.

>> +	} else if (!orig_blob && our_blob && their_blob &&
>> +		   oideq(our_blob, their_blob)) {
>> +		/* Added in both, identically (check for same permissions). */
>> +		if (our_mode != their_mode)
>> +			return error(_("File %s added identically in both branches, "
>> +				       "but permissions conflict %o->%o."),
>> +				     path, our_mode, their_mode);
>> +
>> +		printf(_("Adding %s\n"), path);
>> +
>> +		if (add_to_index_cacheinfo(r->index, our_mode, our_blob, path))
>> +			return -1;
>> +		return checkout_from_index(r->index, path);
> 
> Ditto.
> 
>> +	} else if (our_blob && their_blob)
>> +		/* Modified in both, but differently. */
>> +		return do_merge_one_file(r->index,
>> +					 orig_blob, our_blob, their_blob, path,
>> +					 orig_mode, our_mode, their_mode);
>> +	else {
>> +		char orig_hex[GIT_MAX_HEXSZ] = {0}, our_hex[GIT_MAX_HEXSZ] = {0},
>> +			their_hex[GIT_MAX_HEXSZ] = {0};
>> +
>> +		if (orig_blob)
>> +			oid_to_hex_r(orig_hex, orig_blob);
>> +		if (our_blob)
>> +			oid_to_hex_r(our_hex, our_blob);
>> +		if (their_blob)
>> +			oid_to_hex_r(their_hex, their_blob);
>> +
>> +		return error(_("%s: Not handling case %s -> %s -> %s"),
>> +			     path, orig_hex, our_hex, their_hex);
>> +	}
>> +
>> +	return 0;
>> +}
> 
> I can see that this does go in the right direction.  With a bit more
> attention to details it would soon be production-ready quality.
> 
> Thanks.
> 

Thank you,
Alban