Hi Junio, On 07/10/2020 00:01, Junio C Hamano wrote: > Alban Gruin <alban.gruin@xxxxxxxxx> writes: > >> This rewrites `git merge-one-file' from shell to C. This port is not >> completely straightforward: to save precious cycles by avoiding reading >> and flushing the index repeatedly, write temporary files when an >> operation can be performed in-memory, or allow other function to use the >> rewrite without forking nor worrying about the index,... > > So, the in-core index is still used, but when the contents of the in-core > index does not have to be written out disk, we just don't? Makes sense. > >> diff --git a/builtin/merge-one-file.c b/builtin/merge-one-file.c >> new file mode 100644 >> index 0000000000..598338ba16 >> --- /dev/null >> +++ b/builtin/merge-one-file.c >> @@ -0,0 +1,92 @@ >> +/* >> + * Builtin "git merge-one-file" >> + * >> + * Copyright (c) 2020 Alban Gruin >> + * >> + * Based on git-merge-one-file.sh, written by Linus Torvalds. >> + * >> + * This is the git per-file merge utility, called with >> + * >> + * argv[1] - original file SHA1 (or empty) >> + * argv[2] - file in branch1 SHA1 (or empty) >> + * argv[3] - file in branch2 SHA1 (or empty) > > Let's modernize this comment while we are at it. > > SHA1 -> "object name" (or "blob object name") > >> + * argv[4] - pathname in repository >> + * argv[5] - original file mode (or empty) >> + * argv[6] - file in branch1 mode (or empty) >> + * argv[7] - file in branch2 mode (or empty) >> + * >> + * Handle some trivial cases. The _really_ trivial cases have been >> + * handled already by git read-tree, but that one doesn't do any merges >> + * that might change the tree layout. >> + */ >> + >> +#define USE_THE_INDEX_COMPATIBILITY_MACROS >> +#include "cache.h" >> +#include "builtin.h" >> +#include "lockfile.h" >> +#include "merge-strategies.h" >> + >> +static const char builtin_merge_one_file_usage[] = >> + "git merge-one-file <orig blob> <our blob> <their blob> <path> " >> + "<orig mode> <our mode> <their mode>\n\n" >> + "Blob ids and modes should be empty for missing files."; >> + >> +static int read_mode(const char *name, const char *arg, unsigned int *mode) >> +{ >> + char *last; >> + int ret = 0; >> + >> + *mode = strtol(arg, &last, 8); >> + >> + if (*last) >> + ret = error(_("invalid '%s' mode: expected nothing, got '%c'"), name, *last); >> + else if (!(S_ISREG(*mode) || S_ISDIR(*mode) || S_ISLNK(*mode))) >> + ret = error(_("invalid '%s' mode: %o"), name, *mode); >> + >> + return ret; >> +} >> + >> +int cmd_merge_one_file(int argc, const char **argv, const char *prefix) >> +{ >> + struct object_id orig_blob, our_blob, their_blob, >> + *p_orig_blob = NULL, *p_our_blob = NULL, *p_their_blob = NULL; >> + unsigned int orig_mode = 0, our_mode = 0, their_mode = 0, ret = 0; >> + struct lock_file lock = LOCK_INIT; >> + >> + if (argc != 8) >> + usage(builtin_merge_one_file_usage); >> + >> + if (read_cache() < 0) >> + die("invalid index"); >> + >> + hold_locked_index(&lock, LOCK_DIE_ON_ERROR); >> + >> + if (!get_oid(argv[1], &orig_blob)) { >> + p_orig_blob = &orig_blob; >> + ret = read_mode("orig", argv[5], &orig_mode); >> + } > > argv[1] is defined as "either the object name of the blob in the > common ancestor, or an empty string". So you need to distinguish > three cases here, but you are only catching two. > > - argv[1] is an empty string; p_orig_blob can legitimately be left > NULL. > > - argv[1] is a valid blob object name. orig_blob should be > populated and p_orig_blob should point at it. > > - argv[1] is garbage, names a non-blob object, or there is no such > object with that name. Don't we want to catch it as a mistake? > > Also, when argv[1] is an empty string, argv[5] must also be an empty > string, or we got a wrong input---don't we want to catch it as a > mistake? > > The third case needs a bit of thought. For example, if $1 and $2 > are the same and points at a non-existent object, we know we won't > care because we only care about $3. In a lazily-cloned repository, > that may matter---we would not want to fail even if we not have blob > $1 and $2, as long as they are reasonably spelled a full hexadecimal > object name. But we would want to fail if blob object named by $3 > is missing. > > One way to achieve semantics closer to the above than the posted > patch may be to tighten the parsing. Instead of using "anything > goes" get_oid(), use get_oid_hex(), perhaps. > >> + if (!get_oid(argv[2], &our_blob)) { >> + p_our_blob = &our_blob; >> + ret = read_mode("our", argv[6], &our_mode); >> + } >> + >> + if (!get_oid(argv[3], &their_blob)) { >> + p_their_blob = &their_blob; >> + ret = read_mode("their", argv[7], &their_mode); >> + } >> + >> + if (ret) >> + return ret; >> + >> + ret = merge_strategies_one_file(the_repository, >> + p_orig_blob, p_our_blob, p_their_blob, argv[4], >> + orig_mode, our_mode, their_mode); > > That's a funny function name. It's not like the function will be > taught different strategy to handle the three-way merge, no? It > probably makes sense to name it after what it does, which is "three > way merge". > Okay. There's already a function called threeway_merge() in unpack_trees() that does something different. merge_strategies_threeway() should be good? >> + if (ret) { >> + rollback_lock_file(&lock); >> + return !!ret; >> + } >> + >> + return write_locked_index(&the_index, &lock, COMMIT_LOCK); >> +} > >> diff --git a/merge-strategies.c b/merge-strategies.c >> new file mode 100644 >> index 0000000000..bbe6f48698 >> --- /dev/null >> +++ b/merge-strategies.c >> @@ -0,0 +1,214 @@ >> +#include "cache.h" >> +#include "dir.h" >> +#include "ll-merge.h" >> +#include "merge-strategies.h" >> +#include "xdiff-interface.h" >> + > >> +static int add_to_index_cacheinfo(struct index_state *istate, >> + unsigned int mode, >> + const struct object_id *oid, const char *path) >> +{ >> + struct cache_entry *ce; >> + int len, option; >> + >> + if (!verify_path(path, mode)) >> + return error(_("Invalid path '%s'"), path); >> + >> + len = strlen(path); >> + ce = make_empty_cache_entry(istate, len); >> + >> + oidcpy(&ce->oid, oid); >> + memcpy(ce->name, path, len); >> + ce->ce_flags = create_ce_flags(0); >> + ce->ce_namelen = len; >> + ce->ce_mode = create_ce_mode(mode); >> + if (assume_unchanged) >> + ce->ce_flags |= CE_VALID; >> + option = ADD_CACHE_OK_TO_ADD | ADD_CACHE_OK_TO_REPLACE; >> + if (add_index_entry(istate, ce, option)) >> + return error(_("%s: cannot add to the index"), path); >> + >> + return 0; >> +} > > The above correctly does 'git update-index --add --cacheinfo "$6" > "$2" "$4"' but don't copy-and-paste existing code to do so. Add one > preliminary patch before everything else in the series to massage > and extract add_cacheinfo() function out of builtin/update-index.c, > move it to somewhere common like read-cache.c and so that we can > call it from here. > Hmm, I’d really like to do this, but I have one remark/question about it. In builtin/update-index.c, when add_cache_entry() fails, this message is printed: cannot add to the index - missing --add option? Obviously, this is not what we want to show in git-merge when add_index_entry() fails. But then, verify_path() can also fail, and will show a sensible message for any situation: Invalid path '%s' Should I return error when verify_path() fails, but eg. -2 in the case of add_index_entry(), and if this new add_cacheinfo() returns -2 but not -1, print the correct message? Or let the caller verify the path so it cannot fail because of this? >> +static int checkout_from_index(struct index_state *istate, const char *path) >> +{ >> + struct checkout state = CHECKOUT_INIT; >> + struct cache_entry *ce; >> + >> + state.istate = istate; >> + state.force = 1; >> + state.base_dir = ""; >> + state.base_dir_len = 0; >> + >> + ce = index_file_exists(istate, path, strlen(path), 0); > > This call is unfortunate for the reasons I mention later. > > But if you must have this call, then you need to sanity check what > you get from index_file_exists(). ce must be a merged cache entry, > so > > if (!ce || ce_stage(ce)) > BUG(...); > That’s ok, I managed to remove it following your advice. >> + if (checkout_entry(ce, &state, NULL, NULL) < 0) >> + return error(_("%s: cannot checkout file"), path); >> + return 0; >> +} >> + >> +static int merge_one_file_deleted(struct index_state *istate, >> + const struct object_id *orig_blob, >> + const struct object_id *our_blob, >> + const struct object_id *their_blob, const char *path, >> + unsigned int orig_mode, unsigned int our_mode, unsigned int their_mode) >> +{ >> + if ((our_blob && orig_mode != our_mode) || >> + (their_blob && orig_mode != their_mode)) >> + return error(_("File %s deleted on one branch but had its " >> + "permissions changed on the other."), path); >> + >> + if (our_blob) { >> + printf(_("Removing %s\n"), path); >> + >> + if (file_exists(path)) >> + remove_path(path); >> + } >> + >> + if (remove_file_from_index(istate, path)) >> + return error("%s: cannot remove from the index", path); >> + return 0; > > If the side that did not remove changed the mode, we don't silently > remove but fail and give a chance to inspect the situation to the > end user. If we had the blob and it is removed by them, we give a > message and only in that case we remove the file from the working > tree, together with any leading directory that has become empty. > > And after that we make sure that the path is no longer in the > index. The function removes entries for the path at all the stages, > which is exactly what we want. > > OK. > >> +} >> + >> +static int do_merge_one_file(struct index_state *istate, >> + const struct object_id *orig_blob, >> + const struct object_id *our_blob, >> + const struct object_id *their_blob, const char *path, >> + unsigned int orig_mode, unsigned int our_mode, unsigned int their_mode) >> +{ >> + int ret, i, dest; >> + ssize_t written; >> + mmbuffer_t result = {NULL, 0}; >> + mmfile_t mmfs[3]; >> + struct ll_merge_options merge_opts = {0}; >> + struct cache_entry *ce; >> + >> + if (our_mode == S_IFLNK || their_mode == S_IFLNK) >> + return error(_("%s: Not merging symbolic link changes."), path); >> + else if (our_mode == S_IFGITLINK || their_mode == S_IFGITLINK) >> + return error(_("%s: Not merging conflicting submodule changes."), path); >> + >> + read_mmblob(mmfs + 1, our_blob); >> + read_mmblob(mmfs + 2, their_blob); >> + >> + if (orig_blob) { >> + printf(_("Auto-merging %s\n"), path); >> + read_mmblob(mmfs + 0, orig_blob); >> + } else { >> + printf(_("Added %s in both, but differently.\n"), path); >> + read_mmblob(mmfs + 0, &null_oid); >> + } >> + >> + merge_opts.xdl_opts = XDL_MERGE_ZEALOUS_ALNUM; >> + ret = ll_merge(&result, path, >> + mmfs + 0, "orig", >> + mmfs + 1, "our", >> + mmfs + 2, "their", >> + istate, &merge_opts); > > Is it correct to call into ll_merge() here? The original used to > call "git merge-file" which called into xdl_merge(). Calling into > ll_merge() means the path is used to look up the attributes and use > the custom merge driver, which I am not offhand sure is what we want > to see at this low level (and if it turns out to be a good idea, we > definitely should explain the change of semantics in the proposed > log message for this commit). > >> + for (i = 0; i < 3; i++) >> + free(mmfs[i].ptr); >> + >> + if (ret < 0) { >> + free(result.ptr); >> + return error(_("Failed to execute internal merge")); >> + } >> + >> + /* >> + * Create the working tree file, using "our tree" version from >> + * the index, and then store the result of the merge. >> + */ > > The above is copied from the original, to explain what it did after > the comment, but it does not seem to match what the new code does. > >> + ce = index_file_exists(istate, path, strlen(path), 0); >> + if (!ce) >> + BUG("file is not present in the cache?"); >> + >> + unlink(path); >> + if ((dest = open(path, O_WRONLY | O_CREAT, ce->ce_mode)) < 0) { >> + free(result.ptr); >> + return error_errno(_("failed to open file '%s'"), path); >> + } >> + >> + written = write_in_full(dest, result.ptr, result.size); >> + close(dest); >> + >> + free(result.ptr); >> + >> + if (written < 0) >> + return error_errno(_("failed to write to '%s'"), path); >> + > > This open(..., ce->ce_mode) call is way insufficient. > > The comment we have above this part of the code talks about the > difficulty of doing this correctly in scripted version. Creating a > file by 'git checkout-index -f --stage=2 -- "$4"' and reusing it to > store the merged contents was the cleanest and easiest way without > having direct access to adjust_shared_perm() to create a working > tree file with the correct permission bits. > > We are writing in C, so we should be able to do much better than the > scripted version, as we can later call adjust_shared_perm(). > I'm not sure I understand the issue correctly. Is this because I fetch an entry from the index to have the mode of the file, instead of using `our_mode'? So I should move the error handling of ll_merge()/xdl_merge() and the detection of the permission conflict before writing in the file, and call open(…, our_mode)? I'm also not sure why we need adjust_shared_perm() here. >> + if (ret != 0 || !orig_blob) >> + ret = error(_("content conflict in %s"), path); >> + if (our_mode != their_mode) >> + return error(_("permission conflict: %o->%o,%o in %s"), >> + orig_mode, our_mode, their_mode, path); >> + if (ret) >> + return -1; >> + >> + return add_file_to_index(istate, path, 0); >> +} >> + >> +int merge_strategies_one_file(struct repository *r, >> + const struct object_id *orig_blob, >> + const struct object_id *our_blob, >> + const struct object_id *their_blob, const char *path, >> + unsigned int orig_mode, unsigned int our_mode, >> + unsigned int their_mode) >> +{ > > In a long if/else if/else if/.../else cascade, enclose all bodies in > braces, if any one of them has a multi-statement body, to avoid > being distracting. > >> + if (orig_blob && >> + ((!their_blob && our_blob && oideq(orig_blob, our_blob)) || >> + (!our_blob && their_blob && oideq(orig_blob, their_blob)))) >> + /* Deleted in both or deleted in one and unchanged in the other. */ >> + return merge_one_file_deleted(r->index, >> + orig_blob, our_blob, their_blob, path, >> + orig_mode, our_mode, their_mode); > > OK, we've already reviewed that function. > >> + else if (!orig_blob && our_blob && !their_blob) { >> + /* >> + * Added in one. The other side did not add and we >> + * added so there is nothing to be done, except making >> + * the path merged. >> + */ >> + return add_to_index_cacheinfo(r->index, our_mode, our_blob, path); > > OK, we've already reviewed that function. > >> + } else if (!orig_blob && !our_blob && their_blob) { >> + printf(_("Adding %s\n"), path); >> + >> + if (file_exists(path)) >> + return error(_("untracked %s is overwritten by the merge."), path); >> + >> + if (add_to_index_cacheinfo(r->index, their_mode, their_blob, path)) >> + return -1; >> + return checkout_from_index(r->index, path); > > You did "add_to_index_cacheinfo()", so you MUST know which ce is to > be checked out. > > Consider if it is worth to teach add_to_index_cacheinfo() to give > you ce back and pass it to checkout_from_index(); that way, you do > not have to call index_file_exists() based on path in the function. > OK, this is doable. >> + } else if (!orig_blob && our_blob && their_blob && >> + oideq(our_blob, their_blob)) { >> + /* Added in both, identically (check for same permissions). */ >> + if (our_mode != their_mode) >> + return error(_("File %s added identically in both branches, " >> + "but permissions conflict %o->%o."), >> + path, our_mode, their_mode); >> + >> + printf(_("Adding %s\n"), path); >> + >> + if (add_to_index_cacheinfo(r->index, our_mode, our_blob, path)) >> + return -1; >> + return checkout_from_index(r->index, path); > > Ditto. > >> + } else if (our_blob && their_blob) >> + /* Modified in both, but differently. */ >> + return do_merge_one_file(r->index, >> + orig_blob, our_blob, their_blob, path, >> + orig_mode, our_mode, their_mode); >> + else { >> + char orig_hex[GIT_MAX_HEXSZ] = {0}, our_hex[GIT_MAX_HEXSZ] = {0}, >> + their_hex[GIT_MAX_HEXSZ] = {0}; >> + >> + if (orig_blob) >> + oid_to_hex_r(orig_hex, orig_blob); >> + if (our_blob) >> + oid_to_hex_r(our_hex, our_blob); >> + if (their_blob) >> + oid_to_hex_r(their_hex, their_blob); >> + >> + return error(_("%s: Not handling case %s -> %s -> %s"), >> + path, orig_hex, our_hex, their_hex); >> + } >> + >> + return 0; >> +} > > I can see that this does go in the right direction. With a bit more > attention to details it would soon be production-ready quality. > > Thanks. > Thank you, Alban