Re: [RFC] Git rerere and non-conflicting changes during conflict resolution

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 26 Jul 2017 00:14:10 -0700

Jeff King <peff@xxxxxxxx> writes:

> From the user's perspective, calling X "rerere" would probably be OK[1].
> But from an implementation perspective (and to keep the existing
> plumbing available and unchanged), it probably makes sense to call it
> something else, and have it run both rerere and a new plumbing command
> to do the merge-fix work (or call it nothing, and assume that users will
> either touch the plumbing directly or will use "git merge" to trigger
> both).
> ...
> I think it should be its own plumbing tool that merge calls alongside
> rerere. ;)

As long as we use the database keyed with <A,B> and take the merge
base into account, "git am" and "git cherry-pick" would not be able
to use the merge-fix machinery, so in that sense, calling X "rerere"
would not be OK, but I agree with your general sentiment about the
UI visible to the end users.  Just like "rerere" started with a
small step to avoid automating things too much and then later became
almost invisible for normal cases because we managed to automate it
reasonably well and integrate it into mergy operations, we may be
able to do the same for merge-fix machinery.  My "this belongs to
'merge'" is primarily coming from it---it might be reusable in other
mergy operations with some fundamental changes, but I envision that
the primarly and only user of that X would initially be 'merge'.

> Not having thought too hard about it yet, this containing relationship
> seems like the right direction. I guess you'd do the lookup by computing
> the merge-base M of <X,Y> (which we already know anyway), walking M..X
> and looking for any entries which mention those commits (in either A or
> B slots of the entry), and then similarly narrowing it according to
> M..Y.

Yes, every time I look at the Reintegrate script, I try to think of
an efficient implementation but do not find anything better than the
left-right walk over X...Y range, so that is my conclusion after
having thought about it very hard as well ;-)

> What if instead of commit hashes we used patch ids?

Now you may be onto something.  While we aim at the ultimately automated
ideal that would catch the maximal cases, for the earlier 'xyzzy
turns into frotz' example, the minimal cue to identify one side of
the pair that keys the "change this new instance of xyzzy into
frotz, too" merge-fix is a hunk like this:

    -const char *xyzzy;
    +const char *frotz;

It does not matter what other changes also appear in the same
commit, and my original "branch name" is way too broad, and my
previous "ideal" which is "a single problematic commit" is still
broader than necessary.  Well, patch-id identifies the entire change
contained in a single commit, so it is still too broad, but if we
can narrow it down to a single hunk like that, perhaps we can use it
as one side of the key.

And the other side of the key is naturally a hunk like this:

    +	printf("%s\n", xyzzy);

i.e. a new use of xyzzy appears where it didn't exist before.  And
when we merge two branches, one of which has one half of the key
(i.e. "const char *xyzzy;" turned into "const char *frotz"), and the
other has the other half of the key (i.e. "printf xyzzy" is added),
then we'd apply a patch that tells us to do this:

    -	printf("%s\n", xyzzy);
    +	printf("%s\n", frotz);

i.e. that patch would be the value in the database keyed by the pair
of two previous hunks.

> I think it's asking a lot for users to handle the textual conflicts and
> semantic ones separately. It would be nice if we could tell them apart
> automatically (and I think we can based on what isn't part of the
> conflict resolution).

After thinking about this a bit more, I realized that it is possible
to mechanically sift a human generated resolution that has both
textual conflict resolution (which will be handled by "rerere") and
semantic one (which needs the merge-fix machinery) into two, without
requiring the user to make two separate commits.

The key trick is that "rerere" is capable of recording and replaying
semantic conflict resolution made to a file that happens to have
textual conflict just fine.  Because rerere database records the
preimage (i.e. with conflict markers) and postimage (i.e. the final
resolution for the file) as an entire file contents, it can do 3-way
merge for parts of the same file that is away from any conflicted
region.

If you tell contrib/rerere-train.sh to learn from "pu^{/^Merge
branch 'bp/fsmonitor' into pu}", you'll see what I mean.

  $ git show "pu^{/^Merge branch 'bp/fsmonitor' into pu}" compat/bswap.h

shows the result of such resolution.  Early part of the combined
diff shown by the above command comes from textual conflict
resolution, but there is a new implementation of inline version of
get_be64 that has become necessary but did not exist in either of
the branches getting merged, which is shown as an evil merge.

So the way to automatically sift an existing merge into textual
conflict resolution (i.e. automatable with "rerere") and semantic
evil merge-fix is to first try to recreate the merge and enumerate
the paths that get conflicts.  Then resolve these paths by grabbing
the resolved result for these paths that get textual conflicts out
of the original merge.  That gives us the textual part (we can run
"git rerere" at this point to store the resolution away to the
rerere database---which is essentially what contrib/rerere-train
does).

There will be difference between the result of the above and the
original merge commit.  The paths that are different are all outside
these conflicted paths, and they are what we want in the second
commit, i.e. the semantic evil merge-fix, which will be the value
for the merge-fix database.

So that part is easily automatable.  

It is much harder to come up with a way to index into the merge-fix
database.  We can "punt" and use the same index scheme as the
Reintegrate script (i.e. key it with the name of the branch being
merged, which can be read from the original merge) as the initial
approximation, and that would already be much better than what
Reintegrate script currently does, in that the recording part is
much more automated (in the workflow I use the Reintegrate script,
the side that records a merge-fix initially is entirely manual).