Re: Git Rename Detection Bug

Elijah Newren <newren@xxxxxxxxx> · Fri, 10 Nov 2023 21:46:00 -0800

Hi Jeremy,

On Fri, Nov 10, 2023 at 3:28 AM Jeremy Pridmore <jpridmore@xxxxxxxxx> wrote:
>
> Hi Elijah,
>
> Many thanks for your reply, the detail is much appreciated.  I was aware, from recently read articles, that git doesn't record renames as such, hence my investigations into the rename detection, but I also found some interesting points in your email, such as the "git status --no-renames" flag.

The fact that you were trying to "undo" renames and "redo the correct
ones" suggested there's something you still didn't understand about
rename detection, though.  The fact that you were worried that "git
status" showing the "wrong renames" and the implication that it needed
to show the right ones before you committed also suggested there's
something that was still not being understood.  Likewise, the fact
that the renames reported by "git status" change even when you haven't
renamed files further but have simply made additional changes to the
contents of some files suggests there is something that was still not
being understood about the phrase that "renames are detected rather
than recorded".

While renames are used in the merge algorithm in order to know what
files to match up for three-way content merges (so that changes from
both sides to the "same" file can be incorporated into the end
result), once the merge stops to ask you to resolve conflicts, the
detection of renames doesn't matter beyond that point.  The fact that
renames aren't recorded means whatever renames are shown is only there
as a guide to help you understand.  All that matters to Git is that
all files have the intended content.  If some of the files have the
wrong content, then by all means go and correct it.  If "git mv"
commands help you do that, great.  If simply editing all files of
interest (including adding and deleting files) until they match the
expected contents works, that's fine too.  Once all files have the
correct content, commit it.  Git will have no way whatsoever of
knowing which of those two routes you picked, and won't behave any
differently in the future based on which way you ended up with the
right contents in each file.  You could have even done the extreme of
"git merge -s ours --no-commit ${OTHER_BRANCH}" (merge the other
branch but completely ignore *every* change the other side made and
then stop for user to make further changes), followed by opening and
editing every relevant file to include changes made by the other side,
deleting and adding files as relevant, to end up with the right
contents in all files and then committed, and Git wouldn't know the
difference and no one who pulled your merge commit would be able to
tell the difference either.

> I think the issue I'm encountering is described by what you say here:
> "Exact renames are detected first, before any other method of rename
> detection is employed, and other than giving a preference to files
> with the same basename, if there are multiple choices it just picks
> one (what I'd call at random, though technically based on what the
> internal processing order happens to be)"
>
> That is close to the behaviour I'm seeing.  As I mentioned, git seems to think a file has been deleted and then as it continues to detect renames, it's as if it is going through lists of "Local-Base" and "Base-Remote" changes trying to match them up, but the directories of the files being matched are "offset", as highlighted by this list of mismatches:
>
> (I'd put the paths in a table for easier analysis, but for some reason the emails need to be plain text)
> > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> > Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> > Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> > Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> > Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> > Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
>
> Given git compares both the content and the directory\filenames,

For exact renames, it only will look at filenames if there are
multiple 100% matches.  If one match is 100%, and the other 99.9999%
match, the 100% match is taken.

(Since we think that exact renames seem to be describing your problem,
there's no point discussing the inexact rename handling.)

> and as the repositories have unrelated histories,

What do you mean by this?  If there's no common point in history (i.e.
no merge base), rename detection doesn't even get invoked, so you must
mean something different by this phrase than what I would normally
take it to mean.

Any chance, if you're still in the middle of the merge, that you could run
   git merge-base --all HEAD MERGE_HEAD
and report the output?  It'll be commit hashes that don't mean a lot
to anyone who doesn't have your repository, but I'm interested in how
many commit hashes it responds with (0, 1, or more than 1).

> the "Base" file is going to be empty, therefore, even if Local and Remote are identical, they are both 100% different to Base.

I assume by "Base" you are referring to the commit in history common
to both branches being merged (i.e. what we call the "merge base"), or
to the relevant file from that commit.  Is that right?

Even if I'm right about that, this comment has me lost.  Could you
clarify it, with one particular example?  For example (I'm making
stuff up since I'm not familiar with your repo, but showing how to
clarify for a given set of path(s)):

   * In this repository, with the renames detected above,
Landscape/Main/somefile.licx is an empty file in the merge base.
   * On my local side, I renamed Landscape/Main/somefile.licx to
Landscape/src/Main/somefile.licx and populated it with some content
(with hash A).  There is also another new file on the local side (at
least new relative to the merge base) named
Landscape/src/Other/somefile.licx that happens to have hash A.
   * On the remote side, Landscape/Main/somefile.licx was left in
place but populated with some content (with hash A).
   * Git is detecting the rename as Landscape/Main/somefile.licx ->
Landscape/src/Other/somefile.licx, when I wanted it to detect a rename
to Landscape/src/Main/somefile.licx.

I'm pretty sure this example is not what you're seeing, even if
components of it are, because the empty file thing is impossible with
the rest of the story.

>  That given, I'm not sure why git would state that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx and Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 are a "both added" conflict given their file names and paths are completely different.  Any ideas?

The fact that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
is marked as "both added" means that this file
   * did not exist in the merge base
   * did exist on your local side
   * did exist on the remote side
   * the version of this file on the local and remote sides do not match
Basically, both sides added a new file and they are not the same, so
you have a conflict.

The answer for Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1,
is an identical set of bullet points; it was a file added by both
sides of history that did not exist in the merge base, and the sides
are different so you have a conflict in this file too.

> I wrote a script to resolve the conflicts best I can which categorises the files into sets according to the file status (i.e. "added by them", "added by us" etc), and then either does a "git checkout head --
<file>" or a "git rm <file>" based upon which set the file is in and
whether it is in another set or not.  This has worked really well and
helped me through the large changeset with 3k conflicts.

So, you're simply throwing away the changes made by the remote side?
I mean, that's one way to merge, and it might be right in your case,
but to someone unfamiliar with your repo it smells like a hack to just
ignore conflicts and throw away other people's changes in order to
complete the merge.

> As git only needs to try and match files in the "deleted by us" and "deleted by them" sets (although including the "deleted in both" set would allow matching renames/moves on both sides),

"deleted by us" and "deleted by them" means no rename was detected for
the file (at least on the side that the delete is reported for).  So,
a "deleted in both" only happens when neither side detects a rename,
and if the file isn't renamed on either side and both removed it, then
there's no conflict -- just delete the file.

>  an idea for a potential improvement to the matching algorithm (where you say there's a comment "too many alternatives, pick one") could be to compute a "difference value" for the path\filename of those files in one of the other sets (i.e. "added by us", "added by them" or "added in both"), and chose a potential rename based upon the smallest calculated difference.  The difference value would be the number of differences in folder names, e.g.
>
> deleted in both: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
>
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> (path\name difference = 2)
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> (path\name difference = 1)
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> (path\name difference = 2)
>
> So, given the above, git would chose the second "added in both" entry.

Three problems here:

* If git were to report "deleted in both" for one path and "added in
both" in another as you suggest, that would only be because the files
are dissimilar.  Not only would they not be an exact match, their
contents would have less than 50% similarity.  Thus augmenting exact
matching like this just wouldn't work, because the files aren't
matches at all.  The only correct thing to do would be to not report
the "deleted in both" because that file is not a rename of anything
else and has simply been deleted by both sides.

* filename similarity is extraordinarily expensive compared to exact
renames, and if not carefully handled, can sometimes rival the cost of
file content similarity computations given our spanhash
representations.  Exact renames are tasked with finding renames even
if they are known to not be relevant, simply because exact renames can
do so very quickly.  If we change that, we throw a monkey wrench in
our performance handling elsewhere and have to rethink a number of
other things.

* While I was optimizing rename detection while investigating the new
merge backend, I actually attempted a few versions of filename
similarity looking for something that was predictive and useful.
While I think the idea was potentially helpful for some repositories,
it has a significant risk of hurting merges in other repositories.
While what I tried was far from an exhaustive checking of all filename
similarity ideas, I came away doubting there was a useful heuristic
other than exact matches of basenames (i.e. exact matches of
everything in the filename after the final slash).  If someone else
wants to try more ideas, and do a study on various existing
repositories, they can go ahead, but I suspect most work here is going
to end up at a dead end and I'm unwilling to put further time of my
own into it.

> Food for thought?  Happy to discuss the idea further.

So, I've occasionally seen repositories that have something like the following:

  * base version: directory named library-x-1.7/
  * stable branch: many changes to files under library-x-1.7/
  * development branch: library-x-1.7/ no longer exists.  However,
library-x-1.8/ and library-x-1.9/ both do.  Both are obviously
"similar" to library-x-1.7/ but both have many changes.

What happens when someone tries to merge the stable branch into the
development branch?  There are two obvious guesses:

  * Changes from library-x-1.7/ on the stable branch are applied to
library-x-1.8/
  * Changes from library-x-1.7/ on the stable branch are applied to
library-x-1.9/

Either answer can be suboptimal depending on your viewpoint.
(Applying the changes to both directories would also have other
suboptimal effects even if it might sound right based for this exact
problem as I've worded it.  But git doesn't do copy detection as part
of merges so Git won't choose this third choice ever.)  So, which of
those two happens?  Well, since renames are detected based upon file
similarity, the changes will go to whatever file is most similar.
What does that mean?  It means that both answers above are wrong.
Instead:

  * Some of the changes from library-x-1.7/ on the stable branch are
applied to files from library-x-1.8/, while others are applied to
files from library-x-1.9/, and to determine which files from which
directory are matched up is an individual file choice based on which
file in library-x-1.8/ or library-x-1.9/ is most similar to the file
from the base version in library-x-1.7/.

This answer is clearly worse than either of the two above, and is
virtually never what people would want.  But it's also fundamental to
the idea of matching up files and detecting renames individually based
upon file similarity.  It's part of both the old and new merge
backends in Git, because both were based upon this fundamental idea.

So, if you have this kind of situation, or even something like it
where files from one old directory could match files from multiple
other directories, it's just something you have to be aware of.

All that said, here's something that might help:

*** Hack to workaround rename detection in special cases where there
are directories of multiple possible matches ***

1. Get back to a clean slate from before the merge.

$ git merge --abort
OR
$ cd ${OTHER_DIRECTORY} && git clone ${url} && cd ${REPONAME} && git
checkout ${relevant_branch}

2. Temporarily undo your local renames and make a temporary commit

$ git mv Landscape/src/* Landscape/
$ git commit -m "TEMPORARY COMMIT"

3. Perform the merge (files will be in Landscape/ instead of
Landscape/src/ for now).  Don't worry, we'll fix the merge commit
later.

$ git merge ${REMOTE_BRANCH}
[...fix up any conflicts and commit, if needed...]

4. Rename Landscape/ back to Landscape/src/ and make another (temporary) commit.

5. Create a corrected merge commit with the current tree, the commit
message from your merge commit, and the correct parents:
$ git commit-tree -p HEAD~3 -p HEAD~1^2 -F $(git log -1 --format=%B
HEAD~1) HEAD^{tree}
[...the above command will print out a new commit id for a corrected
merge commit.  You can inspect it first, but we just need to pass this
to reset --hard...]

6. Reset your branch to this corrected merge commit (which will orphan
the temporary commits from steps 2, 3, and 4 so they can later be
garbage collected)
$ git reset --hard [...output of commit-tree command...]

Hope that helps,
Elijah