Re: Git Rename Detection Bug

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Elijah,

Many thanks for your reply, the detail is much appreciated.  I was aware, from recently read articles, that git doesn't record renames as such, hence my investigations into the rename detection, but I also found some interesting points in your email, such as the "git status --no-renames" flag.

I think the issue I'm encountering is described by what you say here:
"Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be)"

That is close to the behaviour I'm seeing.  As I mentioned, git seems to think a file has been deleted and then as it continues to detect renames, it's as if it is going through lists of "Local-Base" and "Base-Remote" changes trying to match them up, but the directories of the files being matched are "offset", as highlighted by this list of mismatches:

(I'd put the paths in a table for easier analysis, but for some reason the emails need to be plain text)
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx

Given git compares both the content and the directory\filenames, and as the repositories have unrelated histories, the "Base" file is going to be empty, therefore, even if Local and Remote are identical, they are both 100% different to Base.  That given, I'm not sure why git would state that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx and Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 are a "both added" conflict given their file names and paths are completely different.  Any ideas?

I wrote a script to resolve the conflicts best I can which categorises the files into sets according to the file status (i.e. "added by them", "added by us" etc), and then either does a "git checkout head -- <file>" or a "git rm <file>" based upon which set the file is in and whether it is in another set or not.  This has worked really well and helped me through the large changeset with 3k conflicts.

As git only needs to try and match files in the "deleted by us" and "deleted by them" sets (although including the "deleted in both" set would allow matching renames/moves on both sides), an idea for a potential improvement to the matching algorithm (where you say there's a comment "too many alternatives, pick one") could be to compute a "difference value" for the path\filename of those files in one of the other sets (i.e. "added by us", "added by them" or "added in both"), and chose a potential rename based upon the smallest calculated difference.  The difference value would be the number of differences in folder names, e.g.

deleted in both: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1

added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
(path\name difference = 2)
added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
(path\name difference = 1)
added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
(path\name difference = 2)

So, given the above, git would chose the second "added in both" entry.

Food for thought?  Happy to discuss the idea further.

Regards,

Jeremy Pridmore
Lead Solution Architect


From: Elijah Newren <newren@xxxxxxxxx>
Sent: 07 November 2023 08:05
To: Jeremy Pridmore <jpridmore@xxxxxxxxx>
Cc: git@xxxxxxxxxxxxxxx <git@xxxxxxxxxxxxxxx>; Paul Baumgartner <pbaumgartner@xxxxxxxxx>
Subject: Re: Git Rename Detection Bug

[You don't often get email from newren@xxxxxxxxx. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]

Caution: This email originated from outside of the organisation. Please treat any attachments or links with caution. If in doubt please contact IT


Hi,

On Mon, Nov 6, 2023 at 4:01 AM Jeremy Pridmore <jpridmore@xxxxxxxxx> wrote:
>
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.

I think it might be worthwhile to point out a few facts about rename
handling in Git, as background information that might clarify a few
things about how Git's mental model seems to differ from yours:

  * In git, renames are not tracked; they are detected (based on file
similarity of the commits being compared).
  * So, when you run "git mv A B", there is no rename recorded.  It's
basically the same as "git rm A", followed by creating B with the same
contents, followed by "git add B".
  * The detection happens whenever an operation (diff, log -p, merge,
status, etc.) needs or wants to know about renames.
  * In git, directory renames are detected _after_ normal renames, and
via amalgamation of the individual renames.
  * As a corollary of the last item, the only way individual renames
can be affected by directory renames, is if the individual rename on
one side of history was into a directory that the other side of
history renamed away; in such a case, we apply an _extra_ rename to
move it into the new directory.  But we don't "undo" individual
renames to make them fit the majority-determined directory rename or
anything like that.

(Also, if it matters, all of this is true of both `recursive` and
`ort` merge strategies, i.e. the old default merge backend and the new
one.)

> What did you do before the bug happened? (Steps to reproduce your issue)
> I have two GIT repositories (A and B). Both migrated from the same TFS server using git-tfs tool. I migrated code into A and made lots of changes, including moving 50,000+ files from folder "/Landscape" to "/Landscape/src".  B contains the same code but with various other changes made since my original migration from TFS to A.  All the files in B are still in the "/Landscape" folder.  I recently needed to merge my changes from A to B, so I added A as a remote to B and then performed a number of cherry-picks from A to B, but got stuck when trying to cherry-pick the commit containing the results of moving all files into "/Landscape/src".

In case anyone else wants to dig into this, note that this problem
setup precludes directory rename detection being involved.  Directory
rename detection has a rule where if the source directory wasn't
entirely removed on one side, then that directory was not renamed on
that side.  Seems obvious, but the upshot of that rule is that a
directory cannot be renamed into a subdirectory of itself, because by
virtue of being a subdirectory that means its parent directory still
exists.

So, this is a problem where only regular rename detection (i.e. rename
detection of individual files) is going to be at play.

> What did you expect to happen? (Expected behavior)
> I expected the git rename detection to match all files in A "/Landscape" to files in B "/Landscape/src".

Are all files under "/Landscape" from the merge base commit
individually more similar to the counterpart under "/Landscape/src"
than to files under any other directory?  If not, the expectation goes
against how rename detection has worked in git from the beginning.

> What happened instead? (Actual behavior)
> Although many files were matched successfully, git mismatched over two dozen similarly named files, e.g.
>
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/LandscapeApiService.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx -> Landscape/src/Main/uiMain/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Main/uiMain/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI/Properties/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rates/uiRates/My Project/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Claim.UI/Properties/licenses.licx -> Landscape/src/Rates/uiRates/My Project/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Claim.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/StandardLetters/uiStandardLetters/My Project/licenses.licx -> Landscape/src/Rdt.Landscape.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb -> Landscape/src/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/SystemEvents/uiSystemEvents/My Project/licenses.licx -> Landscape/src/StandardLetters/uiStandardLetters/My Project/licenses.licx
> Incorrect path match: Landscape/Services/busServices/RDT_Logo.ico -> Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup32/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/newrdlogogradiant48shad.ico -> Landscape/src/Startup/uiStartup32/newrdlogogradiant48shad.ico
> Incorrect path match: Landscape/Templates/uiTemplates.Workflow/My Project/licenses.licx -> Landscape/src/SystemEvents/uiSystemEvents/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/Rdt.Utils.UI/Properties/licenses.licx -> Landscape/src/Templates/uiTemplates.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/uiUtils/My Project/licenses.licx -> Landscape/src/Utils/Rdt.Utils.UI/Properties/licenses.licx
> Incorrect name match: Landscape/WebServices/ServiceFabric/Policy/Rdt.Policy.Repository.Service.Fabric.Host/PackageRoot/Data/Swagger/Examples/POST_UKSTasks_Response.json -> Landscape/src/Utils/uiUtils/My Project/licenses.licx
>
>
> What's different between what you expected and what actually happened?
>
> As you can see, although the filenames (and content) are the same,

The content is the same as well?  So, these renames that you label as
incorrect are actually _exact_ renames -- and further, in most cases
they also have an identical basename for the file as well.

Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be; see the "Too many identical
alternatives? Pick one" code comment).

And this, too, is true of both the `recursve` and `ort` backends; no
change has been made to how exact renames are handled.

>  In some cases, it seems that the catalyst has been git thinking that a file from B has been deleted from A, when in fact it has not actually been deleted at all.
>
> For example, the file Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 has not been deleted in A or B, therefore git should not have attempted to rename Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 to Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1, especially as it then attempts to rename Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 to Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 and so on.

Renamed files, from Git's perspective, always involve files that have
been deleted.

> Git status contains, for example:
>         deleted by them: Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1

This means that it wasn't sufficiently similar to any of the new
files...or that _other_ deleted files were more similar to the new
files and thus that they were paired up instead of this file, leaving
this file to simply be marked as deleted.  (Or that other deleted
files were just as similar; tie-breakers are kinda random in such a
case.)

[...]

> Anything else you want to add:
> I can't help but think that this is related to changes made by Palantir:
> https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1

Curious.  What makes you think it's related?

If there is some reason you think it's related, there's an easy way to
check -- just repeat the cherry-pick with the "-s recursive" flag to
use the old merge backend and compare the results.

I'll be somewhat surprised if it's related, though.

> I have tried to unstage these renames using "git restore --staged <file_name>" so I can then apply the correct "git mv" commands

Why?  Just modify all the files to have the correct end results and then commit.

>, but bizzarely, this then results in "git status" reporting a different, smaller set of mismatched names:

As mentioned earlier, git does _not_ record renames.  So, running the
correct "git mv" command doesn't really mean much.  If you use
completely "incorrect" git-mv commands, but then manually tweak files
until they have the correct results, then what's recorded is exactly
the same as if you had used the "correct" git-mv commands.

Further, when you run "git status", it can't access any renames you
did because that information isn't recorded anywhere.  It instead
recomputes renames on the fly.  And it does so each and every time you
run "git status", even if you make no changes between two invocations.
In fact, from this you can probably also deduce that there are other
ways to affect what will be shown as renames, when you have multiple
files similar to any given source file.  In particular, you can cause
a different pairing modifying one of the similar files enough that it
becomes the most similar to the source file, or so that it becomes no
longer the most similar to the source file.  However, what "git
status" reports for renames is irrelevant, since that info won't be
recorded in the commit.  Renames are never recorded.  Anywhere.

In fact, you can even run "git status --no-renames" to just see the
old filenames that were removed and the new ones that were added
without having all the files be paired up as renames.



Hope that helps,
Elijah

________________________________

DISCLAIMER This email is confidential. It should only be read by those persons to whom it is addressed. RDT Ltd accept no liability for the consequences of any person acting, or refraining from acting, on any information contained within this e-mail or any attached documents prior to the receipt by those persons of subsequent written confirmation of that information. If you think this e-mail may not be intended for you, do not use, pass on or copy the transmission in any way. While all reasonable precautions are taken to minimise the risk of transmitting software viruses we advise you to carry out your own virus checks on any attachment to this message. We cannot accept liability for any loss or damage caused by software viruses.




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux