Hi Jeremy, On Fri, Nov 10, 2023 at 3:28 AM Jeremy Pridmore <jpridmore@xxxxxxxxx> wrote: > > Hi Elijah, > > Many thanks for your reply, the detail is much appreciated. I was aware, from recently read articles, that git doesn't record renames as such, hence my investigations into the rename detection, but I also found some interesting points in your email, such as the "git status --no-renames" flag. The fact that you were trying to "undo" renames and "redo the correct ones" suggested there's something you still didn't understand about rename detection, though. The fact that you were worried that "git status" showing the "wrong renames" and the implication that it needed to show the right ones before you committed also suggested there's something that was still not being understood. Likewise, the fact that the renames reported by "git status" change even when you haven't renamed files further but have simply made additional changes to the contents of some files suggests there is something that was still not being understood about the phrase that "renames are detected rather than recorded". While renames are used in the merge algorithm in order to know what files to match up for three-way content merges (so that changes from both sides to the "same" file can be incorporated into the end result), once the merge stops to ask you to resolve conflicts, the detection of renames doesn't matter beyond that point. The fact that renames aren't recorded means whatever renames are shown is only there as a guide to help you understand. All that matters to Git is that all files have the intended content. If some of the files have the wrong content, then by all means go and correct it. If "git mv" commands help you do that, great. If simply editing all files of interest (including adding and deleting files) until they match the expected contents works, that's fine too. Once all files have the correct content, commit it. Git will have no way whatsoever of knowing which of those two routes you picked, and won't behave any differently in the future based on which way you ended up with the right contents in each file. You could have even done the extreme of "git merge -s ours --no-commit ${OTHER_BRANCH}" (merge the other branch but completely ignore *every* change the other side made and then stop for user to make further changes), followed by opening and editing every relevant file to include changes made by the other side, deleting and adding files as relevant, to end up with the right contents in all files and then committed, and Git wouldn't know the difference and no one who pulled your merge commit would be able to tell the difference either. > I think the issue I'm encountering is described by what you say here: > "Exact renames are detected first, before any other method of rename > detection is employed, and other than giving a preference to files > with the same basename, if there are multiple choices it just picks > one (what I'd call at random, though technically based on what the > internal processing order happens to be)" > > That is close to the behaviour I'm seeing. As I mentioned, git seems to think a file has been deleted and then as it continues to detect renames, it's as if it is going through lists of "Local-Base" and "Base-Remote" changes trying to match them up, but the directories of the files being matched are "offset", as highlighted by this list of mismatches: > > (I'd put the paths in a table for easier analysis, but for some reason the emails need to be plain text) > > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 > > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 > > Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 > > Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx > > Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx > > Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx > > Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx > > Given git compares both the content and the directory\filenames, For exact renames, it only will look at filenames if there are multiple 100% matches. If one match is 100%, and the other 99.9999% match, the 100% match is taken. (Since we think that exact renames seem to be describing your problem, there's no point discussing the inexact rename handling.) > and as the repositories have unrelated histories, What do you mean by this? If there's no common point in history (i.e. no merge base), rename detection doesn't even get invoked, so you must mean something different by this phrase than what I would normally take it to mean. Any chance, if you're still in the middle of the merge, that you could run git merge-base --all HEAD MERGE_HEAD and report the output? It'll be commit hashes that don't mean a lot to anyone who doesn't have your repository, but I'm interested in how many commit hashes it responds with (0, 1, or more than 1). > the "Base" file is going to be empty, therefore, even if Local and Remote are identical, they are both 100% different to Base. I assume by "Base" you are referring to the commit in history common to both branches being merged (i.e. what we call the "merge base"), or to the relevant file from that commit. Is that right? Even if I'm right about that, this comment has me lost. Could you clarify it, with one particular example? For example (I'm making stuff up since I'm not familiar with your repo, but showing how to clarify for a given set of path(s)): * In this repository, with the renames detected above, Landscape/Main/somefile.licx is an empty file in the merge base. * On my local side, I renamed Landscape/Main/somefile.licx to Landscape/src/Main/somefile.licx and populated it with some content (with hash A). There is also another new file on the local side (at least new relative to the merge base) named Landscape/src/Other/somefile.licx that happens to have hash A. * On the remote side, Landscape/Main/somefile.licx was left in place but populated with some content (with hash A). * Git is detecting the rename as Landscape/Main/somefile.licx -> Landscape/src/Other/somefile.licx, when I wanted it to detect a rename to Landscape/src/Main/somefile.licx. I'm pretty sure this example is not what you're seeing, even if components of it are, because the empty file thing is impossible with the rest of the story. > That given, I'm not sure why git would state that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx and Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 are a "both added" conflict given their file names and paths are completely different. Any ideas? The fact that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx is marked as "both added" means that this file * did not exist in the merge base * did exist on your local side * did exist on the remote side * the version of this file on the local and remote sides do not match Basically, both sides added a new file and they are not the same, so you have a conflict. The answer for Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1, is an identical set of bullet points; it was a file added by both sides of history that did not exist in the merge base, and the sides are different so you have a conflict in this file too. > I wrote a script to resolve the conflicts best I can which categorises the files into sets according to the file status (i.e. "added by them", "added by us" etc), and then either does a "git checkout head -- <file>" or a "git rm <file>" based upon which set the file is in and whether it is in another set or not. This has worked really well and helped me through the large changeset with 3k conflicts. So, you're simply throwing away the changes made by the remote side? I mean, that's one way to merge, and it might be right in your case, but to someone unfamiliar with your repo it smells like a hack to just ignore conflicts and throw away other people's changes in order to complete the merge. > As git only needs to try and match files in the "deleted by us" and "deleted by them" sets (although including the "deleted in both" set would allow matching renames/moves on both sides), "deleted by us" and "deleted by them" means no rename was detected for the file (at least on the side that the delete is reported for). So, a "deleted in both" only happens when neither side detects a rename, and if the file isn't renamed on either side and both removed it, then there's no conflict -- just delete the file. > an idea for a potential improvement to the matching algorithm (where you say there's a comment "too many alternatives, pick one") could be to compute a "difference value" for the path\filename of those files in one of the other sets (i.e. "added by us", "added by them" or "added in both"), and chose a potential rename based upon the smallest calculated difference. The difference value would be the number of differences in folder names, e.g. > > deleted in both: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 > > added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 > (path\name difference = 2) > added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 > (path\name difference = 1) > added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 > (path\name difference = 2) > > So, given the above, git would chose the second "added in both" entry. Three problems here: * If git were to report "deleted in both" for one path and "added in both" in another as you suggest, that would only be because the files are dissimilar. Not only would they not be an exact match, their contents would have less than 50% similarity. Thus augmenting exact matching like this just wouldn't work, because the files aren't matches at all. The only correct thing to do would be to not report the "deleted in both" because that file is not a rename of anything else and has simply been deleted by both sides. * filename similarity is extraordinarily expensive compared to exact renames, and if not carefully handled, can sometimes rival the cost of file content similarity computations given our spanhash representations. Exact renames are tasked with finding renames even if they are known to not be relevant, simply because exact renames can do so very quickly. If we change that, we throw a monkey wrench in our performance handling elsewhere and have to rethink a number of other things. * While I was optimizing rename detection while investigating the new merge backend, I actually attempted a few versions of filename similarity looking for something that was predictive and useful. While I think the idea was potentially helpful for some repositories, it has a significant risk of hurting merges in other repositories. While what I tried was far from an exhaustive checking of all filename similarity ideas, I came away doubting there was a useful heuristic other than exact matches of basenames (i.e. exact matches of everything in the filename after the final slash). If someone else wants to try more ideas, and do a study on various existing repositories, they can go ahead, but I suspect most work here is going to end up at a dead end and I'm unwilling to put further time of my own into it. > Food for thought? Happy to discuss the idea further. So, I've occasionally seen repositories that have something like the following: * base version: directory named library-x-1.7/ * stable branch: many changes to files under library-x-1.7/ * development branch: library-x-1.7/ no longer exists. However, library-x-1.8/ and library-x-1.9/ both do. Both are obviously "similar" to library-x-1.7/ but both have many changes. What happens when someone tries to merge the stable branch into the development branch? There are two obvious guesses: * Changes from library-x-1.7/ on the stable branch are applied to library-x-1.8/ * Changes from library-x-1.7/ on the stable branch are applied to library-x-1.9/ Either answer can be suboptimal depending on your viewpoint. (Applying the changes to both directories would also have other suboptimal effects even if it might sound right based for this exact problem as I've worded it. But git doesn't do copy detection as part of merges so Git won't choose this third choice ever.) So, which of those two happens? Well, since renames are detected based upon file similarity, the changes will go to whatever file is most similar. What does that mean? It means that both answers above are wrong. Instead: * Some of the changes from library-x-1.7/ on the stable branch are applied to files from library-x-1.8/, while others are applied to files from library-x-1.9/, and to determine which files from which directory are matched up is an individual file choice based on which file in library-x-1.8/ or library-x-1.9/ is most similar to the file from the base version in library-x-1.7/. This answer is clearly worse than either of the two above, and is virtually never what people would want. But it's also fundamental to the idea of matching up files and detecting renames individually based upon file similarity. It's part of both the old and new merge backends in Git, because both were based upon this fundamental idea. So, if you have this kind of situation, or even something like it where files from one old directory could match files from multiple other directories, it's just something you have to be aware of. All that said, here's something that might help: *** Hack to workaround rename detection in special cases where there are directories of multiple possible matches *** 1. Get back to a clean slate from before the merge. $ git merge --abort OR $ cd ${OTHER_DIRECTORY} && git clone ${url} && cd ${REPONAME} && git checkout ${relevant_branch} 2. Temporarily undo your local renames and make a temporary commit $ git mv Landscape/src/* Landscape/ $ git commit -m "TEMPORARY COMMIT" 3. Perform the merge (files will be in Landscape/ instead of Landscape/src/ for now). Don't worry, we'll fix the merge commit later. $ git merge ${REMOTE_BRANCH} [...fix up any conflicts and commit, if needed...] 4. Rename Landscape/ back to Landscape/src/ and make another (temporary) commit. 5. Create a corrected merge commit with the current tree, the commit message from your merge commit, and the correct parents: $ git commit-tree -p HEAD~3 -p HEAD~1^2 -F $(git log -1 --format=%B HEAD~1) HEAD^{tree} [...the above command will print out a new commit id for a corrected merge commit. You can inspect it first, but we just need to pass this to reset --hard...] 6. Reset your branch to this corrected merge commit (which will orphan the temporary commits from steps 2, 3, and 4 so they can later be garbage collected) $ git reset --hard [...output of commit-tree command...] Hope that helps, Elijah