Okay, you nerd-sniped me into responding why I'm super skeptical of this path... On Sat, Jul 30, 2022 at 7:44 AM René Scharfe <l.s.r@xxxxxx> wrote: > > Am 30.07.22 um 04:16 schrieb Elijah Newren: > > On Fri, Jul 29, 2022 at 1:34 PM René Scharfe <l.s.r@xxxxxx> wrote: > >> > >> Am 28.07.22 um 19:11 schrieb Junio C Hamano: > >>> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes: > >>> > >>>> On Thu, Jul 28 2022, Laďa Tesařík wrote: > >>>> > >>>>> 1. I added a file called 'new_file' to a master branch. > >>>>> 2. Then I created branch feature/2 and deleted the file in master > >>>>> 3. Then I deleted the file in branch feature/2 as well. > >>>>> 4. I created 'new_file' on branch feature/2 again. > >>> > >>> It heavily depends on how this creation is done, i.e. what went into > >>> the created file. Imagine that a file existed with content A at > >>> commit 0, both commits 1 and 2 removed it on their forked history, > >>> and then commit 3 added exactly the same content A to the same path: > >>> > >>> 1---3 > >>> / \ > >>> ----0---2---4----> > >>> > >>> When you are about to merge 2 and 3 to create 4, what would a > >>> three-way merge see? > >>> > >>> 0 had content A at path P > >>> 2 said "no we do not want content A at path P" > >>> 3 said "we are happy with content A at path P" > >>> > >>> So the net result is that 0-->3 "one side did not touch A at P" and > >>> 0-->2 "one side removed A at P". > >>> > >>> Three-way merge between X and Y is all about taking what X did if Y > >>> didn't have any opinion on what X touched. This is exactly that > >>> case. The history 0--->3 didn't have any opinion on what should be > >>> in P or whether P should exist, and that is why there is no change > >>> between these two endpoints. > >> > >> The last sentence is not necessarily true. You could also say that > >> 0--->3 cared so much about path P having content A that it brought it > >> back from the void. Determining whether a de-facto revert > >> - intended to return to an uncaring state of "take whatever main has" or > >> - meant to choose *that* specific content which incidentally is on main > >> is not possible from the snapshots at the merge point alone, I think. > >> > >> Checking if 0...3 touched P and leaving that path unmerged out of > >> caution shouldn't be terribly expensive. > > > > I think it might be terribly expensive. > > > > Walking history can easily be the slow part of such an operation, e.g. > > can_fast_forward() taking roughly 100 times as long as doing the > > merge_incore_recursive() portion that creates the new merged toplevel > > tree[1]. (And can_fast_forward() is a form of history walk that > > doesn't involve traversing into any trees, so I suspect it's a cheaper > > history traversal than what is being suggested). > > > > Focusing on the tree traversal side, this suggested change would > > essentially disable the trivial directory resolution optimizations in > > merge-ort[2]. (Note that the trivial directory resolution sped up a > > rebase that didn't involve very many renames by a factor of 25). The > > whole point of that optimization was to avoid walking into trees that > > were only changed on one side, where possible. Your proposed change > > would be saying we always have to walk into trees that either side > > modified...and do so for every intermediate commit as well so that we > > can fully enumerate all (temporarily) changed files. > > True: Compared to just checking if a path was touched by 3, a history > traversal can take arbitrarily long. At least it's bounded by the merge > base and a specific path. How is it bounded by a specific path? How do you know which path to check? Can't this affect many paths? I think to fix this issue, you either have to walk all paths for all intermediate commits, OR: * get the list of paths modified between the merge base and one side of history (just an endpoint diff, not a view of history along the way) * get the list of paths modified between the merge base and the other side of history (again, just an endpoint diff) * take the symmetric difference of those two lists above (let's name the resulting set SD so we can refer to it later). * Walk all intermediate commits between the merge base and the two endpoints being merged, and for each one investigate every path in it corresponding to one of the paths in SD. * As we do the above walk into trees of all intermediate commits, look for a sequence of commits that does some change and then later reverts it Also, I think using the phrase "the merge base" points out some ill-posedness in this problem. For actual merges, I think it's well posed. But... What about cherry-picks, rebases, and reverts? Those pass a funny merge-base to the merge machinery. Using that merge-base does allow us to construct a valid commit range, but in that context, is that range still correct to use with your rule? Perhaps it does make sense to use that range and apply your same rule, but... What if we instead cherry-pick a different way -- using format-patch and `git am -3`? At the time git-am runs, it only has one commit available to it. It cannot determine any merge-base. So what range of commits should it walk to see if the upstream side has both made a change to some file and then undone that change? Do we walk all of history? Do we attempt to find some point in history that appears to have the same versions of the files in the "from" versions of the patch being applied? What if such a commit does not exist? What if many such commits exist? If many such commits exist, what if using one of those commits would give us a different answer than using another commit (meaning, what if we do see a change-and-then-unchange if one of them is used as the merge-base, but do not see such a sequence if the other one is)? I'm certain I could construct examples of each of those "What if" scenarios for git-am. > And renames complicate the picture, but only > full renames (same blob or tree ID) need to be considered. That feels > doable in a reasonable amount of time, but it's not as cheap as ignoring > the history. Making things 25x slower and calling it "not as cheap"? That's some literary jiu-jitsu there. Also, I think the 25x figure may well undercount the expense, perhaps even dramatically, since I'm not sure things can be limited in the form you envision... > Assuming that one side doesn't care about a path because it has the same > content as the merge base is tempting. And reverts that break this > assumption are probably quite rare. Still it led to an unintended > outcome here. Reminds me of a recent chess robot incident [3]. Speed > is nice and safety has a cost, but do we already make the best possible > tradeoff here? I mean, if you want to make an alternative merge strategy that isn't based on a three-way merge, by all means go ahead. We have pluggable backends, after all. It's not clear to me that your suggestion actually leads to a situation with fewer unintended outcomes, though it is clear that performance could be dramatically worse. Let's explore this for a bit (more than I already have above)... Let's say you aim to fix this "unintended outcome" via paying attention to the changes from individual commits along the way. This isn't the only such issue users report that is intrinsic to how three-way content merges work. Do you decide to also address the others? From the top of my head, others include: A) generalize your logic here -- apply it to individual hunks rather than only to whole files. Using your exact same rationale would suggest that if someone modifies a hunk on one side of history and undid it later, then that region of code should not cleanly resolve to what the other side of history has but should instead be marked as conflicted. B) handle commits cherry-picked to both sides "better". If someone applies the same cherry-pick to both sides, and they then further modify the same lines on one side, there should be no "useless" conflict, because both sides "built on top of" the same cherry-picked state. C) make rename detection match files based on changes in individual commits rather than on the overall diff since the merge base It turns out that others have explored part of this space before, e.g. Bram Cohen's "Precise Codeville Merge"[4], which had names for behaviors (A) and (B) (see [5] and [6], respectively; also note that it points out that requirements (A) and (B) can come into conflict). Perhaps if you are interested, you can port his code over and use his algorithm? However, it might be worth noting that he later gave up on it and recommended that people use three-way merges instead[7]. Since Git seems to be the only system that does rename detection, (C) is new territory. But it's also relevant in that a user requested that the merge algorithm incorporate information from changes in individual commits along the history in order to solve a "bug" they ran into[8]. Also, besides implementing Bram's "Precise Codeville Merge" in Git, there may be another pre-existing solution available: using Michael Haggerty's git-imerge[9] and making sure it fully fills out its matrix of merges instead of opportunistically attempting to skip some. That would certainly make sure that changes from each individual commit was considered. But let's say you don't want to use either Precise Codeville Merge or git-imerge, and want to make a copy of our default strategy (merge-ort) and modify it. Once you implement your new strategy, I'm sure the other cases in (A), (B), and (C) will be brought up by people and it's hard to see based on your rationale so far why your new algorithm shouldn't attempt to solve those issues as well. If you do, your performance will really tank. Not only did you kill the trivial directory resolution optimization, you also kill the irrelevant renames optimization (which, IMO, was the crowning jewel of the ort optimizations). But it's also worse than that because (A) and likely (B) force you to compute pairwise diffs of every sequence of commits since the merge base, and (C) forces those all to include rename detection (both for full and partial renames). But it gets worse. Some of those commits in the history back to the merge base will be merge commits, for which we have to be careful with the diff computation. We can't just ignore merge commits, because change-and-then-unchanged code might have been done in merge commits as part of an attempt to fix semantic conflicts. So, to address merge commits, we have to see the changes made by users in those merges relative to the auto-merge state. If we do that, the upshot is that handling (A)-(C) implies that to merge two branches, we have to look at all merge commits in the history of the two branches back to the merge base, and remerge every one of them. Maybe you're okay with all of the above, even with others giving up on merge strategies other than three-way merge. Maybe individually merging all commits since the merge base doesn't phase you. Maybe you can make some kind of distinction where some of (A)-(C) don't matter. Maybe you just capitulate and say you only care about the problem that started this thread and ignore all three of those requests. Maybe you don't care about git-am and exclude it, overlooking the extra difference we've added between it and the other commands (a bit sad since merge-ort had removed one of those differences and I like commands being more consistent). Maybe the resulting "mere" 25x slowdown is good enough for you for your repository sizes. Or maybe you'd be hiding this behind some non-default option that people could select, allowing them to make the tradeoff. Whether any or all of that is true, I just don't see how to maintain it with the rationale and description given, so I'm not comfortable with it affecting the code I have to maintain. So, I'm skeptical. But feel free to create a new merge strategy and prove me wrong. [4] https://tonyg.github.io/revctrl.org/PreciseCodevilleMerge.html [5] https://tonyg.github.io/revctrl.org/ImplicitUndo.html [6] https://tonyg.github.io/revctrl.org/Convergence.html [7] https://bramcohen.livejournal.com/52148.html [8] https://lore.kernel.org/git/CABPp-BHF5j9XTrJOreHMaQ+TrZB1VgZvs-Bq2vFD1yhLBdkV3A@xxxxxxxxxxxxxx/ [9] https://github.com/mhagger/git-imerge