Re: Git Rename Detection Bug

Philip Oakley <philipoakley@iee.email> · Wed, 15 Nov 2023 14:36:27 +0000

Hi Elijah,
Sorry for the delay in replying.

On 11/11/2023 15:13, Elijah Newren wrote:
> Hi,
> 
> On Sat, Nov 11, 2023 at 3:08 AM Philip Oakley <philipoakley@iee.email> wrote:
>>
>> Hi all,
>>
>> On 11/11/2023 05:46, Elijah Newren wrote:
>>> The fact that you were trying to "undo" renames and "redo the correct
>>> ones" suggested there's something you still didn't understand about
>>> rename detection, though.
>>
>>
>> Could I suggest that we are missing a piece of terminology, to wit,
>> BLOBSAME. It's a compatriot to TREESAME, as used in `git log` for
>> history simplification (based on a tree's pathspec, most commonly a
>> commit's top level path).
> 
> We could add it, but I'm not sure how it helps.  We already had 'exact
> rename' which seems to fit the bill as well,

My point was that we already had the confusion of mental models, with
both sides essentially thinking they had an "exact rename", hence my
thought was to add a rather distinct technical name which reflected the
Git mind-shift. Without something to bring folks up short they'll
continue, erroneously, with their prior mental models.

> and 'blob' is something
> someone new to Git is unlikely to know.

I'd agree that BLOBSAME is new, but we should be proactive in ensuring
folk do have the mind shift from the old centralised VCS misunderstandings.

> 
> Perhaps it's useful in some other context, though?
> 
>> File rename, at its most basic, is when the blob associated with that
>> changed path is identical, i.e. BLOBSAME. There is no need to 'record'
>> the action of renaming, moving or whatever, the content sameness is
>> right there, in plain sight, as an identical blob name.   After that
>> (files with slight variations) it is a load of heuristics, but starting
>> with BLOBSAME we see how easy the basic rename detection is, and why
>> renames (and de-dup) don't need recording.
> 
> This is incorrect.  Let's say you have a file foo:
>    * base version: foo has hash A
>    * our version: foo has been renamed to bar, but bar still has hash A
>    * their version: foo has been modified; it now has hash B
> 
> The foo->bar is an exact rename (or they are BLOBSAME if you prefer),
> but the renaming/moving/whatever is a critical piece of information
> because the changes to foo in 'their' version need to be applied to
> bar to get the correct end results.

Isn't that what I thought I'd said?
Hash A = Hash A => identical content;
Hash A != B => different content.
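
To make that concrete, here is a rough Python sketch of your foo/bar
example. It is only an illustration of the idea, not how 'ort' actually
does it: the dict-of-paths "trees" and the helper names (blob_id,
exact_renames) are my own assumptions. The one thing taken from Git
itself is that a blob's name is the SHA-1 of a "blob <size>\0" header
followed by the content.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Name Git gives a blob: SHA-1 over 'blob <size>\\0' plus the content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def exact_renames(base: dict, side: dict) -> dict:
    """Pair each path deleted on `side` with the single added path that
    holds the identical blob (hypothetical helper, toy model only)."""
    added = {p: h for p, h in side.items() if p not in base}
    renames = {}
    for old, old_hash in base.items():
        if old in side:
            continue  # path still exists on this side, so not a rename source
        matches = [new for new, new_hash in added.items() if new_hash == old_hash]
        if len(matches) == 1:
            renames[old] = matches[0]
    return renames


A = blob_id(b"original\n")
B = blob_id(b"modified\n")

base = {"foo": A}
ours = {"bar": A}      # foo renamed to bar, content untouched
theirs = {"foo": B}    # foo modified in place

renames = exact_renames(base, ours)

# Their change to foo has to land on the renamed path.
merged = dict(ours)
for old, new in renames.items():
    if old in theirs and theirs[old] != base[old]:
        merged[new] = theirs[old]
```

Nothing about the rename is stored anywhere; it is inferred from the
identical blob name, and it is still needed to decide that their hash B
belongs at bar.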

> 
> I do not know if in Jeremy's case foo has been modified on the
> unrenamed side.  But the following hypothetical is exactly the type of
> problem Jeremy is hitting: what should happen when 'our' version has
> both a new 'bar' and a new 'baz' file that each have hash A?  In that
> case, to which one was foo renamed?  It's inherently ambiguous.

True, the terminology hasn't kept up with the methodology for blob
content, and the independent meta-data. In previous 'ort' discussions I
didn't really understand what the '1/2' renames (and other
nomenclatures) really meant with respect to paths, filenames, content
and the ours / theirs / base distinctions.
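
For what it's worth, the ambiguous bar/baz case falls out of the same
toy model. Again, this is only a sketch under my own assumptions: the
"trees" are plain dicts, "A" is a placeholder standing in for a real
blob name, and rename_candidates is a hypothetical helper, not anything
in Git.

```python
def rename_candidates(base: dict, side: dict) -> dict:
    """For each path deleted on `side`, list every added path whose blob
    is identical (hypothetical helper, toy model only)."""
    added = {p: h for p, h in side.items() if p not in base}
    return {
        old: sorted(new for new, new_hash in added.items() if new_hash == old_hash)
        for old, old_hash in base.items()
        if old not in side
    }


base = {"foo": "A"}
ours = {"bar": "A", "baz": "A"}   # two new paths, both with foo's content

candidates = rename_candidates(base, ours)

# More than one identical-blob candidate means content alone cannot
# say which path foo was renamed to.
ambiguous = {old for old, new in candidates.items() if len(new) > 1}
```

Here foo ends up with both bar and baz as candidates, so blob identity
on its own gives no way to choose between them.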
> 
>> The heuristics of 'rename with small change' is trickier, but for a
>> basic understanding, starting at BLOBSAME (and TREESAME for directory
>> renames) should make it easier to grasp the concepts.
> 
> Interesting; TREESAME isn't used within directory rename detection
> currently; it is only used currently when two (or three) trees with
> the same name are TREESAME, in order to potentially avoid recursing
> into the tree.  But even then, having two trees with the same name be
> TREESAME isn't enough on its own to avoid recursing into that tree,
> because the other side could have added files within the same-named
> tree and we need to know about those added files because they could be
> part of renames involving other files outside that tree. 

Definitely easy to get confused on these cases...

>      There would
> probably be similar challenges to attempting to apply the concept of
> TREESAME to directory rename detection to two trees of different
> names, but it's at least an interesting idea.  Hmm....
> 

Thanks for the insights.

Philip