Re: [idea] File history tracking hints

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 01 Oct 2017 12:27:04 +0900

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> writes:

> On 9/29/2017 7:12 PM, Johannes Schindelin wrote:
>
>> Therefore, it would be good to have a way to tell Git about renames
>> explicitly so that it does not even need to use its heuristics.
>
> Agreed.
>
> It would be nice if every file (and tree) had a permanent GUID
> associated with it.  Then the filename/pathname becomes a property
> of the GUIDs.  Then you can exactly know about moves/renames with
> minimal effort (and no guessing).

I actually like the idea of having a mechanism where the user can give
a hint to influence, or an instruction to dictate, how Git determines
"this old path moved to this new path" when comparing two trees.  A
human would not consider a new file (e.g. header file) that begins
with a few dozen commonly-seen boilerplate lines (e.g. copyright
statement) followed by several lines unique to the new contents to
be a rename of a disappearing old file that begins with the same
boilerplate followed by several lines that are different from what
is in the new file, but Git's algorithm would give equal weight to
all of these lines when deciding how similar the new file is to the
old file, and can misidentify a new file to be a rename of an old
file that is unrelated.  Even when Git can and does determine the
pairing correctly, it would be a win if we do not have to recompute
the same pairing every time.  So both as a hint and as a cache, such
a mechanism would make sense [*1*].
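
To illustrate the boilerplate problem, here is a rough sketch in
Python.  It is not Git's actual scoring code; it assumes a simplified
measure that counts shared lines and weights every line equally, and
the file contents are made up for the example.

```python
from collections import Counter

def similarity(old: str, new: str) -> float:
    """Share of lines common to both files, every line weighted equally.

    A simplified stand-in for a rename similarity score, not Git's own.
    """
    old_lines = Counter(old.splitlines())
    new_lines = Counter(new.splitlines())
    # Multiset intersection: lines that appear on both sides.
    common = sum((old_lines & new_lines).values())
    larger = max(sum(old_lines.values()), sum(new_lines.values()), 1)
    return common / larger

# Hypothetical files: 24 lines of shared boilerplate, 6 unique lines each.
boilerplate = "\n".join(f"/* copyright line {i} */" for i in range(24))
old_file = boilerplate + "\n" + "\n".join(f"int old_{i};" for i in range(6))
new_file = boilerplate + "\n" + "\n".join(f"int new_{i};" for i in range(6))

score = similarity(old_file, new_file)
print(f"{score:.2f}")  # 0.80
```

Under this measure the two unrelated files score 0.80, well above a
50% threshold, even though none of the lines that matter are shared.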

But "file ID" does not have any place to contribute to such a
mechanism.  Each of two developers working on the same project in a
distributed environment can grab the same gist and create a new file
in his or her tree, perhaps at the same path or at a different
path.  At the time of such an addition, there is no way for each of
them to give these two files the same "file ID" (that is how the
world works in the distributed environment after all)---which "file
ID" should survive when their two histories finally meet and result
in a single file after a merge?  A file with "file ID" may not be
renamed but may be copied and evolve separately and differently.
Which one should inherit its original "file ID" and how does having
"file ID" help us identify the other one is equally related to the
original file?  These two are merely examples of problems that "file
ID"s would cause while solving "only" what can be expressed in "git
diff -M" output (the latter illustrates that it does not even help
showing "git diff -C").

And when we stop limiting ourselves to the whole-file renames and
copies (which can be expressed in "git diff" output) but also want
to help finer-grained operation like "git blame", we'd want to have
something that helps in situations like a single file's contents
split into multiple files and multiple files' contents concatenated
into a single new file, both of which happen during code
refactoring.  "file ID" would not contribute an iota in helping
these situations.

I've said this a number of times, and I'll say this again, but one of
the most important messages in our list archive is gmane:217 aka

https://public-inbox.org/git/Pine.LNX.4.58.0504150753440.7211@xxxxxxxxxxxxxxx/

I'd encourage people to read and re-read that message until they can
recite it by heart.

Linus mentions "CVS annotate"; the message was written long before
we had "git blame", and it served as a guide when designing how we
dig contents movement in various parts of the system.

[Footnote]

*1* There are many possible implementations; the most obvious would
    be to record a pair of blob object names and instruct Git, when
    it sees one side of a pair disappearing and the other side of
    the pair appearing, to take the pair as a rename.  And that would
    be sufficient for "git log -M".

    Such a cache/hint alone however would not help much in "git
    merge" without further work, as we merge using only the tree
    state of the three points in the history (i.e. the common
    ancestor and two tips).  merge-recursive needs to be taught to
    find the renames at each commit it finds throughout the history
    from the ancestor and each tip and carry its findings through if
    it wants to take advantage of such a hint/cache.
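
The blob-pair idea in the first paragraph of this footnote could be
sketched roughly as below.  This is only an illustration under stated
assumptions: pair_renames(), the dict-based hint table and the short
blob names are all hypothetical, and nothing like this exists in Git.

```python
def pair_renames(deleted, added, hints):
    """Pair disappearing paths with appearing paths using recorded hints.

    deleted, added: dict mapping path -> blob object name (hypothetical
                    summary of what changed between two trees).
    hints:          dict mapping old blob name -> new blob name, i.e.
                    the recorded "take this pair as a rename" entries.
    """
    new_path_by_blob = {blob: path for path, blob in added.items()}
    renames = []
    for old_path, old_blob in sorted(deleted.items()):
        new_blob = hints.get(old_blob)
        new_path = new_path_by_blob.get(new_blob)
        if new_path is not None:
            # Both sides of a recorded pair are present: call it a rename.
            renames.append((old_path, new_path))
            del new_path_by_blob[new_blob]  # each added path is used once
    return renames

# Placeholder blob names; real ones would be full object names.
deleted = {"old/util.h": "aaa111", "old/gone.h": "ddd444"}
added = {"new/util.h": "bbb222", "new/other.h": "ccc333"}
hints = {"aaa111": "bbb222"}

renames = pair_renames(deleted, added, hints)
print(renames)  # [('old/util.h', 'new/util.h')]
```

Paths without a recorded pair ("old/gone.h" and "new/other.h" here)
are left alone, so the usual heuristics could still be applied to
them.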