Re: [idea] File history tracking hints

Stefan Beller <sbeller@xxxxxxxxxx> · Mon, 2 Oct 2017 12:18:59 -0700

On Mon, Oct 2, 2017 at 11:51 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:

> Sorry to re-re-...-re-stir up such an old topic.
>
> I wasn't really thinking about commit-to-commit hints.
> I think these have lots of problems.  (If commit A->B does
> "t/* -> tests/*" and commit B->C does "tests/*.c -> xyz/*",
> then you need a way to compute a transitive closure to see
> the net-net hints for A->C.  I think that quickly spirals
> out of control.)

I agree. Though as a human I can still look at
A..C and give the hint that t/*.c and xyz/*.c ought to
be taken into account for rename detection.
(This is currently done with -M -C --find-copies-harder
as a generic "there are renamed things" search, not with a very
specific rule, which may be cheaper to examine than
these generic rules.)
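
To make "specific rule" a bit more concrete, here is a rough Python
sketch. The hint format (glob, old prefix, new prefix) and the function
names are made up for illustration; nothing like this exists in git
today. Instead of computing a closure over the patterns, it pushes the
concrete paths of A through each step's hints and keeps only the pairs
that moved, which would then be the candidates worth comparing:

```python
from fnmatch import fnmatchcase

# Hypothetical hint format: (glob on the old path, old prefix, new prefix).
HINTS_A_TO_B = [("t/*", "t/", "tests/")]
HINTS_B_TO_C = [("tests/*.c", "tests/", "xyz/")]

def apply_hints(path, hints):
    """Return the path rewritten by the first matching hint, else unchanged."""
    for pattern, old_prefix, new_prefix in hints:
        if fnmatchcase(path, pattern) and path.startswith(old_prefix):
            return new_prefix + path[len(old_prefix):]
    return path

def candidate_renames(paths_in_a, *hint_steps):
    """Map each path in A through every step; keep the ones that moved."""
    candidates = {}
    for path in paths_in_a:
        new_path = path
        for hints in hint_steps:
            new_path = apply_hints(new_path, hints)
        if new_path != path:
            candidates[path] = new_path
    return candidates

paths = ["t/helper.c", "t/README", "src/main.c"]
print(candidate_renames(paths, HINTS_A_TO_B, HINTS_B_TO_C))
```

Only t/helper.c and t/README come out as candidates here; src/main.c
never enters the comparison at all.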

> No, I was going in another direction.  For example, if a
> tree-entry contains { file-guid, file-name, file-sha, ... }
> then when diffing any 2 commits, you can match up files
> (and folders) by their guids.  Renames pop out trivially when
> their file-names don't match.  File moves pop out when the
> file-guids appear in different trees.  Adds and deletes pop
> out when file-guids don't have a peer. (I'm glossing over some
> of the details, but you get the idea.)
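
If I read the proposal right, the matching would look roughly like the
Python sketch below. The Entry layout and the change labels are my
assumptions, not an existing git data structure, and I use the directory
part of a full path to stand in for "appears in a different tree":

```python
from typing import NamedTuple

class Entry(NamedTuple):
    guid: str  # hypothetical stable file id
    path: str  # full path; the directory part stands in for the tree
    sha: str   # content hash

def diff_by_guid(old, new):
    """Pair up entries of two commits by guid and classify each change."""
    old_by_guid = {e.guid: e for e in old}
    new_by_guid = {e.guid: e for e in new}
    changes = []
    for guid, o in old_by_guid.items():
        n = new_by_guid.get(guid)
        if n is None:
            changes.append(("delete", o.path, None))  # no peer in new
        elif o.path != n.path:
            old_dir = o.path.rpartition("/")[0]
            new_dir = n.path.rpartition("/")[0]
            kind = "move" if old_dir != new_dir else "rename"
            changes.append((kind, o.path, n.path))
        elif o.sha != n.sha:
            changes.append(("modify", o.path, n.path))
    for guid, n in new_by_guid.items():
        if guid not in old_by_guid:
            changes.append(("add", None, n.path))  # no peer in old
    return changes

old = [Entry("g1", "t/a.c", "s1"), Entry("g2", "t/b.c", "s2"),
       Entry("g3", "README", "s3"), Entry("g4", "old.txt", "s4")]
new = [Entry("g1", "t/a2.c", "s1"), Entry("g2", "tests/b.c", "s2"),
       Entry("g3", "README", "s3x"), Entry("g5", "new.txt", "s5")]
for change in diff_by_guid(old, new):
    print(change)
```

No content comparison is needed to find the rename or the move, which
is the appeal. It also shows the limit: the guid is per file, so it says
nothing about content that travels between files.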

How do you know when a guid needs adaptation?

(cf. origin/jt/packmigrate)
If a commit moves a function out of a file into a new file,
the ideal version control could notice that the function
was moved into a new file and still attribute the original
authors by ignoring the move commit.

Another series in flight could have modified that
function slightly (fixed a bug), such that it's hard to
reason about these things.

With guids, I imagine the new file gets a new guid, so
tracking the function becomes harder?

> To address Junio's
> question, independently added files with the same name will
> have 2 different file-guids.  We amend the merge rules to
> handle this case and pick one of them (say, the one that
> sorts less than the other) as the winner and go on.
> All-in-all the solution is not trivial (as there are a few
> edge cases to deal with), but it better matches the (casual)
> user's perception of what happened to their tree over time.

The GUID would be made up at creation time, I assume?
Is there any input other than the file itself? I assumed so
initially, which I'd find undesirable:
  By having a GUID in the tree, we would quickly divorce from the notion
  of a "content addressable file system", as we both could
  create the same tree locally (containing the same blobs) and
  yet the trees would have different names due to having different
  GUIDs in them.
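
A toy Python illustration of that concern. The tree_id() helper is a
simplification I made up; it is not the real git object format, and I
am assuming the GUID is random (uuid4) rather than derived from content:

```python
import hashlib
import uuid

def tree_id(entries):
    """Hash a sorted list of entry tuples, loosely like a tree object."""
    payload = "\n".join("\t".join(e) for e in sorted(entries))
    return hashlib.sha1(payload.encode()).hexdigest()

blob = hashlib.sha1(b"hello\n").hexdigest()

# Without GUIDs: two people creating the same content get the same tree.
alice_plain = [("hello.txt", blob)]
bob_plain = [("hello.txt", blob)]
same_without_guid = tree_id(alice_plain) == tree_id(bob_plain)

# With a GUID made up at creation time, the trees get different names
# even though they contain the same blob under the same file name.
alice_guid = [("hello.txt", blob, str(uuid.uuid4()))]
bob_guid = [("hello.txt", blob, str(uuid.uuid4()))]
same_with_guid = tree_id(alice_guid) == tree_id(bob_guid)

print(same_without_guid, same_with_guid)
```

If the GUID were instead derived only from the file's content and name,
the two trees would agree again, but then it would add no information
beyond what the tree already has.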

> It also doesn't require expensive code to sniff for renames
> on every command (which doesn't scale on really large repos).

I wonder if the rename detection could be offloaded to a server
(which scales) that provides a "hint file" to clients, such that the
clients can then cheaply make use of these specific hints.