Re: [idea] File history tracking hints

On 10/2/2017 3:18 PM, Stefan Beller wrote:
On Mon, Oct 2, 2017 at 11:51 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:

Sorry to re-re-...-re-stir up such an old topic.

I wasn't really thinking about commit-to-commit hints.
I think these have lots of problems.  (If commit A->B does
"t/* -> tests/*" and commit B->C does "tests/*.c -> xyz/*",
then you need a way to compute a transitive closure to see
the net-net hints for A->C.  I think that quickly spirals
out of control.)
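
To make the composition problem concrete, here is a minimal sketch,
assuming a hypothetical hint format of (old_prefix, new_prefix, suffix)
triples. Nothing like this exists in git; the names apply_hints and
net_renames are made up for illustration. It computes the net A->C
renames by pushing every path through each hint set in turn, which is
the per-path work that a composed "net hint" would have to encode.

```python
def apply_hints(path, hints):
    """Apply the first matching hint to path; return path unchanged if none match.

    Each hint is a hypothetical (old_prefix, new_prefix, suffix) triple.
    """
    for old_prefix, new_prefix, suffix in hints:
        if path.startswith(old_prefix) and path.endswith(suffix):
            return new_prefix + path[len(old_prefix):]
    return path


def net_renames(paths, hint_chain):
    """Push each path through every hint set in order and report what moved."""
    result = {}
    for original in paths:
        current = original
        for hints in hint_chain:
            current = apply_hints(current, hints)
        if current != original:
            result[original] = current
    return result


# Commit A->B: "t/* -> tests/*"; commit B->C: "tests/*.c -> xyz/*"
a_to_b = [("t/", "tests/", "")]
b_to_c = [("tests/", "xyz/", ".c")]

net = net_renames(["t/foo.c", "t/bar.sh", "README"], [a_to_b, b_to_c])
print(net)
```

Note that the two hints do not collapse into a single A->C rule: the .c
files end up under xyz/ while everything else from t/ stops at tests/,
so the net result is two rules, and longer chains can fan out further.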

I agree. Though as a human I can still look at
A..C giving the hint that t/*.c and xyz/*.c ought to
be taken into account for rename detection.
(which is currently done with -M -C --find-copies-harder
as a generic "there are renamed things", and not the very
specific rule, that may be cheaper to examine compared to
these generic rules)

No, I was going in another direction.  For example, if a
tree-entry contains { file-guid, file-name, file-sha, ... }
then when diffing any 2 commits, you can match up files
(and folders) by their guids.  Renames pop out trivially when
their file-names don't match.  File moves pop out when the
file-guids appear in different trees.  Adds and deletes pop
out when file-guids don't have a peer. (I'm glossing over some
of the details, but you get the idea.)
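
A rough sketch of that matching, assuming a hypothetical Entry record
with the fields named above plus the containing directory. This is not
git's tree format, and diff_by_guid is an invented name. It pairs
entries from two commits by guid and classifies what is left over.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    guid: str       # hypothetical stable per-file identifier
    directory: str  # tree (folder) that contains the entry
    name: str       # file name within that tree
    sha: str        # content hash


def path(e):
    return f"{e.directory}/{e.name}" if e.directory else e.name


def diff_by_guid(old, new):
    """Match entries by guid; return (kind, old_path, new_path) tuples."""
    old_by_guid = {e.guid: e for e in old}
    new_by_guid = {e.guid: e for e in new}
    changes = []
    for guid, o in old_by_guid.items():
        n = new_by_guid.get(guid)
        if n is None:
            changes.append(("delete", path(o), None))  # no peer in new
            continue
        if o.directory != n.directory:
            changes.append(("move", path(o), path(n)))    # different tree
        if o.name != n.name:
            changes.append(("rename", path(o), path(n)))  # different name
        if o.sha != n.sha:
            changes.append(("modify", path(o), path(n)))  # different content
    for guid, n in new_by_guid.items():
        if guid not in old_by_guid:
            changes.append(("add", None, path(n)))     # no peer in old
    return changes


old = [Entry("g1", "t", "a.c", "sha1"),
       Entry("g2", "t", "b.c", "sha2"),
       Entry("g3", "", "README", "sha3")]
new = [Entry("g1", "tests", "a.c", "sha1"),
       Entry("g2", "t", "c.c", "sha2"),
       Entry("g4", "", "NEWS", "sha4")]

changes = diff_by_guid(old, new)
for change in changes:
    print(change)
```

The cost is two dictionary builds and one pass over each side, with no
content similarity scoring involved.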

How do you know when a guid needs adaption?

I'm not sure I know what you mean by "adaption".


(cf. origin/jt/packmigrate)
If a commit moves a function out of a file into a new file,
the ideal version control could notice that the function
was moved into a new file and still attribute the original
authors by ignoring the move commit.

I think that's an orthogonal problem.  I could move a function
from one file to an existing file or to a new file; it doesn't
matter.  Attributing those lines back to the original author
(rather than the mover) is a bit of a pipe dream IMHO.  And I
have to wonder if it is always the correct thing to do?  I can
see scenarios where you'd want the mover.

I guess there's nothing stopping the "ideal VC system"
from doing all this line-based analysis, but that shouldn't make
file renames expensive to detect (since that is the granularity
that people and most tools expect the system to work with).


Another series in flight could have modified that
function slightly (fixed a bug), such that it's hard to
reason about these things.

For guids I imagine the new file gets a new guid, such that
tracking the function becomes harder?


Yeah, I'm not thinking about tracking individual functions.


To address Junio's
question, independently added files with the same name will
have 2 different file-guids.  We amend the merge rules to
handle this case and pick one of them (say, the one that
sorts less than the other) as the winner and go on.
All-in-all the solution is not trivial (as there are a few
edge cases to deal with), but it better matches the (casual)
user's perception of what happened to their tree over time.
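
One way that tie-break could look, as a sketch only: each side of the
merge is modeled as a plain path -> guid mapping, which is an assumption
made for illustration, and merge_added_paths is a made-up name. When
both sides added the same path independently, the guid that sorts lower
wins.

```python
def merge_added_paths(ours, theirs):
    """Merge two path -> guid maps; on a clash keep the lower-sorting guid."""
    merged = dict(ours)
    for path, guid in theirs.items():
        if path in merged:
            merged[path] = min(merged[path], guid)  # deterministic winner
        else:
            merged[path] = guid
    return merged


ours = {"Makefile": "guid-0007", "docs/intro.txt": "guid-0100"}
theirs = {"Makefile": "guid-0003", "src/main.c": "guid-0200"}

merged = merge_added_paths(ours, theirs)
print(merged)
```

Because min() does not depend on argument order, both sides of the merge
arrive at the same winner regardless of which one is "ours".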

The GUID would be made up at creation time, I assume?
Is there any input other than the file itself? (I assumed so
initially, such that:
   By having a GUID in the tree, we would divorce from the notion
   of a "content addressable file system" quickly, as we both could
   create the same tree locally (containing the same blobs) and
   yet the trees would have different names due to having different
   GUIDs in them
), which I'd find undesirable.

Right.  A real solution would store the guid data slightly
differently so we could preserve the existing SHA properties.
My example was more conceptual.
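
A toy illustration of the concern and of why the guid would have to
live outside the hashed data. The tree_id function below is a stand-in
invented for this sketch and does not reproduce git's real tree object
encoding; the entry tuples and guid strings are likewise made up.

```python
import hashlib


def tree_id(entries, include_guid):
    """Hash (name, blob_sha, guid) entries, optionally leaving the guid out."""
    h = hashlib.sha1()
    for name, blob_sha, guid in sorted(entries):
        fields = [name, blob_sha] + ([guid] if include_guid else [])
        h.update("\0".join(fields).encode() + b"\n")
    return h.hexdigest()


# Two people create the same file with the same content independently,
# so each one gets a different, locally generated guid.
alice = [("README", "blob-aaa", "guid-alice-1")]
bob = [("README", "blob-aaa", "guid-bob-7")]

same_without_guid = tree_id(alice, False) == tree_id(bob, False)
same_with_guid = tree_id(alice, True) == tree_id(bob, True)
print(same_without_guid, same_with_guid)
```

With the guid hashed in, identical content yields different tree names;
with the guid kept out of the hash input, the content-addressable
property is preserved.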


It also doesn't require expensive code to sniff for renames
on every command (which doesn't scale on really large repos).

I wonder if the rename detection could be offloaded to a server
(which scales) that provides a "hint file" to clients, such that the
clients can then cheaply make use of these specific hints.


I don't know.  Might be easier to add that computation to the
occasional client-side housekeeping (somewhat like the commit
generation number computation we keep talking about).

Thanks
Jeff


