Re: [RFH] Finding all commits that touch the same files as a specific commit

"Sverre Rabbelier" <alturin@xxxxxxxxx> · Sun, 13 Jul 2008 16:43:21 +0200

On Sun, Jul 13, 2008 at 3:24 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:

<explanation of the git log traversal machinery snipped>

> In order to follow renames reliably in a merge heavy history, you need to
> keep track of the pathname the file you are interested in appears as _in
> each commit_.  As you traverse down the history, you pass down the
> pathname to the parent you visit, so while you are traversing from 'x' to
> earlier 'x', you will keep following "git-gui/git-gui.sh", while you
> traverse down to 'o', you will inspect "git-gui.sh".
>
> The data structure the revision traversal machinery uses does not support
> this "path-per-commit" natively.

Would it be possible to go for a slightly less complicated approach:
instead of replacing the tracked file when passing it down, append the
new name to it? We already have a list of files we are tracking, so I
assume the data structure does support that. This would run the risk of
tracking too much (e.g., you rename a.txt => b.txt, and then later on
create/rename a new a.txt which is then tracked as well).
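
To make the trade-off concrete, here is a toy sketch in Python (not git
code). The history, the commit names and the follow() helper are all made
up for illustration; it assumes a linear history walked newest-first and
rename information that is already known. The over-tracking shows up here
in a mirrored form of the example above: an older, unrelated b.txt gets
picked up once both names are in the list.

```python
# Toy linear history, newest first. Each change is (status, old_path, new_path):
# "A" add, "M" modify, "D" delete, "R" rename. All names are hypothetical.
HISTORY = [
    ("c5", [("M", "b.txt", "b.txt")]),
    ("c4", [("R", "a.txt", "b.txt")]),   # the rename we want to follow
    ("c3", [("A", None, "a.txt")]),
    ("c2", [("D", "b.txt", None)]),      # an older, unrelated b.txt goes away
    ("c1", [("A", None, "b.txt")]),      # ... and this is where it was added
]


def follow(history, path, append):
    """Return the ids of commits that touch the tracked path(s).

    append=True  keeps the new name and adds the old one at a rename.
    append=False replaces the tracked name with the old one at a rename.
    """
    tracked = {path}
    hits = []
    for commit_id, changes in history:
        touched = False
        renames = []
        for status, old, new in changes:
            if old in tracked or new in tracked:
                touched = True
            if status == "R" and new in tracked:
                renames.append((old, new))
        if touched:
            hits.append(commit_id)
        # Update the tracked names only after the whole commit was inspected.
        for old, new in renames:
            if not append:
                tracked.discard(new)
            tracked.add(old)
    return hits


print(follow(HISTORY, "b.txt", append=False))
print(follow(HISTORY, "b.txt", append=True))
```

With replacing, the walk stops reporting commits once it reaches c3; with
appending, c2 and c1 are reported as well even though they concern a
different file that merely shared the name.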

> This is the reason "git-blame" uses its own traversal engine.  It keeps
> track of <commit, path> pairs so that it can mark which line came from
> what path in what commit.  When copy/move detection are used, we can even
> notice that the contents we are interested in came from more than one file
> in the same commits, and the data structure supports it (i.e. it is not
> just a pointer to a single string from "struct commit").

So what could be done is to use a blame-like mechanism that invokes
rename detection on each interesting commit and then records that
information? Purely hypothetical though, since I know neither the
revision machinery nor the blame code and have no time to do so.
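
As a rough illustration of what I mean by carrying <commit, path> pairs,
here is a toy Python sketch (again, not git code). The commit ids, the
PARENTS/TREES/RENAMES tables and the walk() helper are invented for the
example; RENAMES stands in for whatever rename detection would report
between a commit and one of its parents, which I simply assume is
available.

```python
from collections import deque

# Hypothetical history: M merges the 'x' line (file lives in a subdirectory)
# with the 'o' line (file lives at the top level).
PARENTS = {
    "M": ["x2", "o2"],
    "x2": ["x1"],
    "x1": [],
    "o2": ["o1"],
    "o1": [],
}

# Paths present in each commit's tree.
TREES = {
    "M": {"git-gui/git-gui.sh"},
    "x2": {"git-gui/git-gui.sh"},
    "x1": {"git-gui/git-gui.sh"},
    "o2": {"git-gui.sh"},
    "o1": {"git-gui.sh"},
}

# (child, parent) -> {path in child: path in parent}. Assumed output of
# rename detection run per parent; not computed here.
RENAMES = {
    ("M", "o2"): {"git-gui/git-gui.sh": "git-gui.sh"},
}


def walk(start_commit, start_path):
    """Breadth-first walk that hands each parent its own pathname."""
    seen = set()
    queue = deque([(start_commit, start_path)])
    visited = []
    while queue:
        commit, path = queue.popleft()
        if (commit, path) in seen:
            continue
        seen.add((commit, path))
        visited.append((commit, path))
        for parent in PARENTS[commit]:
            # Translate the path for this particular parent, if it was renamed.
            parent_path = RENAMES.get((commit, parent), {}).get(path, path)
            if parent_path in TREES[parent]:
                queue.append((parent, parent_path))
    return visited


for commit, path in walk("M", "git-gui/git-gui.sh"):
    print(commit, path)
```

The point is only that the path travels with the commit instead of living
in one global list, so the 'x' side keeps looking at
"git-gui/git-gui.sh" while the 'o' side looks at "git-gui.sh".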

> For the purpose of "git log" traversal and the "file renames" people
> usually talk about, this is overkill; you should however be able to
> backport the basic idea to revision machinery, if you really cared.

Right, that'd teach git log how to follow across renames in an
intelligent manner that also works for non-linear histories, at the
cost of using up more memory and CPU?

> In a real history, "file rename" is a very ill defined concept and is not
> always useful in practice.  I did a fairly detailed analysis on one
> real-world history more than two years ago, which is found here:
>
>    http://thread.gmane.org/gmane.comp.version-control.git/13746/focus=13769

Aye, I agree that a 'rename' is hard to define and that a lot of
effort could be put into supporting 'renames' that are not trivial
(i.e., more complex than 'git mv foo.txt bar.txt').

> In our own "git.git" history, the evolution of what finally landed in
> revision.c is interesting.  The interesting part of content movement never
> involved any file renames --- only bits and pieces migrated over across
> many files.  That is not something "file rename tracking", even with an
> extension to the revision traversal machinery to keep one path per commit
> to record the file you are interested in, can ever give meaningful
> explanation of the history.  You need a lot more fine grained "blame"
> traversal machinery for that.

This makes sense, but using the blame traversal machinery is overkill
for what I am interested in. What I think would be a good goal is
supporting the subtree merge strategy. It would be nice if 'git log
--follow-subtree-merge refspec -- filefilter' or such would Just Work
(TM). Perhaps the hunk-tracking I am working on with Dscho could
help make 'git log --numstat' more accurate. Those two combined (git
log being able to follow across subtree merges and being able to
recognise hunks being moved) would be all that I need.

-- 
Cheers,

Sverre Rabbelier