Linus Torvalds <torvalds@xxxxxxxx> writes:

> ... We're starting to see
> git actually being able to track file content moving between files: even
> when the files themselves didn't move (ie Junio's "git pickaxe" work could
> do things like that).

I've reordered the git-pickaxe series I parked in "pu" during the
1.4.3-rc cycle and merged it into "next".

The earlier one I was futzing with in "pu" had built-in heuristics
and pure mechanisms mixed together in the same patch, which made
for quite a bad development history.  I think the reordered
sequence shows the logical evolution better.

1. git-pickaxe: blame rewritten.

   This implements the infrastructure (parent traversal,
   identifying "corresponding path" in the parent -- aka "handling
   renames", passing blames to the parents and taking
   responsibility for the remainder) and uses the same old
   "single diff with parent file identifies what we inherited from
   the parent" logic git-blame uses for passing blames.

2. git-pickaxe -M: blame line movements within a file.

   This adds logic to find swapped groups of lines in the same
   file.  When the file in the parent had A and B and the child
   has B and A, "single diff with parent" would find that only one
   of A or B is inherited from the parent, not both.  This re-diffs
   the remainder with the parent's file to find both.

   I used to have heuristics to keep trivial groups of lines from
   being subject to this step, but in this version they have been
   removed, so that we can see the core logic and the need for
   heuristics more clearly.  On the other hand, the version I used
   to have in "pu" gave blame to the first match.  This one tries
   to find the best match and assign the blame to it.

3. git-pickaxe -C: blame cut-and-pasted lines.

   This adds logic to find groups of lines brought in from
   existing files in the parent.  We scan the remainder using the
   same logic as -M detection, but it is done against other files
   in the parent.

   There was a heuristic that gave the blame to the parent right
   then and there when we found a copy-and-paste, instead of
   allowing the parent to pass blame further on to its ancestors;
   again I removed this heuristic in the reordered series.

The next logical step is to come up with a good set of heuristics
to avoid the excessive nonsense matches the code currently gives.

Groups of a small number of empty lines, lines with indentation
blanks followed by a closing brace, and '#include' lines that
include common header files occur so commonly that without any
heuristics (which is the state of the "next" branch today) the
algorithm gives surprisingly idiotic results.  For example:

    git -p pickaxe -C -f -n v1.4.3 -- commit.c

tells you that the first line of commit.c in the v1.4.3 release,
which is '#include "cache.h"', came from the first line of
receive-pack.c, which is total nonsense (this particular line
could actually be a bug in the -M or -C logic -- I need to check).

A less "obviously wrong" but still idiotic case is that we find
ll.409-411 came from ll.94-96 of describe.c in commit 908e5310.
These three lines read as:

    409 }
    410 }
    411

While this blame assignment might be technically correct, it does
not add much value to pass blames on in such a case.

On the brighter side, we find that ll.415-419 (the beginning of
function "static int get_one_line()") originally came from
diff-tree.c (commit cee99d22, ll.275-279).
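
To make the -M case in step 2 concrete, here is a toy sketch in
Python of the "A and B swapped" situation.  It is not the C code in
git-pickaxe: it uses difflib in place of the real diff machinery,
treats "best match" as "longest match" (an assumption), and the
function names first_pass, remainder_groups and move_pass are made
up for illustration.

```python
from difflib import SequenceMatcher


def first_pass(parent, child):
    """Single diff with the parent: child line indices it inherited."""
    sm = SequenceMatcher(None, parent, child, autojunk=False)
    inherited = set()
    for m in sm.get_matching_blocks():
        inherited.update(range(m.b, m.b + m.size))
    return inherited


def remainder_groups(n, inherited):
    """Contiguous runs of child lines not yet blamed, as (start, end)."""
    groups, start = [], None
    for i in range(n):
        if i not in inherited:
            if start is None:
                start = i
        elif start is not None:
            groups.append((start, i))
            start = None
    if start is not None:
        groups.append((start, n))
    return groups


def move_pass(parent, child, inherited):
    """Re-diff each leftover group against the whole parent file.

    Assumption: the longest common run stands in for "best match".
    Only one match per group is kept, to keep the sketch short.
    """
    found = set()
    for start, end in remainder_groups(len(child), inherited):
        group = child[start:end]
        sm = SequenceMatcher(None, parent, group, autojunk=False)
        best = sm.find_longest_match(0, len(parent), 0, len(group))
        if best.size:
            first = start + best.b
            found.update(range(first, first + best.size))
    return found


A = ["a1", "a2", "a3"]
B = ["b1", "b2"]
parent = A + B   # parent has A then B
child = B + A    # child has B then A

inherited = first_pass(parent, child)        # only A is found
moved = move_pass(parent, child, inherited)  # B is found on the re-diff
print(sorted(inherited), sorted(moved))
```

The first pass blames only the A lines on the parent; the re-diff
of the remainder picks up B as well.  Note that a leftover group
consisting of a lone closing brace would match almost anywhere in
the parent, which is the kind of nonsense match the heuristics
need to suppress.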