Re: Rename detection at git log

Jakub Narebski <jnareb@xxxxxxxxx> · Mon, 20 Nov 2006 12:15:30 +0100

Andy Parkins wrote:

> On Monday 2006 November 20 10:48, Junio C Hamano wrote:
>
>>  - Copies are only picked up from files that were changed in the
>>    same change (i.e. splitting major part of original file and
>>    moving it to somewhere else, while leaving a skeleton in the
>>    original file).  "harder" is needed if the copy original was
>>    untouched, as you found out.
> 
> Yep; I understand that.  I also understand that it is done for performance 
> reasons.  However, since the typical copy will be one where the source 
> doesn't change at the same time, I am arguing that the non-hard copy 
> detection isn't much use.

I'm not sure about this. You usually do both pure renames (to reorganize
files, to give a file a better name) and renames with modification, but
I don't think that copying without modification is very common. Usually you
copy a file because you use it as a template for another, or you split
a file, or you join several files into one.

>> The last one is a compromise between performance and thoroughness,
>> and the "harder" is one knob to tweak its behaviour.
> 
> I've been poking in tree-diff.c to see if I can understand why it is such a 
> performance hog.  I still haven't.  Each file is stored under its hash right?  
> So for copy detection why can't you just search for other files with the same 
> hash, which I presume is very fast (as it is the basis of what makes git so 
> fast)?

Copy and rename detection are done by comparing file contents and
calculating a similarity score. So to check whether files A and B are
copies (not necessarily pure copies), it is not enough to compare hashes.
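
To illustrate the point, here is a minimal Python sketch. The blob hash
follows git's object format (SHA-1 over a "blob <size>\0" header plus the
content), but the similarity() function is only a stand-in built on
difflib; it is an assumption for illustration and not the algorithm git
itself uses. The file contents are made up.

```python
import difflib
import hashlib


def blob_hash(data: bytes) -> str:
    """Return the SHA-1 that git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def similarity(a: bytes, b: bytes) -> float:
    """Stand-in line-based similarity in [0, 1]; NOT git's own algorithm."""
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


# Hypothetical file contents, for illustration only.
original = b"line one\nline two\nline three\nline four\n"
pure_copy = original
edited_copy = original.replace(b"line two", b"line 2")

# A pure copy has the same hash; a copy with one changed line does not.
same_hash_pure = blob_hash(original) == blob_hash(pure_copy)
same_hash_edited = blob_hash(original) == blob_hash(edited_copy)

# Only a content comparison still shows that the edited copy is related.
score_edited = similarity(original, edited_copy)
```

The hashes of the original and the edited copy share nothing useful, so a
hash lookup cannot say "these two are 75% alike"; that takes a pairwise
content comparison, which is where the cost comes from.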

That said, it should be fairly easy (if not that useful in real projects,
as stated above) to extend copy detection to find pure copies by comparing
hashes. Even so, --find-copies-harder would still be needed when the copy
original was untouched while the copy itself was modified.
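
A rough sketch of what such a hash-only check could look like, in Python.
Trees are modelled as plain dicts mapping path to blob hash, and
find_pure_copies() is a hypothetical helper made up for this example; none
of this corresponds to actual git code.

```python
def find_pure_copies(parent_tree, child_tree):
    """Return (source, copy) pairs for paths that are new in child_tree
    and whose blob hash already exists in parent_tree."""
    # Index the parent tree by hash; keep the first path (sorted) per hash.
    by_hash = {}
    for path, blob in sorted(parent_tree.items()):
        by_hash.setdefault(blob, path)

    copies = []
    for path, blob in sorted(child_tree.items()):
        if path not in parent_tree and blob in by_hash:
            copies.append((by_hash[blob], path))
    return copies


# Hypothetical trees; the hash strings are placeholders, not real SHA-1s.
parent = {"fileA": "hashA"}
child_pure = {"fileA": "hashA", "fileB": "hashA"}    # fileB copied verbatim
child_edited = {"fileA": "hashA", "fileB": "hashB"}  # fileB copied, then edited

pure_result = find_pure_copies(parent, child_pure)
edited_result = find_pure_copies(parent, child_edited)
```

The lookup finds the verbatim copy at once, but returns nothing for the
edited one, even though fileA is just as much its source.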

> I am probably misunderstanding git, but I guess that a copy isn't even needed 
> in the database because two files with the same hash in the working copy only 
> need storing once and then referencing twice.  So for a copy (again, with my 
> simple understanding of git) we'd have:
> 
>  commit1 -> tree1 -> fileA = fileA_hash
>     ^
>     |
>  commit2 -> tree2 -> fileA = fileA_hash
>                      fileB = fileB_hash
> 
> Doesn't that mean that copy detection is just a matter of searching the parent 
> commit trees for references to the same hash?

Think copy'n'change.
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
