Andy Parkins wrote:
> On Monday 2006 November 20 10:48, Junio C Hamano wrote:
>
>>  - Copies are only picked up from files that were changed in the
>>    same change (i.e. splitting major part of original file and
>>    moving it to somewhere else, while leaving a skeleton in the
>>    original file).  "harder" is needed if the copy original was
>>    untouched, as you found out.
>
> Yep; I understand that.  I also understand that it is done for performance
> reasons.  However, since the typical copy will be one where the source
> doesn't change at the same time, I am arguing that the non-hard copy
> detection isn't much use.

I'm not sure about this. You usually do both pure renames (to reorganize
files, to give a file a better name) and renames with modification, but I
don't think that a copy without modification is very common. Usually you
copy a file because you take one file as a template for the other, or you
split a file, or you join files into one file.

>> The last one is a compromise between performance and thoroughness,
>> and the "harder" is one knob to tweak its behaviour.
>
> I've been poking in tree-diff.c to see if I can understand why it is such a
> performance hog.  I still haven't.  Each file is stored under its hash right?
> So for copy detection why can't you just search for other files with the same
> hash, which I presume is very fast (as it is the basis of what makes git so
> fast)?

Copy and rename detection are done by comparing the contents and
calculating similarity. So to check whether files A and B are copies (not
necessarily pure copies) it is not enough to compare hashes.

That said, it should be fairly easy (if not that useful in real projects,
as I understand it and as stated above) to extend copy detection so that it
also catches pure copies by comparing hashes. Still, --find-copies-harder
would still be needed if the copy original was untouched while the copy
itself was modified.
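To make the difference concrete, here is a minimal sketch in Python. It is not git's actual code: `blob_id` and `find_copies` are hypothetical names, trees are modelled as plain dicts of path to content, the 0.5 threshold is just an illustrative value, and difflib's line-based ratio stands in for whatever similarity measure git really uses. The only part taken from git itself is the way a blob id is computed (SHA-1 over a "blob <size>\0" header followed by the content).

```python
import hashlib
from difflib import SequenceMatcher


def blob_id(data):
    """Git-style blob id: SHA-1 of b'blob <size>\\0' followed by the content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def find_copies(parent, child, threshold=0.5):
    """Return (new_path, source_path, kind) for files added in `child`.

    `parent` and `child` map path -> bytes content (a stand-in for trees).
    Every parent file is a candidate source, including untouched ones,
    which is roughly the "harder" behaviour discussed above.
    """
    # Cheap part: index the parent tree by blob id for exact matches.
    by_id = {}
    for path, data in parent.items():
        by_id.setdefault(blob_id(data), path)

    results = []
    for path, data in child.items():
        if path in parent:
            continue  # not a newly added file
        exact = by_id.get(blob_id(data))
        if exact is not None:
            results.append((path, exact, "exact"))
            continue
        # Expensive part: a copy'n'change has a different hash, so the
        # contents must be compared against every candidate source.
        best_path, best_score = None, 0.0
        for src, src_data in parent.items():
            score = SequenceMatcher(
                None, src_data.splitlines(), data.splitlines()
            ).ratio()
            if score > best_score:
                best_path, best_score = src, score
        if best_path is not None and best_score >= threshold:
            results.append((path, best_path, "similar"))
    return results
```

With a parent tree holding only fileA, a child tree that adds an identical fileB and a slightly edited fileC gets the first one from the hash lookup alone, while the second one is found only by the content comparison loop. That loop is where the cost comes from.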
> I am probably misunderstanding git, but I guess that a copy isn't even needed
> in the database because two files with the same hash in the working copy only
> need storing once and then referencing twice.  So for a copy (again, with my
> simple understanding of git) we'd have:
>
>  commit1 -> tree1 -> fileA = fileA_hash
>     ^
>     |
>  commit2 -> tree2 -> fileA = fileA_hash
>                      fileB = fileB_hash
>
> Doesn't that mean that copy detection is just a matter of searching the parent
> commit trees for references to the same hash?

Think copy'n'change.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git