Linus Torvalds <torvalds@xxxxxxxx> wrote:
> On Fri, 20 Oct 2006, Shawn Pearce wrote:
> >
> > I renamed hundreds of small files in one shot and also did a few
> > hundred adds and deletes of other small XML files.  Git generated
> > a lot of those unrelated adds/deletes as rename/modifies, as their
> > content was very similar.  Some people involved in the project
> > freaked as the files actually had nothing in common with one
> > another... except for a lot of XML elements (as they shared the
> > same DTD).
>
> Heh. We can probably tweak the heuristics (one of the _great_ things about
> content detection is that you can fix it after the fact, unlike the
> alternative).
>
> That said, I've personally actually found the content-based similarity
> analysis to often be quite informative, even when (and perhaps
> _especially_ when) it ended up showing something that the actual author of
> the thing didn't intend.
>
> So yeah, I've seen a few strange cases myself, but they've actually been
> interesting. Like seeing how much of a file was just a copyright license,
> and then a file being considered a "copy" just because it didn't actually
> introduce any real new code.

Aside from that one strange case I just mentioned, I've always seen
the strategy work very well.  It's never done something I didn't
expect, and I've never seen copies or renames that I didn't expect to
see, knowing what the author of the change did.

So even though I had a little bit of trouble with that rename
situation above, I'm _very_ happy with the way Git handles renames.
And the truth is that the case above really was quite correct: XML is
very verbose.  When 70% of the file is just required XML to frame
the other 30% of the file's payload, it's not surprising that files
are considered to be similar when they only differ by a little bit
of payload.

-- 
Shawn.
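
To make the 70%/30% point above concrete, here is a minimal sketch of how
shared XML framing can push two unrelated files over a similarity
threshold.  It is not Git's actual algorithm: it uses Python's difflib
line matching as a stand-in, the file contents and helper names are made
up for illustration, and the 0.5 threshold is only assumed to mirror
Git's default rename similarity of 50%.

```python
import difflib

# Assumed threshold for this sketch; chosen to mirror a 50% default.
THRESHOLD = 0.5


def similarity(a: str, b: str) -> float:
    """Return a 0..1 line-based similarity ratio (difflib, not Git's scoring)."""
    matcher = difflib.SequenceMatcher(
        None, a.splitlines(), b.splitlines(), autojunk=False
    )
    return matcher.ratio()


# Hypothetical framing shared by every file that follows the same DTD:
# 7 lines in total (4 before the payload, 3 after it).
FRAME_HEAD = [
    '<?xml version="1.0"?>',
    '<!DOCTYPE record SYSTEM "record.dtd">',
    "<record>",
    "  <fields>",
]
FRAME_TAIL = [
    "  </fields>",
    "  <meta/>",
    "</record>",
]


def make_file(payload):
    """Wrap 3 payload lines in the shared framing, giving a 10-line file."""
    return "\n".join(FRAME_HEAD + payload + FRAME_TAIL)


# Two files whose payloads have nothing in common.
file_a = make_file(
    ["    <name>alpha</name>", "    <size>10</size>", "    <owner>ann</owner>"]
)
file_b = make_file(
    ["    <title>beta</title>", "    <port>8080</port>", "    <host>example</host>"]
)

score = similarity(file_a, file_b)
print(f"similarity: {score:.2f}, looks like a rename: {score > THRESHOLD}")
```

Seven of the ten lines in each file are framing, so the two files score
as 70% similar even though no payload line matches, which is enough to
pair a delete with an unrelated add under the assumed threshold.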