On Mon, 22 Oct 2007, Andy C wrote:
>
> from diffcore-delta.c:
> "
>  * Idea here is very simple.
>  *
>  * Almost all data we are interested in are text, but sometimes we have
>  * to deal with binary data. So we cut them into chunks delimited by
>  * LF byte, or 64-byte sequence, whichever comes first, and hash them.
> "
>
> What is the similarity metric for binary files? Is it related to the
> number of 64-byte chunks they have in common? How do you know that
> the 64-byte chunks match up? Suppose I have a 10k file, and then I
> insert 10 random bytes at positions 1000, 2000, 3000, etc. How does
> that work?

The 'LF' byte will start a new window, so even with binary files
(assuming a random-enough distribution that you have *some* LF bytes!),
you basically get a synchronization event on each LF.

So the 64-byte hunks "line up" automatically, even for binary files.
Maybe not immediately, but soon enough: on average, assuming a
reasonably compressed (i.e. virtually random) binary format, a byte is
an LF with probability 1/256, so you should find an LF about every 256
bytes, which is roughly every fourth 64-byte hunk.

There are probably smarter ways to guarantee synchronization even in
the absence of certain bit patterns, but I bet the "break every 64
bytes or when you see an LF" thing works well for any reasonable binary
file too. But text files are obviously the most important case, and
diffs on binaries are somewhat hit-and-miss regardless.

		Linus
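
[Editor's note: as a concrete illustration of the chunking scheme
described above, here is a minimal sketch in C. This is not the actual
diffcore-delta.c code: the names hash_chunk, chunk_hashes, and MAX_CHUNK
are made up for this example, and FNV-1a stands in for whatever hash the
real code uses. Similarity is then estimated from how many chunk hashes
two files share; the actual accounting in diffcore-delta.c is more
detailed, but this conveys the idea.]

#include <stddef.h>

#define MAX_CHUNK 64

/* FNV-1a, a stand-in for the hash the real code uses */
static unsigned int hash_chunk(const unsigned char *buf, size_t len)
{
	unsigned int h = 2166136261u;

	while (len--) {
		h ^= *buf++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Cut buf into chunks that end at each LF byte or after 64 bytes,
 * whichever comes first, and record one hash per chunk.
 * Returns the number of chunks written to out.
 */
static size_t chunk_hashes(const unsigned char *buf, size_t size,
			   unsigned int *out)
{
	size_t start = 0, i, n = 0;

	for (i = 0; i < size; i++) {
		/* cut after an LF, or once the chunk hits 64 bytes */
		if (buf[i] == '\n' || i + 1 - start == MAX_CHUNK) {
			out[n++] = hash_chunk(buf + start, i + 1 - start);
			start = i + 1;
		}
	}
	if (start < size)	/* trailing partial chunk */
		out[n++] = hash_chunk(buf + start, size - start);
	return n;
}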
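
[Editor's note: and to make the resynchronization point concrete, a
hypothetical demo reusing chunk_hashes() from the sketch above. It
inserts one byte into the middle of 10k of pseudo-random data (the
offset 5000 and value 0x5a are arbitrary) and compares chunk hashes
from the tail; the hashes disagree only until the first LF past the
insertion point, which is exactly the synchronization event described
in the reply.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	static unsigned char a[10240], b[10240 + 1];
	static unsigned int ha[10240], hb[10240 + 1];
	size_t na, nb, i, common = 0;

	/* pseudo-random "binary" data: an LF byte turns up about
	 * once every 256 bytes, i.e. every fourth 64-byte hunk */
	srand(1);
	for (i = 0; i < sizeof(a); i++)
		a[i] = rand() & 0xff;

	/* b is a with a single byte inserted at offset 5000 */
	memcpy(b, a, 5000);
	b[5000] = 0x5a;
	memcpy(b + 5001, a + 5000, sizeof(a) - 5000);

	na = chunk_hashes(a, sizeof(a), ha);
	nb = chunk_hashes(b, sizeof(b), hb);

	/* compare from the end: chunk boundaries resynchronize at
	 * the first LF past the insert, so everything after that
	 * point hashes identically in both buffers */
	while (common < na && common < nb &&
	       ha[na - 1 - common] == hb[nb - 1 - common])
		common++;
	printf("%zu of %zu chunks at the tail hash identically\n",
	       common, na);
	return 0;
}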