On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
It might be that the fast delta thing is a good way to ask "is this even worth considering", to cut down the O(m*n) rename/copy detection to something much smaller, and then use xdelta() to actually figure out what is a good rename and what isn't from a much smaller set of potential targets.
Here's a possible way to do that first cut. Basically, compute a short (256-bit) fingerprint for each file, such that the Hamming distance between two fingerprints is a measure of their similarity. I'll include a draft write-up below.

My initial implementation seems reasonably fast and works great for 4000 (decompressed) files (25M) randomly plucked from an old git.git repository without packs. It works OK for comparing tar archives of GCC releases, but there it becomes clear that the random walks aren't so random anymore and become dominated by repeated information, such as tar headers. Speed is about 10MB/sec on my PowerBook, but one could cache fingerprints so they only need to be computed once.

The nice thing is that one can quickly find similar files using only the fingerprint (and, in practice, the file size), no filenames: this seems to fit the git model well.

I'll attach my test implementation below; it uses David Mazieres' Rabinpoly code and D. Phillips's fls code. Please don't mind my C coding, it's not my native language. Also, this may have some Darwinisms, although it should work on Linux too.

-Geert

Estimating Similarity

For estimating similarity between strings A and B, let SA and SB be the collections of all substrings of length W of A and B. Similarity is then defined as the ratio of the intersection and the union of SA and SB.

The length W of these substrings is the window size, and here it is chosen somewhat arbitrarily to be 48. The idea is to make them not so short that all context is lost (as with counting symbol frequencies), but not so long that a few small changes can affect a large portion of the substrings. Of course, a single symbol change may affect up to 48 substrings.

Let "&" be the string concatenation operator. If A = S2 & S1 & S2 & S3 & S2 and B = S2 & S3 & S2 & S1 & S2, then if the length of S2 is at least W - 1, the strings will have the same set of substrings and will be considered equal for the purpose of similarity checking. This behavior is actually welcome, since reordering sufficiently separated pieces of a document does not make it substantially different.

Instead of computing the ratio of identical substrings directly, compute a 1-bit hash for each substring and calculate the difference between the number of zeroes and ones. If the hashes appear random, this difference follows a binomial distribution. Two files are considered "likely similar" if their differences have the same sign.

The assumption that the hashes are randomly distributed is not true if there are many repeated substrings. For most applications it is sufficient to ignore such repetitions (by using a small cache of recently encountered hashes), as they do not convey much actual information. For example, for the purpose of finding small deltas between strings, duplicating existing text will not significantly increase the delta.
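To make the zero/one counting concrete, here is a rough sketch in C. The names and the FNV-style stand-in hash are mine, not from the attached gsimm.c, which instead keeps a Rabin polynomial hash incrementally over a sliding window (rabinpoly.c) rather than rehashing every substring, and which could also skip repeated hashes via a small cache as suggested above:

  #include <stddef.h>
  #include <stdint.h>

  #define WINDOW 48

  /* Placeholder 1-bit hash of one WINDOW-byte substring.  A real
   * implementation would update a sliding Rabin polynomial hash per
   * character instead of rehashing all WINDOW bytes at each position. */
  static unsigned hash_bit(const unsigned char *p)
  {
      uint64_t h = 14695981039346656037ull;     /* FNV-1a, stand-in only */
      for (size_t i = 0; i < WINDOW; i++)
          h = (h ^ p[i]) * 1099511628211ull;
      return (unsigned)(h & 1);
  }

  /* Difference between the number of 1-hashes and 0-hashes over all
   * length-WINDOW substrings.  Two files whose differences have the
   * same sign are "likely similar" in the sense described above.  */
  static long ones_minus_zeroes(const unsigned char *buf, size_t len)
  {
      long diff = 0;
      if (len < WINDOW)
          return 0;
      for (size_t i = 0; i + WINDOW <= len; i++)
          diff += hash_bit(buf + i) ? 1 : -1;
      return diff;
  }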
For a string with N substrings, of which K changed, perform a random walk of N steps in 1-dimensional space (see [1]): what is the probability that the origin was crossed an odd number of times in the last K steps? As the expected distance from the origin is sqrt(2 * N / pi), this probability gets progressively smaller for larger N and a given ratio of N and K. For larger files, the result should be quite stable.

In order to strengthen this similarity check and be able to quantify the degree of similarity, many independent 1-bit hashes are computed and counted for each string and assembled into a bit vector of 256 bits, called the fingerprint. Each bit of the fingerprint represents the result of an independent statistical experiment. For similar strings, corresponding bits are more likely to be the same than for random strings.

For efficiency, a 64-bit hash is computed using an irreducible Rabin polynomial of degree 63. The algebraic properties of such polynomials allow for efficient calculation over a sliding window of the input [2]. As the cryptographic advantages of randomly generated hash functions are not required, a fixed polynomial has been chosen.

This 64-bit hash is expanded to 256 bits by using three bits to select 32 of the 256 bits in the fingerprint to update. So, for every 8-bit character the polynomial is updated and 32 counters are incremented or decremented. Each of the 256 counters thus represents a random walk of N / 4 steps, for a string of length N.

The similarity of A and B can now be expressed as the Hamming distance between the two bit vectors, divided by the expected distance between two random vectors. This similarity score is a number between 0 and 2, where smaller values mean the strings are more similar and values of 1 or more mean they are dissimilar.

One of the unique properties of this fingerprint is that files in different locations can be compared by transmitting only their fingerprints.
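To illustrate that final comparison, here is a small sketch, assuming each of the 256 counters contributes one fingerprint bit (its sign); the fingerprint struct and similarity_score are illustrative names, not code from the attachments. The expected Hamming distance between two random 256-bit vectors is 128, which gives the 0-to-2 range mentioned above:

  #include <stdint.h>

  /* A fingerprint is 256 bits: one sign bit per counter. */
  typedef struct { uint64_t bits[4]; } fingerprint;

  /* Similarity score in [0, 2]: Hamming distance divided by the
   * expected distance (128) between two random 256-bit vectors.
   * Smaller is more similar; about 1.0 or more means unrelated.  */
  static double similarity_score(const fingerprint *a, const fingerprint *b)
  {
      unsigned dist = 0;
      for (int i = 0; i < 4; i++) {
          uint64_t x = a->bits[i] ^ b->bits[i];
          while (x) {              /* portable popcount */
              x &= x - 1;
              dist++;
          }
      }
      return (double)dist / 128.0;
  }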
Attachments: gsimm.c, rabinpoly.c, rabinpoly.h