On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
It might be that the fast delta thing is a good way to ask "is this even worth considering", to cut down the O(m*n) rename/copy detection to something much smaller, and then use xdelta() to actually figure out what is a good rename and what isn't from a much smaller set of potential targets.
Here's a possible way to do that first cut. Basically, compute a short (256-bit) fingerprint for each file, such that the Hamming distance between two fingerprints is a measure of their similarity. I'll include a draft write-up below.

My initial implementation seems reasonably fast and works great for 4000 (decompressed) files (25M) randomly plucked from an old git.git repository without packs. It works OK for comparing tar archives of GCC releases, but there it becomes clear that the random walks aren't so random anymore and become dominated by repeated information, such as tar headers. Speed is about 10MB/sec on my PowerBook, but one could cache fingerprints so they only need to be computed once.

The nice thing is that one can quickly find similar files using only the fingerprint (and, in practice, the file size), no filenames: this seems to fit the git model well.

I'll attach my test implementation below; it uses David Mazieres' Rabinpoly code and D. Phillips's fls code. Please don't mind my C coding, it's not my native language. Also, this may have some Darwinisms, although it should work on Linux too.

-Geert

Estimating Similarity

For estimating similarity between strings A and B, let SA and SB be the collections of all substrings of length W of A and B. Similarity is then defined as the ratio of the intersection and the union of SA and SB.

The length W of these substrings is the window size, and here it is chosen somewhat arbitrarily to be 48. The idea is to make them not so short that all context is lost (as with counting symbol frequencies), but not so long that a few small changes can affect a large portion of the substrings. Of course, a single symbol change may affect up to 48 substrings.

Let "&" be the string concatenation operator. If A = S2 & S1 & S2 & S3 & S2 and B = S2 & S3 & S2 & S1 & S2, then if the length of S2 is at least W - 1, the strings will have the same set of substrings and will be considered equal for the purpose of similarity checking. This behavior is actually welcome, since reordering sufficiently separated pieces of a document does not make it substantially different.

Instead of computing the ratio of identical substrings directly, compute a 1-bit hash for each substring and calculate the difference between the number of zeroes and ones. If the hashes appear random, this difference follows a binomial distribution. Two files are considered "likely similar" if their differences have the same sign.

The assumption that the hashes are randomly distributed is not true if there are many repeated substrings. For most applications it is sufficient to ignore such repetitions (by using a small cache of recently encountered hashes), as they do not convey much actual information. For example, for the purpose of finding small deltas between strings, duplicating existing text will not significantly increase the delta.
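To make the zero/one counting concrete, here is a rough sketch in C. The names and the FNV-style stand-in hash are mine, not from the attached gsimm.c, which instead keeps a Rabin polynomial hash incrementally over a sliding window (rabinpoly.c) rather than rehashing every substring, and which could also skip repeated hashes via a small cache as suggested above:

  #include <stddef.h>
  #include <stdint.h>

  #define WINDOW 48

  /* Placeholder 1-bit hash of one WINDOW-byte substring.  A real
   * implementation would update a sliding Rabin polynomial hash per
   * character instead of rehashing all WINDOW bytes at each position. */
  static unsigned hash_bit(const unsigned char *p)
  {
      uint64_t h = 14695981039346656037ull;     /* FNV-1a, stand-in only */
      for (size_t i = 0; i < WINDOW; i++)
          h = (h ^ p[i]) * 1099511628211ull;
      return (unsigned)(h & 1);
  }

  /* Difference between the number of 1-hashes and 0-hashes over all
   * length-WINDOW substrings.  Two files whose differences have the
   * same sign are "likely similar" in the sense described above.  */
  static long ones_minus_zeroes(const unsigned char *buf, size_t len)
  {
      long diff = 0;
      if (len < WINDOW)
          return 0;
      for (size_t i = 0; i + WINDOW <= len; i++)
          diff += hash_bit(buf + i) ? 1 : -1;
      return diff;
  }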
For a string with N substrings, of which K changed, perform a random walk of N steps in 1-dimensional space (see [1]): what is the probability that the origin was crossed an odd number of times in the last K steps? As the expected distance from the origin is sqrt(2 * N / pi), this probability gets progressively smaller for larger N and a given ratio of N and K. For larger files, the result should be quite stable.

In order to strengthen this similarity check and be able to quantify the degree of similarity, many independent 1-bit hashes are computed and counted for each string and assembled into a bit vector of 256 bits, called the fingerprint. Each bit of the fingerprint represents the result of an independent statistical experiment. For similar strings, corresponding bits are more likely to be the same than for random strings.

For efficiency, a 64-bit hash is computed using an irreducible Rabin polynomial of degree 63. The algebraic properties of such polynomials allow for efficient calculation over a sliding window of the input [2]. As the cryptographic advantages of randomly generated hash functions are not required, a fixed polynomial has been chosen.

This 64-bit hash is expanded to 256 bits by using three bits to select 32 of the 256 bits in the fingerprint to update. So, for every 8-bit character the polynomial is updated and 32 counters are incremented or decremented. Each of the 256 counters thus represents a random walk of N / 4 steps, for a string of length N.

The similarity of A and B can now be expressed as the Hamming distance between the two bit vectors, divided by the expected distance between two random vectors. This similarity score is a number between 0 and 2, where smaller values mean the strings are more similar and values of 1 or more mean they are dissimilar.

One of the unique properties of this fingerprint is that files in different locations can be compared by transmitting only their fingerprints.
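To illustrate that final comparison, here is a small sketch, assuming each of the 256 counters contributes one fingerprint bit (its sign); the fingerprint struct and similarity_score are illustrative names, not code from the attachments. The expected Hamming distance between two random 256-bit vectors is 128, which gives the 0-to-2 range mentioned above:

  #include <stdint.h>

  /* A fingerprint is 256 bits: one sign bit per counter. */
  typedef struct { uint64_t bits[4]; } fingerprint;

  /* Similarity score in [0, 2]: Hamming distance divided by the
   * expected distance (128) between two random 256-bit vectors.
   * Smaller is more similar; about 1.0 or more means unrelated.  */
  static double similarity_score(const fingerprint *a, const fingerprint *b)
  {
      unsigned dist = 0;
      for (int i = 0; i < 4; i++) {
          uint64_t x = a->bits[i] ^ b->bits[i];
          while (x) {              /* portable popcount */
              x &= x - 1;
              dist++;
          }
      }
      return (double)dist / 128.0;
  }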
Attachments: gsimm.c, rabinpoly.c, rabinpoly.h