On Mon, 22 Oct 2007, Andy C wrote:
>
> from diffcore-delta.c:
> "
>  * Idea here is very simple.
>  *
>  * Almost all data we are interested in are text, but sometimes we have
>  * to deal with binary data. So we cut them into chunks delimited by
>  * LF byte, or 64-byte sequence, whichever comes first, and hash them.
> "
>
> What is the similarity metric for binary files? Is it related to the
> number of 64-byte chunks they have in common? How do you know that
> the 64-byte chunks match up? Suppose I have a 10k file, and then I
> insert 10 random bytes at positions 1000, 2000, 3000, etc. How does
> that work?

The 'LF' byte will start a new window, so even with binary files
(assuming a random-enough distribution that you have *some* LF bytes!),
you basically get a synchronization event on each LF.

So the 64-byte hunks "line up" automatically, even for binary files.
Maybe not immediately, but soon enough: on average, assuming a
reasonably compressed (i.e. virtually random) binary format, a byte is
an LF with probability 1/256, so you should find an LF about every 256
bytes, which is roughly every fourth 64-byte hunk.

There are probably smarter ways to guarantee synchronization even in
the absence of certain bit patterns, but I bet the "break every 64
bytes or when you see an LF" thing works well for any reasonable binary
file too. But text files are obviously the most important case, and
diffs on binaries are somewhat hit-and-miss regardless.

		Linus
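
[Editor's note: as a concrete illustration of the chunking scheme
described above, here is a minimal sketch in C. This is not the actual
diffcore-delta.c code: the names hash_chunk, chunk_hashes, and MAX_CHUNK
are made up for this example, and FNV-1a stands in for whatever hash the
real code uses. Similarity is then estimated from how many chunk hashes
two files share; the actual accounting in diffcore-delta.c is more
detailed, but this conveys the idea.]

#include <stddef.h>

#define MAX_CHUNK 64

/* FNV-1a, a stand-in for the hash the real code uses */
static unsigned int hash_chunk(const unsigned char *buf, size_t len)
{
	unsigned int h = 2166136261u;

	while (len--) {
		h ^= *buf++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Cut buf into chunks that end at each LF byte or after 64 bytes,
 * whichever comes first, and record one hash per chunk.
 * Returns the number of chunks written to out.
 */
static size_t chunk_hashes(const unsigned char *buf, size_t size,
			   unsigned int *out)
{
	size_t start = 0, i, n = 0;

	for (i = 0; i < size; i++) {
		/* cut after an LF, or once the chunk hits 64 bytes */
		if (buf[i] == '\n' || i + 1 - start == MAX_CHUNK) {
			out[n++] = hash_chunk(buf + start, i + 1 - start);
			start = i + 1;
		}
	}
	if (start < size)	/* trailing partial chunk */
		out[n++] = hash_chunk(buf + start, size - start);
	return n;
}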
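
[Editor's note: and to make the resynchronization point concrete, a
hypothetical demo reusing chunk_hashes() from the sketch above. It
inserts one byte into the middle of 10k of pseudo-random data (the
offset 5000 and value 0x5a are arbitrary) and compares chunk hashes
from the tail; the hashes disagree only until the first LF past the
insertion point, which is exactly the synchronization event described
in the reply.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	static unsigned char a[10240], b[10240 + 1];
	static unsigned int ha[10240], hb[10240 + 1];
	size_t na, nb, i, common = 0;

	/* pseudo-random "binary" data: an LF byte turns up about
	 * once every 256 bytes, i.e. every fourth 64-byte hunk */
	srand(1);
	for (i = 0; i < sizeof(a); i++)
		a[i] = rand() & 0xff;

	/* b is a with a single byte inserted at offset 5000 */
	memcpy(b, a, 5000);
	b[5000] = 0x5a;
	memcpy(b + 5001, a + 5000, sizeof(a) - 5000);

	na = chunk_hashes(a, sizeof(a), ha);
	nb = chunk_hashes(b, sizeof(b), hb);

	/* compare from the end: chunk boundaries resynchronize at
	 * the first LF past the insert, so everything after that
	 * point hashes identically in both buffers */
	while (common < na && common < nb &&
	       ha[na - 1 - common] == hb[nb - 1 - common])
		common++;
	printf("%zu of %zu chunks at the tail hash identically\n",
	       common, na);
	return 0;
}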