Bram Cohen speaks up about patience diff

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 5 Feb 2009 21:19:54 +0100 (CET)

Hi,

Bram sent me two mails explaining a bit more of the details behind 
patience diff, and invited me to forward the content.  So here you are:

-- snipsnap --
From: Bram Cohen <spam-protected-as-per-brams-request@xxxxxxx>
To: Johannes Schindelin <Johannes.Schindelin@xxxxxx>
Sent: Wednesday, February 4, 2009 3:52:54 PM
Subject: patience diff advantages/disadvantages

Hey Johannes, I just did some web searching for patience search and came 
across the recent discussion of it. I'm not on whatever mailing list that 
discussion was on, but I have some thoughts on it which you might wish to 
share.

The reverse engineered intuition that non-unique lines are 'less 
important' than unique lines is a correct one, and specifically the case 
of lots of curly bracket lines in C and END lines in Pascal and similar 
cases are the main motivating factor.

For most common cases, any reasonable diff algorithm will give the same 
result. The cases where the results are quite dramatic are ones where the 
amount of change between the two versions is very large. The test cases 
being tried in recent discussion are from developers familiar with version 
control, who are being very judicious in their development process and 
intentionally trying to keep things from diverging too badly. In those 
cases, outputs will generally speaking always be the same and where they 
aren't the differences are frequently a judgement call.

The really bad cases are ones where two versions have diverged 
dramatically and the developer isn't being careful to keep patch sizes 
under control. Under those circumstances a diff algorithm can occasionally 
become 'misaligned' in that it matches long sections of curly brackets 
together, but it winds up correlating the curly brackets of functions in 
one version with the curly brackets of the next later function in the 
other version. This situation is *very ugly*, and can result in a totally 
unusable conflict file in the situation where you need such things to be 
presented coherently the most. I have in fact seen it happen (in fact, 
it's happened to me). I don't have a sample of it, but as you can imagine 
such a sample case must by its nature be rather large and uninsightful - 
it basically just looks ugly.

So 'unique lines' is a simple cross-language proxy for 'unimportant 
lines'. 'short lines' of course also can work, but might give false 
positives and it depends what you mean by 'short'. There's also an 
implementation and code maintainability advantage to unique lines. The 
patience algorithm is far simpler to understand and implement than the 
fancy-shmancy recent LCS algorithms. While patience sorting is far from 
obvious, LCS algorithms with good asymptotics are extremely opaque. I 
suspect that a properly implemented patience diff could be just as fast if 
not faster than LCS, but that would require quite a bit more algorithmic 
magic than I for one am prepared to design, test and debug. In very real 
terms the amount of time effort, and trickery the different 
implementations have had put into them is very different, so current speed 
differences aren't exactly a fair comparison (although far from 
irrelevant, because nobody with the necessary expertise is volunteering to 
spend that much time on patience diff as of yet).

Some other things I forgot to mention - it isn't clear from the discussion 
whether you're doing the heuristic of matching beginning/ending lines when 
everything else doesn't match (this heuristic could also be applied first, 
which would be faster and almost always yield the same results). That's 
important so that long sections of white space get matched to each other. 
Also, I disagree that LCS is necessarily a good idea when matching, say 
12121 to 212. What was meant by that case is highly ambiguous, and the 
most conservative thing is to simply give a conflict for the whole 
section. (If matching, say, 12121 to 1212 then the first lines match 
heuristic will make them mostly match with an extra 1 at the end of the 
first one).

The most difficult real-world diff cases have to do with lots of blank 
lines, if we represent a blank line by 'S', then the diff from 'SSSSS' to 
'SSSXSSS' is highly ambiguous - an extra S was added, but there are four 
reasonable places for assuming it was put in, each of which a reasonable 
case can be made for, and I suspect that there's an analogue to Arrow's 
theorem that trying to do the right thing in particular edge cases must 
necessarily result in intermediary cases where the criteria are in direct 
conflict. (My patience implementation, by the way, will match the first 
three lines of the five Ss to the the first three of the second set of Ss, 
and the last two to the last two, with the string XS inserted into the 
middle, which is arguably the best behavior, but could under some 
circumstances result in very confused three-way merges dependending on 
what happened on the other side).

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html