Hi, Bram sent me two mails explaining a bit more of the details behind patience diff, and invited me to forward the content. So here you are: -- snipsnap -- From: Bram Cohen <spam-protected-as-per-brams-request@xxxxxxx> To: Johannes Schindelin <Johannes.Schindelin@xxxxxx> Sent: Wednesday, February 4, 2009 3:52:54 PM Subject: patience diff advantages/disadvantages Hey Johannes, I just did some web searching for patience search and came across the recent discussion of it. I'm not on whatever mailing list that discussion was on, but I have some thoughts on it which you might wish to share. The reverse engineered intuition that non-unique lines are 'less important' than unique lines is a correct one, and specifically the case of lots of curly bracket lines in C and END lines in Pascal and similar cases are the main motivating factor. For most common cases, any reasonable diff algorithm will give the same result. The cases where the results are quite dramatic are ones where the amount of change between the two versions is very large. The test cases being tried in recent discussion are from developers familiar with version control, who are being very judicious in their development process and intentionally trying to keep things from diverging too badly. In those cases, outputs will generally speaking always be the same and where they aren't the differences are frequently a judgement call. The really bad cases are ones where two versions have diverged dramatically and the developer isn't being careful to keep patch sizes under control. Under those circumstances a diff algorithm can occasionally become 'misaligned' in that it matches long sections of curly brackets together, but it winds up correlating the curly brackets of functions in one version with the curly brackets of the next later function in the other version. This situation is *very ugly*, and can result in a totally unusable conflict file in the situation where you need such things to be presented coherently the most. I have in fact seen it happen (in fact, it's happened to me). I don't have a sample of it, but as you can imagine such a sample case must by its nature be rather large and uninsightful - it basically just looks ugly. So 'unique lines' is a simple cross-language proxy for 'unimportant lines'. 'short lines' of course also can work, but might give false positives and it depends what you mean by 'short'. There's also an implementation and code maintainability advantage to unique lines. The patience algorithm is far simpler to understand and implement than the fancy-shmancy recent LCS algorithms. While patience sorting is far from obvious, LCS algorithms with good asymptotics are extremely opaque. I suspect that a properly implemented patience diff could be just as fast if not faster than LCS, but that would require quite a bit more algorithmic magic than I for one am prepared to design, test and debug. In very real terms the amount of time effort, and trickery the different implementations have had put into them is very different, so current speed differences aren't exactly a fair comparison (although far from irrelevant, because nobody with the necessary expertise is volunteering to spend that much time on patience diff as of yet). Some other things I forgot to mention - it isn't clear from the discussion whether you're doing the heuristic of matching beginning/ending lines when everything else doesn't match (this heuristic could also be applied first, which would be faster and almost always yield the same results). That's important so that long sections of white space get matched to each other. Also, I disagree that LCS is necessarily a good idea when matching, say 12121 to 212. What was meant by that case is highly ambiguous, and the most conservative thing is to simply give a conflict for the whole section. (If matching, say, 12121 to 1212 then the first lines match heuristic will make them mostly match with an extra 1 at the end of the first one). The most difficult real-world diff cases have to do with lots of blank lines, if we represent a blank line by 'S', then the diff from 'SSSSS' to 'SSSXSSS' is highly ambiguous - an extra S was added, but there are four reasonable places for assuming it was put in, each of which a reasonable case can be made for, and I suspect that there's an analogue to Arrow's theorem that trying to do the right thing in particular edge cases must necessarily result in intermediary cases where the criteria are in direct conflict. (My patience implementation, by the way, will match the first three lines of the five Ss to the the first three of the second set of Ss, and the last two to the last two, with the string XS inserted into the middle, which is arguably the best behavior, but could under some circumstances result in very confused three-way merges dependending on what happened on the other side). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html