Re: [PATCH 0/3] Teach Git about the patience diff algorithm

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:


> On Fri, 2 Jan 2009, Clemens Buchacher wrote:
> >
> > Only two choices, and I still get it wrong. The diffs should be 
> > labeled the other way around, of course.
> 
> Yes, this one is a real patience diff change, but it's also the same one 
> that I've seen in the google fanboi findings. What google did _not_ show 
> was any real-life examples, or anybody doing any critical analysis.

FWIW it's the test case in the commit introducing the --patience option.

> So I was hoping for something else than a single "in this case patience 
> diff works really well". I was hoping to see what it does in real life.

I will dig out a real-world example where I _know_ patience diff would 
have helped.  (I remember rewriting a pretty large diff that was sent to 
this list just to understand what it actually did, and I am pretty 
certain it would make a good real-world showcase.)

But yes, I agree, the thing does not matter _all_ that much in reality.

The case where I expect a real improvement is when you modify a function 
and insert a new function just before it; Myers' algorithm then matches 
mainly empty lines and lines ending in curly brackets.

In other words: something I tried to tackle with ee95ec5d (xdl_merge(): 
introduce XDL_MERGE_ZEALOUS_ALNUM) for merges.

The typical look of such a diff is something like this:

-<... some function header ...>
+<... a completely different function header ...>
 {
-	<... variables ...>
+	<... other variables ...>
 
 	for (i = 0; i < 10; i++) {
-		<... some code ...>
+		<... some code doing something completely different ...>
 	}
 
 	return 0;
 }

+<... the function header which was removed earlier ...>
+{
 	<... refactored _and also reindented_ code ...>

> And I haven't seen _any_ real critical analysis of it. Anywhere.

Neither have I.  Let alone something close to documentation.

For example, when the "patience diff algorithm" is explained, it looks 
more like a longest common subsequence algorithm applied to input that is 
already sorted by the first item.
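
To illustrate (a rough sketch, not the actual xdiff code, and the names 
are made up): keep only the lines that occur exactly once in both files, 
and list them in the order in which they appear in the first file.  The 
"patience" step is then just a longest increasing subsequence over their 
positions in the second file.

/*
 * A line that is unique in both files.  The pairs are kept in the
 * order of pos_a, i.e. already sorted by the first item.
 */
struct unique_pair {
	long pos_a;	/* line number in the first file */
	long pos_b;	/* line number in the second file */
};

The pairs picked by that longest increasing subsequence become anchors, 
and the ranges between two anchors are handled recursively, falling back 
to the ordinary diff when no common unique lines are left.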

Further, there is no rigorous analysis of the runtime (I figured that the 
original runtime is O(nm) where "n" is the number of lines and "m" is the 
length of the maximal ordered sequence of common unique lines, and my 
implementation can only improve that to O(n log(m))).

This could be improved, I think, for the most common case where you have 
pretty long common _contiguous_ sequences of unique lines, i.e. large 
ranges of lines that are identical in both files.

The runtime is especially horrible compared to that of Myers' algorithm, 
which runs in O(nd), where d is the edit distance, i.e. the number of 
lines added plus the number of lines removed.  (Note: in the real world, 
there are substantial speedups for consecutive edits, i.e. line ranges 
that contain no common lines at all.)
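
For what it is worth, the log(m) factor comes from the usual 
patience-sorting trick: while sweeping the pairs in pos_a order, keep, 
for every length found so far, the smallest pos_b that ends an increasing 
run of that length, and binary-search in that array.  A sketch, again 
with made-up names and not the code I actually posted:

#include <stdlib.h>

/*
 * Length of the longest run of pairs with strictly increasing pos_b,
 * given nr pairs already sorted by pos_a.
 */
static int longest_increasing_count(const struct unique_pair *p, int nr)
{
	/* tails[k] = smallest pos_b that ends a run of length k + 1 */
	long *tails = malloc(nr * sizeof(*tails));
	int len = 0, i;

	for (i = 0; i < nr; i++) {
		int lo = 0, hi = len;

		/* find the first entry that is >= p[i].pos_b */
		while (lo < hi) {
			int mid = (lo + hi) / 2;
			if (tails[mid] < p[i].pos_b)
				lo = mid + 1;
			else
				hi = mid;
		}
		tails[lo] = p[i].pos_b;
		if (lo == len)
			len++;	/* extended the longest run so far */
	}
	free(tails);
	return len;
}

The binary search is bounded by the length of the longest run found so 
far, i.e. by the "m" above, and it runs once per pair, which is where the 
O(n log(m)) comes from.  To recover the actual anchors you would also 
record, for each pair, the pair preceding it in its run; that does not 
change the complexity.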

Also, I am less than thrilled by the job the fanbois did coming up with 
"convincing" evidence: exactly as you pointed out, there are _no_ 
real-world examples where it helps.

And the worst part: one can only _guess_ what motivated patience diff.  I 
imagine it came from the observation that function headers are unique, and 
that you usually want to preserve as much context around them as possible.

Ciao,
Dscho




