libxdiff and patience diff

Pierre Habouzit <madcoder@xxxxxxxxxx> · Tue, 04 Nov 2008 01:40:01 +0100

Hi Davide,

I've been working tonight, trying to make libxdiff support the patience
diff algorithm, but I've totally failed, because I _thought_ I
understood what xdl_split was doing, but it appears I don't.

[ For the readers playing at home, the patience diff algorithm is
  explained after my sig. ]

What I did is to:
(1) add a flag to the xenv in xdl_split that says that I want a
    patience "split".
(2) Then in xdl_split, if that bit is set, I compute the longest common
    subsequence from the patience diff.
(3) for each split it computes I call xdl_recs_cmp on that interval.

What I thought it would achieve is that I force some boundaries at which
libxdiff _must_ resync. Though, it seems that for some reason it doesn't
work, probably because the "ha" stuff and the boundaries absolutely
don't work the way I thought it did.

So where is the place I should do that ? I suspect it should be
partly in xprepare.c but I'm a bit stuck right now.

Any pointer on how the stuff in xprepare.c and xdiffi.c work would help
greatly, it's really not self-evident to me :)

-- 
·O·  Pierre Habouzit
··O                                                madcoder@xxxxxxxxxx
OOO                                                http://www.madism.org

Patience diff
=============

Basically, it's an algorithm that considers line from the left file A
and the right file B.

It searches for lines that are unique *IN* A and unique *IN* B as well.
Then using a clever O(n log log n) where n is that number of lines, you
extract the longer common sequence of those lines.

This defines two splits of file A and B. For each corresponding chunks,
you then reduce the lines matching at the beginning and the end, and
reiterate the algorithm on the interior of that space, until there are
_no_ unique lines at all, and then you apply a usual diff algorithm to
generate the diff in that final section.

http://alfedenzo.livejournal.com/170301.html has a nice visual
explanation of the fact, even if it forgets about the "zones" trimming
that helps for the efficiency. It's also often wrong to generate the
stacks in the order shown on the blog, because you recurse from the max
line to the min line, which is not the natural order to work on a diff.
But that's just a matter of "inverting" all the comparisons which is
pretty obvious to do.

The difference of output can be seen on http://glandium.org/blog/?p=120
where the patience diff picks the line
   "include $(topsrcdir)/config/rules.mk"
as being unique on the left and the right file, hence will use it as
sync point rather than using it in a diff.

Attachment:
pgpY0wItMWNVH.pgp

Description: PGP signature