Re: libxdiff and patience diff

On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote:
> Hi,
> 
> On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> 
> > I've been working tonight, trying to make libxdiff support the patience 
> > diff algorithm, but I've totally failed, because I _thought_ I 
> > understood what xdl_split was doing, but it appears I don't.
> 
> I thought about it, too, at the GitTogether, although I want to finish my 
> jGit diff first.
> 
> The main idea I had about patience diff is that you can reuse a lot of 
> functions in libxdiff.
> 
> But the requirement of taking just unique lines/hashes into account, and 
> _after_ splitting up, _again_ determine uniqueness _just_ in the 
> between-hunk part made me think that it may be wiser to have a separate 
> function for the patience diff stuff.
> 
> Basically, you would start to have libxdiff split the lines and hash them 
> as normal, but then determine the unique hashes (I'd start with the 
> smaller text, just to have a better chance to end up with unique hashes).
> 
> Once they are determined, you can search for those exact lines (hash 
> first) in the post-document.

Actually my current implementation just puts all the hashes into an
array and sorts them by hash (which is O(n log n)), carrying along the
position in the left or right file each hash comes from, ordered by
increasing right position (the struct has its left pos set to -1 if the
right pos is set, and vice versa).

Then I scan the array for runs of exactly two consecutive equal hashes,
and collapse each run into the proper {left pos, right pos} tuple if it
was indeed a unique line in both files.
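A minimal C sketch of those two steps, assuming a toy `rec` struct
rather than libxdiff's real record types (the field names and the
`collapse_unique` helper are mine, for illustration only): each line
hash is stored with its position in one file and -1 for the other; after
sorting by hash, a hash that occurs exactly twice, once per side, marks
a line unique to both files.

```c
#include <assert.h>
#include <stdlib.h>

struct rec {
	unsigned long hash;	/* line hash (toy values here) */
	long left;		/* position in left file, or -1 */
	long right;		/* position in right file, or -1 */
};

static int cmp_hash(const void *a, const void *b)
{
	const struct rec *x = a, *y = b;

	if (x->hash != y->hash)
		return x->hash < y->hash ? -1 : 1;
	return 0;
}

/*
 * Collapse runs of equal hashes: keep only {left, right} pairs whose
 * hash occurs exactly once in each file.  Returns the number of pairs
 * compacted to the front of the array.
 */
static size_t collapse_unique(struct rec *recs, size_t n)
{
	size_t out = 0, i = 0;

	while (i < n) {
		size_t j = i;

		while (j < n && recs[j].hash == recs[i].hash)
			j++;
		if (j - i == 2 &&
		    (recs[i].left >= 0) != (recs[i + 1].left >= 0)) {
			struct rec pair = recs[i];

			if (pair.left < 0)
				pair.left = recs[i + 1].left;
			else
				pair.right = recs[i + 1].right;
			recs[out++] = pair;
		}
		i = j;
	}
	return out;
}
```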

This results in an array that I then sort again by right pos, and we can
run the stack sorting on that, which gives me my LCS.
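The stack ("patience") sorting step can be sketched like this, assuming
the unique pairs are already sorted by increasing right position (the
`patience_lcs_len` name and signature are made up for this sketch): each
left position is placed on the leftmost pile whose top is >= it, found
by binary search, and the number of piles is the LCS length. Recovering
the actual subsequence would additionally need a backpointer per
placement; this sketch only computes the length.

```c
#include <assert.h>
#include <stddef.h>

/*
 * lefts[] holds the left positions of the unique pairs, listed in
 * increasing right position; tops[] must have room for n entries.
 * Returns the length of the longest common subsequence.
 */
static size_t patience_lcs_len(const long *lefts, size_t n, long *tops)
{
	size_t piles = 0;

	for (size_t i = 0; i < n; i++) {
		size_t lo = 0, hi = piles;

		/* binary search for the leftmost pile with top >= lefts[i] */
		while (lo < hi) {
			size_t mid = (lo + hi) / 2;

			if (tops[mid] < lefts[i])
				lo = mid + 1;
			else
				hi = mid;
		}
		tops[lo] = lefts[i];	/* place the card on that pile */
		if (lo == piles)
			piles++;	/* started a new pile */
	}
	return piles;
}
```

The binary search is what keeps the whole step at O(n log n) instead of
the quadratic cost of a naive pile scan.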


This is the complete brute-force algorithm. It requires a temporary
array whose size is the number of lines of the left file plus the right
one, and a temporary array for the stacks, which in the worst case may
end up roughly as large as the smaller of the two files' line counts,
I'd say.

Then I just remember a list of split points, and I recurse into all the
sub-splits again. It has a fixed point which may or may not need
libxdiff recursion inside it.
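The recursion step could look roughly like the following sketch: given
the LCS as matched {left, right} pairs, the regions between consecutive
matches are the sub-problems to recurse into (falling back to classic
libxdiff when a region has no unique common lines). The `struct gap`
and `split_gaps` helper are invented here purely to illustrate the
enumeration of those regions.

```c
#include <assert.h>
#include <stddef.h>

struct match { long left, right; };	/* one LCS element */
struct gap { long l0, l1, r0, r1; };	/* half-open line ranges */

/*
 * Enumerate the regions between consecutive matches (plus the leading
 * and trailing regions).  m[] is the LCS sorted by position, nleft and
 * nright are the total line counts; returns the number of gaps written.
 */
static size_t split_gaps(const struct match *m, size_t nm,
			 long nleft, long nright, struct gap *out)
{
	size_t ng = 0;
	long pl = 0, pr = 0;

	for (size_t i = 0; i <= nm; i++) {
		long l = (i < nm) ? m[i].left : nleft;
		long r = (i < nm) ? m[i].right : nright;

		if (l > pl || r > pr)	/* non-empty region to recurse into */
			out[ng++] = (struct gap){ pl, l, pr, r };
		if (i < nm) {
			pl = l + 1;	/* skip past the matched line */
			pr = r + 1;
		}
	}
	return ng;
}
```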

This code is actually written, naive and unoptimized, but it doesn't
work, probably because I didn't plug it in at the proper place :)

> Once that is done, you'd have to find the longest common subsequence, 
> which you could do using the existing infrastructure, but that would 
> require more work (as we already know the lines are unique).

Patience diff gives you the algorithm to do that; it's pretty simple,
and it's more efficient than the current infrastructure (in time, I
don't know about space though).

> After that, you would have to recurse to the same algorithm _between_ 
> known chunks.  Eventually, that would have to resort to classical libxdiff 
> (if there are no, or not enough, unique lines).

Yeah, that's the point. The problem is, I believe more and more that I
should prepare the LCS for patience diff in xprepare.c, but I grok
absolutely nothing of what chastore_t and the related stuff do. I
understand it's about hashing, but exactly what it does eludes me. In
fact, when I look at the records I have in xdiffi.c, I had the
impression they were already somehow collapsed, which makes it too late
a point to apply the patience diff ...

-- 
·O·  Pierre Habouzit
··O                                                madcoder@xxxxxxxxxx
OOO                                                http://www.madism.org

Attachment: pgpamVhpDm3sk.pgp
Description: PGP signature

