Hi,

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote:
>
> > On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> >
> > > I've been working tonight, trying to make libxdiff support the
> > > patience diff algorithm, but I've totally failed, because I
> > > _thought_ I understood what xdl_split was doing, but it appears I
> > > don't.
> >
> > I thought about it, too, at the GitTogether, although I want to
> > finish my jGit diff first.
> >
> > The main idea I had about patience diff is that you can reuse a lot
> > of functions in libxdiff.
> >
> > But the requirement of taking just unique lines/hashes into account,
> > and _after_ splitting up, _again_ determining uniqueness _just_ in
> > the between-hunk part, made me think that it may be wiser to have a
> > separate function for the patience diff stuff.
> >
> > Basically, you would start by having libxdiff split the lines and
> > hash them as normal, but then determine the unique hashes (I'd start
> > with the smaller text, just to have a better chance of ending up
> > with unique hashes).
> >
> > Once they are determined, you can search for those exact lines (hash
> > first) in the post-document.
>
> Actually my current implementation just puts all the hashes into an
> array and sorts them by hash (which is O(n log n)), each entry
> carrying the position in the left or right file it comes from, with
> ties ordered by increasing right position (and the struct has its
> left pos set to -1 if the right pos is set, and vice versa).

Yeah, that would be much more efficient using a hash-multiset.  Still
linear size (albeit twice as much).  And you can already discard double
entries early (although you have to keep one pointer in the
hash-multiset to prevent other identical lines from being misdetected
as "unique").

> Then I scan the array to find runs of exactly two consecutive equal
> hashes, and collapse each into the proper {left pos, right pos} tuple
> if the line was indeed unique in both files.
>
> This results in an array that I then sort again by right pos, and we
> can work on that for the stack sorting; I do that, and then I have my
> LCS.
>
> This is the complete brute-force algorithm, which requires a temporary
> array of the size of the number of lines on the left plus the right,
> and a temporary array for the stacks which _may_ end up being as large
> as the smaller of the number of lines in the left or right file in the
> worst case, I'd say (roughly).
>
> Then I just remember a list of split points, and I recurse into all
> the sub-splits again.  It has a fixed point which may or may not need
> libxdiff recursion in it.

I am not sure that you really end up with patience diff that way.  I
_think_ that you need to determine the longest sequence of unique lines
which has the property of being ordered in both texts first, and only
_then_ recurse into the not-yet-handled lines.

> > Once that is done, you'd have to find the longest common
> > subsequence, which you could do using the existing infrastructure,
> > but that would require more work (as we already know the lines are
> > unique).
>
> Patience diff gives you the algorithm to do that; it's pretty simple,
> and it is more efficient than the current infrastructure (in time; I
> don't know about space, though).

Actually, IIRC it is pretty easy to see that the time complexity is
linear (and therefore, the space complexity, too).
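To illustrate what I mean by "the longest sequence ordered in both
texts": a rough sketch of that stack-sorting step, assuming the unique
matching lines are already collapsed into {left pos, right pos} pairs
sorted by left pos.  Note that "struct match" and patience_lis() are
names made up for this sketch, not existing libxdiff API:

#include <stdlib.h>

/* A unique line that occurs exactly once in both files. */
struct match {
	long pos1, pos2;       /* line numbers in file 1 and file 2 */
	struct match *prev;    /* predecessor in the best sequence */
};

/*
 * Given n matches sorted by pos1, find the longest subsequence that
 * is also increasing in pos2 (patience sorting): each match goes onto
 * the leftmost stack whose top has a larger pos2, remembering the top
 * of the stack to its left as its predecessor.
 */
static struct match *patience_lis(struct match *m, int n)
{
	struct match **stacks = malloc(n * sizeof(*stacks));
	int nstacks = 0, i;
	struct match *best;

	for (i = 0; i < n; i++) {
		/* binary search: leftmost stack whose top has pos2 > m[i].pos2 */
		int lo = 0, hi = nstacks;
		while (lo < hi) {
			int mid = (lo + hi) / 2;
			if (stacks[mid]->pos2 < m[i].pos2)
				lo = mid + 1;
			else
				hi = mid;
		}
		m[i].prev = lo ? stacks[lo - 1] : NULL;
		stacks[lo] = &m[i];
		if (lo == nstacks)
			nstacks++;
	}
	best = nstacks ? stacks[nstacks - 1] : NULL;
	free(stacks);
	return best;
}

Walking the ->prev pointers from the returned match recovers the common
subsequence in reverse, and everything between consecutive pairs is
what you then recurse into.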
> > After that, you would have to recurse to the same algorithm
> > _between_ known chunks.  Eventually, that would have to resort to
> > classical libxdiff (if there are no, or not enough, unique lines).
>
> Yeah, that's the point.  The problem is, I believe more and more that
> I should prepare the LCS for patience diff in xprepare.c, but I grok
> absolutely nothing of what the chastore_t and similar stuff is.  I
> understand it's about hashing, but the exact stuff it does eludes me.

Yes, I do not like the short and unintuitive names either.  AFAIU
chastore_t is just a generic extensible array of elements that have
size "isize", and initially there are "icount" of them.

> In fact, when I look at the records I have in xdiffi.c, I had the
> impression they were already somehow collapsed, which makes it too
> late a point to apply the patience diff ...

AFAICS xdiffi.c contains the classical diff algorithm (incidentally,
the inventor of that algorithm is about 50 meters away from me at this
very moment).  It should not have anything of interest to you, except
for the fall-back case.

So I think that you should add a new file xpatience.c.  In that, I'd
implement that hash multi-set, and use a prepared xdfenv_t to fill it
(smaller file first; then you can traverse the other file, checking for
uniqueness in that file and for a match in the other file at the same
time).

You _could_ build the longest list of ordered pairs at the same time,
too, but that may make the code a bit too complex.

Ciao,
Dscho
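P.S.: In case it helps, a rough sketch of what I mean by that hash
multi-set.  All names here are made up for illustration (this is not
how xdfenv_t is actually laid out or traversed); the important detail
is that an entry is kept even once its count exceeds one, so that a
later identical line cannot be misdetected as unique:

#include <stdlib.h>

/* One entry per distinct line hash seen so far. */
struct hash_entry {
	unsigned long ha;      /* line hash */
	long line1, line2;     /* line number in file 1/2, -1 if absent */
	int count1, count2;    /* occurrences in file 1/2 */
	struct hash_entry *next;
};

#define NUM_BUCKETS (1 << 16)

static struct hash_entry *buckets[NUM_BUCKETS];

static struct hash_entry *lookup(unsigned long ha)
{
	struct hash_entry *e;

	for (e = buckets[ha % NUM_BUCKETS]; e; e = e->next)
		if (e->ha == ha)
			return e;
	return NULL;
}

/* Record one line of file "which" (1 or 2) with hash "ha" at "line". */
static void insert(unsigned long ha, int which, long line)
{
	struct hash_entry *e = lookup(ha);

	if (!e) {
		e = calloc(1, sizeof(*e));
		e->ha = ha;
		e->line1 = e->line2 = -1;
		e->next = buckets[ha % NUM_BUCKETS];
		buckets[ha % NUM_BUCKETS] = e;
	}
	/* keep the entry around even when the line is no longer unique */
	if (which == 1) {
		e->count1++;
		e->line1 = line;
	} else {
		e->count2++;
		e->line2 = line;
	}
}

/* A line is usable for patience diff if it occurs exactly once in both. */
static int is_unique_pair(const struct hash_entry *e)
{
	return e && e->count1 == 1 && e->count2 == 1;
}

You would fill this from the smaller file first, then traverse the
other file and collect the entries for which is_unique_pair() holds.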