[RFD] Combine diff improvements

Antoine Pelisse <apelisse@xxxxxxxxx> · Thu, 14 Mar 2013 23:32:35 +0100

Hi,
I've been working on combine diff recently and have seen that the current
implementation does not produce optimal results for coalescing lost lines.

I would like to discuss some improvements.

I think we can define an optimal solution as the shortest diff
output. That is that we coalesced as much line as we could.
It means that if each parent has the same file removed, the output will
be n lines long (all lines coalesce to each others).

The current implementation doesn't look for optimal solution but for the
easiest/greedy solution. It also stops matching with a parent if it
can't find a match. As an example, it means that losing [1, 2, 3, 4] from
parent A, and [3, 1, 2, 4] from parent B would give:
- 1
- 2
--3
- 4
 -1
 -2
 -4

While if we invert the two parents order we will have:
- 3
--1
--2
- 4
 -3
 -4

While clearly, in both cases, the optimal solution would be close to:
- 3
--1
--2
 -3
--4

Let's say we have only one empty file (deleted), with p parents. In each
parent, the file was n lines long, but the lines don't have to be the
same.

It clearly looks like an extension/variation of the Longest Common
Subsequence (LCS) problem, but with p sequences (and the common
subsequences doesn't have to be common to all sequences).

LCS dynamic programming is O(n²) in time and space complexity for two
sequences.

We could extend it this way:
After processing two of the p parents, let's find LCS and build a new
optimal list of removals. Then find LCS again between this new list and
the third parent removals. And repeat until we made each parents.

Best-case analysis:
All p parents have the same n lines.
We will find LCS and provide a n lines (the same lines) new list in
O(n²), and then run it again in O(n²) with the next parent, etc.
It will end-up being O(pn²).

Worst-case analysis:
All p parents have no lines in common.
We will find LCS and provide a 2n new list in O(n²).
Then we run it again in O(2n x n), and again O(3n x n), etc, until
O(pn x n).
When we sum these all, we end-up with O(p² x n²)

Does it make any sense ? And is it worth implementing ?

Cheers,
Antoine
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html