On Sun, Jul 1, 2018 at 8:57 AM Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
>
> On 06/29/2018 10:28 PM, Stefan Beller wrote:
> > [...]
> > Adds some threshold to avoid expensive cases, like:
> >
> > ```
> > #!python
> > open('a', 'w').write(" \n" * 1000000)
> > open('b', 'w').write(" \n" * 1000001)
> > ```
> >
> > The indent heuristic is O(N * 20) (N = 1000000) for the above case,
> > and makes diff much slower.
> > [...]
> > +/*
> > + * For indentation heuristic, skip searching for better slide position after
> > + * checking MAX_BORING lines without finding an improvement. This defends the
> > + * indentation heuristic logic against pathological cases. The value is not
> > + * picked scientifically but should be good enough.
> > + */
> > +#define MAX_BORING 100
>
> This is an interesting case, and a speed difference of almost a factor
> of five seems impressive. But this is a pretty pathological case,
> isn't it? And I'm pretty sure that the algorithm is `O(N)` both before
> and after this change. Remember that to find `earliest_end` and
> `g.end`, there has already been a scan through all 1000000 lines. In
> other words, you're not improving how the overall algorithm scales
> with `N`; you're only changing the constant factor in front. So it's a
> little bit questionable whether it is worth complicating the code for
> this unusual case.
>
> But *if* we want to improve this case, I think that we could be
> smarter about it.
>
> By the time we get to this point in the code, we already know that
> there is a "slider" hunk of length `M` (`groupsize`) that can be slid
> up or down within a range of `N` (`g.end - earliest_end + 1`) possible
> positions. The interesting case here is `N ≫ M`, because then naively
> the number of positions to try out is a lot bigger than the size of
> the hunk itself. (In the case described above, `N` is 1000000 and `M`
> is 1.)
>
> But how can that situation even arise?
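(To make `earliest_end`/`g.end` concrete: here is a scaled-down Python
sketch, with a hypothetical `slide_bounds` helper rather than the real
xdiff code, of how the slide range, and hence `N`, is found for an
added group of lines.)

```python
# Scaled-down illustration (hypothetical `slide_bounds`, not the actual
# xdiff code) of how the slide range of an added group is found.  The
# group occupies lines[start:end]; it can slide up one line when the
# line just above it equals its last line, and down one line when the
# line just after it equals its first line.
def slide_bounds(lines, start, end):
    s, e = start, end
    while s > 0 and lines[s - 1] == lines[e - 1]:
        s, e = s - 1, e - 1
    earliest_end = e                 # end of the highest position
    s, e = start, end
    while e < len(lines) and lines[e] == lines[s]:
        s, e = s + 1, e + 1
    latest_end = e                   # end of the lowest position (g.end)
    return earliest_end, latest_end

# Miniature of " \n" * 1000001 with one added line: the group of one
# blank-ish line can slide across the whole file.
lo, hi = slide_bounds([" "] * 11, 0, 1)
print(hi - lo + 1)  # -> 11, i.e. N equals the file length
```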
> Remember, a hunk can only be slid down by a line if the first line
> *after* the hunk is identical to the first line *of* the hunk. It
> follows that if you shift a hunk down `M` lines, then it has the same
> contents as when you started—you've just rotated all of the hunk
> lines around once.
>
> So if `N ≫ M`, there is necessarily a lot of repetition among the
> `N + M` lines that the hunk could possibly overlay. Specifically, it
> must consist of `floor((N + M)/M)` identical copies of the hunk, plus
> possibly a few leftover lines constituting the start of another
> repetition.
>
> Given this large amount of repetition, it seems to me that there is
> never a need to scan more than the bottom `M + 1` possible positions
> [1] plus the highest possible position [2] to be sure of finding the
> very best one. In the pathological case that you described above,
> where `M` is 1, only three positions have to be evaluated, not 100.
>
> In fact, it *could* be that there is even more repetition, namely if
> the hunk itself contains multiple copies of an even shorter block of
> `K` lines. In that case, you would only have to scan `K + 1` positions
> at the bottom plus one at the top to be sure to find the best hunk
> position. This would be an interesting optimization for a case like
>
> > open('a', 'w').write(" \n" * 1000000)
> > open('b', 'w').write(" \n" * 1100000)
>
> (`N = 1000000`, `M = 100000`, `K = 1`) or
>
> > open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
> > open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)
>
> (`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
> entirely trivial to find periodicity in a group of lines (i.e., to
> find `K`), and I don't know offhand how that task scales with `M`.
>
> Michael
>
> [1] Actually, to be rigorously correct it might be necessary to check
> even a bit more than `M + 1` positions at the bottom because the
> heuristic looks a bit beyond the lines of the hunk.
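For what it's worth, finding `K` is tractable: the smallest period of
the `M` hunk lines can be computed in `O(M)` time with the standard KMP
failure function. A minimal sketch (`smallest_period` is a hypothetical
name, and lines are compared as whole strings):

```python
# Find the period K of a group of lines via the KMP failure function.
# Runs in O(M) time for M lines.
def smallest_period(lines):
    n = len(lines)
    if n == 0:
        return 0
    fail = [0] * n          # fail[i]: longest proper border of lines[:i+1]
    k = 0
    for i in range(1, n):
        while k > 0 and lines[i] != lines[k]:
            k = fail[k - 1]
        if lines[i] == lines[k]:
            k += 1
        fail[i] = k
    p = n - fail[n - 1]             # candidate period
    return p if n % p == 0 else n   # not fully periodic -> K = M

print(smallest_period(["<item>", "MISSING", "</item>"] * 5))  # -> 3
```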
> [2] The position at the top has different predecessor lines than the
> other positions, so it could have a lower score than all of the
> others. It's worth checking it. Here too, to be rigorously correct it
> might be necessary to check more than one position at the top because
> the heuristic looks a bit beyond the lines of the hunk.

So this suggests making MAX_BORING something like "hunk size + some
small constant offset"?

Stefan
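P.S.: to make that bound concrete, here is a rough sketch (hypothetical
`candidate_positions`, not a patch) of the candidate set that "hunk
size + some small constant offset" would leave to score; with `K = 1`
and no slack it reproduces the three-position count from above.

```python
# With a slider whose lines repeat with period K, only the single top
# position plus the bottom K + 1 positions need to be scored, instead
# of all N.  `slack` stands in for the "small constant offset", since
# the heuristic also looks a bit beyond the lines of the hunk.
def candidate_positions(earliest_end, latest_end, period, slack=1):
    cands = {earliest_end}              # the top position
    for i in range(period + 1 + slack):
        pos = latest_end - i
        if pos < earliest_end:
            break
        cands.add(pos)
    return sorted(cands)

# The pathological case above (M = K = 1): three candidates, not a million.
print(candidate_positions(1, 1000001, 1, slack=0))  # -> [1, 1000000, 1000001]
```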