On 08/04/2016 09:39 PM, Junio C Hamano wrote: > Michael Haggerty <mhagger@xxxxxxxxxxxx> writes: > >>>> + } >>>> + /* >>>> + * We have reached the end of the line without finding any non-space >>>> + * characters; i.e., the whole line consists of trailing spaces, which we >>>> + * are not interested in. >>>> + */ >>>> + return -1; > > Not related to Jacob's review, but "the whole line consists of > trailing spaces" made me read it twice; while it is technically > correct, "the whole line consists of spaces", or even "this is a > blank line", would read a lot more easily, at least for me. Hehe, yes, ETOOMANYWORDS. >> I was implicitly assuming that such lines would have text somewhere >> after those 200 spaces (or 25 TABs or whatever). But you're right, the >> line could consist only of whitespace. Unfortunately, the only way to >> distinguish these two cases is to read the rest of the line, which is >> exactly what we *don't* want to do. > > Hmm, why is it exactly what we don't want to do? Is it a > performance concern? In other words, is it because this function is > called many times to measure the same line multiple times? Yes, performance is the reason, and especially the desire to avoid unreasonable runtimes for pathological cases. Thanks for asking, though, because it's worthwhile to look into this more rigorously. Suppose there is a slider that can be shifted to any of `numshift` positions. Then we have to call `measure_split()` `2*numshift` times (once for the beginning and once for the end of each candidate slider position). Suppose there are `numblanks` blank lines in the neighborhood of the slider. Each time we call `measure_split()`, we count the number of blank lines before and after the proposed split position. So we end up calling `get_indent()` `2*numshift*numblanks` times. Finally, suppose that the blank lines each contain `numws` whitespace characters. Then each call to `get_indent()` has to do `O(numws)` work. So altogether, if there were no limits, then the amount of work to position a slider would scale like O(numshift * numblanks * numws) However, the total number of characters in the file might only be O(numblanks * numws) So without limits, the amount of work to position sliders could scale by numshift times the size of the file. The worst case is pretty easy to achieve. Just create a file consisting of a million or so LF characters, then add another LF to it. The diff would be a slider with numshift = 1000000 numblanks = 1000000 numws = 1 so the heuristic would take O(N^2) in the size of the file. Effectively the limits cap the effective `numblanks` at `2*MAX_BLANKS` (which is 2*20) and the effective `numws` at `MAX_INDENT` (which is 200), meaning that the maximum effort scales like numshift * 2*20 * 200 which is still a big number but not absurd. Empirically, for the case described above, `git diff` takes 104 ms and `git diff --indent-heuristic` takes 490 ms. I think that's not prohibitive for a pathological case. Meanwhile, Myers's algorithm scales like O(ND), where N is the number of lines and D is the edit distance, so I suppose that it is already possible to find diffs that are intractable to compute. > After > all, somebody in this file is already scanning each and every line > to see where it ends to split the input into records, so perhaps a > "right" (if the "theoretical correctness" of the return value from > this function mattered, which you wave-away below) optimization > could be to precompute it while the lines are broken into records > and store it in the "rec" structure? That would certainly be possible, and would help in cases where there are a lot of lines with lots of leading whitespace. You could get nearly the same benefit by recording a single bit in struct rec, indicating whether the line is blank or not. But it wouldn't help the worst case described above, where each call to `git_indent()` is already very cheap. And I didn't think it was worth allocating the extra memory to optimize this heuristic * which isn't used all that often in the first place, * which (for normal inputs) doesn't take a significant amount of time, and * when the optimization doesn't improve the worst-case scenario (and thus any DoS potential) I think the only way to ensure O(size_of_file) runtime in the above case would be to record, along with each line, how many blank lines immediately precede and succeed it. You could achieve something like O(size_of_file lg(size_of_file)) by storing, e.g., the total number of nonblank lines that precede each line and doing a binary search to find the nearest non-blank line. >> But I think it doesn't matter anyway. Such "text" will likely never be >> read by a human, so it's not a big deal if the slider position is not >> picked perfectly. And remember, this whole saga is just to improve the >> aesthetics of the diff. The diff is *correct* (e.g., in the sense of >> applicable) regardless of where we position the sliders. > > A better argument may be "if the user is truly reading a diff output > for such an unusual "text", it is likely that she has a very wide > display and/or running less -S, and treating such an overindented line > as if it were a blank line would give a result that is more consistent > to what appears on her display", perhaps? I don't know. It seems like a pretty contrived justification for what is basically, "your input is too weird for us. We're not going to break our necks trying to give you the best possible slider positioning." Michael -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html