On Tue, Aug 15, 2017 at 12:54 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>
>> The existing behavior of diff --color-moved=zebra does not define the
>> minimum size of a block at all, instead relying on a heuristic applied
>> later to filter out sets of adjacent moved lines that are shorter than
>> 3 lines long. This can be confusing, because a block could thus be
>> colored as moved at the source but not at the destination (or vice
>> versa), depending on its neighbors.
>>
>> Instead, teach diff that the minimum size of a block is 10
>> non-whitespace characters. This allows diff to still exclude
>> uninteresting lines appearing on their own (such as those solely
>> consisting of one or a few closing braces), as was the intention of
>> the adjacent-moved-line heuristic.
>
> I recall that there is a logic backed by a similar rationale in
> blame.c::blame_entry_score() but over there we count alnum, not
> !isspace, to judge if a block has been split into too small a piece
> to be significant. I do not know which one is better, but if there
> is no strong reason, perhaps we want to unify the two, so that we
> can improve both heuristics at the same time?

In an ideal world we would use the entropy of the diffed characters, as
that is the best approximation of how much "interesting" activity is
going on in that particular diff. Computing the entropy is cumbersome,
but maybe acceptable for this purpose (we're most likely IO bound
anyway, especially when counting the human's understanding as the last
part of the IO).

Reasons that may influence the choice:
* Non-Latin/ASCII characters should also work.
* Do we care about the Whitespace esoteric programming language?

The function blame_entry_score() is documented as addressing the exact
problem we are trying to solve here, so I agree we should have a common
heuristic.

Stefan
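
P.S.: For illustration, a minimal sketch of what scoring a block by the
Shannon entropy of its non-whitespace bytes could look like. This is not
code from git.git; block_score() and its parameters are made-up names,
and it works on raw bytes, so multi-byte UTF-8 characters contribute one
count per byte rather than per character:

    #include <ctype.h>
    #include <math.h>
    #include <stddef.h>

    /*
     * Shannon entropy (bits per byte) of the non-whitespace bytes of a
     * block, multiplied by the number of such bytes, i.e. the total
     * information content of the block.
     */
    static double block_score(const char *block, size_t len)
    {
    	size_t counts[256] = { 0 };
    	size_t total = 0;
    	double entropy = 0.0;
    	size_t i;

    	/* Histogram of non-whitespace bytes, as in the proposed heuristic. */
    	for (i = 0; i < len; i++) {
    		unsigned char c = block[i];
    		if (!isspace(c)) {
    			counts[c]++;
    			total++;
    		}
    	}
    	if (!total)
    		return 0.0;

    	/* H = -sum(p_c * log2(p_c)) over the byte values that occur. */
    	for (i = 0; i < 256; i++) {
    		if (counts[i]) {
    			double p = (double)counts[i] / total;
    			entropy -= p * log2(p);
    		}
    	}
    	return entropy * total;
    }

A caller could then compare block_score() against a fixed threshold
instead of counting !isspace characters. A line of identical closing
braces has zero entropy and therefore scores zero regardless of its
length, while non-ASCII text still scores high because its UTF-8 bytes
are varied.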