On Mon, Aug 14, 2017 at 4:57 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote: > The existing behavior of diff --color-moved=zebra does not define the > minimum size of a block at all, instead relying on a heuristic applied > later to filter out sets of adjacent moved lines that are shorter than 3 > lines long. This can be confusing, because a block could thus be colored > as moved at the source but not at the destination (or vice versa), > depending on its neighbors. > > Instead, teach diff that the minimum size of a block is 10 > non-whitespace characters. This allows diff to still exclude > uninteresting lines appearing on their own (such as those solely > consisting of one or a few closing braces), as was the intention of the > adjacent-moved-line heuristic. After some thought, I really like this heuristic, however allow me a moment to bikeshed 10 as a number here. One could think that 10 equals roughly 3 lines a 3 characters and in C based languages the shortest meaningful lines have more than 3 characters ("i++;", "a();", "int i;" have 4 or 5 each), but I would still think that 10 is too much, as we'd want to detect the closing braces in their own lines. > dimmed_zebra:: > Similar to 'zebra', but additional dimming of uninteresting parts > of moved code is performed. The bordering lines of two adjacent > diff --git a/diff.c b/diff.c > index f598d8a3a..305ce4126 100644 > --- a/diff.c > +++ b/diff.c > @@ -864,19 +864,28 @@ static int shrink_potential_moved_blocks(struct moved_entry **pmb, > /* > * If o->color_moved is COLOR_MOVED_PLAIN, this function does nothing. > * > - * Otherwise, if the last block has fewer lines than > - * COLOR_MOVED_MIN_BLOCK_LENGTH, unset DIFF_SYMBOL_MOVED_LINE on all lines in > - * that block. > + * Otherwise, if the last block has fewer non-space characters than > + * COLOR_MOVED_MIN_NON_SPACE_COUNT, unset DIFF_SYMBOL_MOVED_LINE on all lines > + * in that block. > * > * The last block consists of the (n - block_length)'th line up to but not > * including the nth line. > */ > static void adjust_last_block(struct diff_options *o, int n, int block_length) > { > - int i; > - if (block_length >= COLOR_MOVED_MIN_BLOCK_LENGTH || > - o->color_moved == COLOR_MOVED_PLAIN) > + int i, non_space_count = 0; > + if (o->color_moved == COLOR_MOVED_PLAIN) > return; > + for (i = 1; i < block_length + 1; i++) { > + const char *c = o->emitted_symbols->buf[n - i].line; > + for (; *c; c++) { > + if (isspace(*c)) > + continue; > + non_space_count++; > + if (non_space_count >= COLOR_MOVED_MIN_NON_SPACE_COUNT) > + return; When we do this counting, we could count the lines ourselves here as well. `n-block_count` should be equal to the line that has a different (flags & (DIFF_SYMBOL_MOVED_LINE | DIFF_SYMBOL_MOVED_LINE_ALT)) pattern than those before. (although we'd also have to check for i > 0, too) Your choice. > + } > + } > for (i = 1; i < block_length + 1; i++) > o->emitted_symbols->buf[n - i].flags &= ~DIFF_SYMBOL_MOVED_LINE; > } > @@ -923,7 +932,6 @@ static void mark_color_as_moved(struct diff_options *o, > } > > l->flags |= DIFF_SYMBOL_MOVED_LINE; > - block_length++; > > if (o->color_moved == COLOR_MOVED_PLAIN) > continue; > @@ -953,8 +961,13 @@ static void mark_color_as_moved(struct diff_options *o, > } > > flipped_block = (flipped_block + 1) % 2; > + > + adjust_last_block(o, n, block_length); > + block_length = 0; > } > > + block_length++; > + > if (flipped_block) > l->flags |= DIFF_SYMBOL_MOVED_LINE_ALT; > } > diff --git a/diff.h b/diff.h > index 5755f465d..9e2fece5b 100644 > --- a/diff.h > +++ b/diff.h > @@ -195,7 +195,7 @@ struct diff_options { > COLOR_MOVED_ZEBRA_DIM = 3, > } color_moved; > #define COLOR_MOVED_DEFAULT COLOR_MOVED_ZEBRA > - #define COLOR_MOVED_MIN_BLOCK_LENGTH 3 > + #define COLOR_MOVED_MIN_NON_SPACE_COUNT 10 > }; > > void diff_emit_submodule_del(struct diff_options *o, const char *line); > diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh > index 6f7758e5c..d8e7b77b9 100755 > --- a/t/t4015-diff-whitespace.sh > +++ b/t/t4015-diff-whitespace.sh > @@ -1101,9 +1101,9 @@ test_expect_success 'detect malicious moved code, inside file' ' > <BRED>-{<RESET> > <BLUE>-if (!u->is_allowed_foo)<RESET> > <BLUE>-return;<RESET> > - <BRED>-foo(u);<RESET> > - <BLUE>-}<RESET> > - <BLUE>-<RESET> > + <RED>-foo(u);<RESET> > + <RED>-}<RESET> > + <RED>-<RESET> Here we have 2 blocks, the first has 7 character, which we may want to detect, the second has only 1 char. The longest "uninteresting" line in C like languages might be "\t } else {" which has 6 non-ws characters. Thinking of other languages (shell "fi" is uninteresting, others are interesting, Latex \"custom" all bets are off), I think we may want to go lower and have COLOR_MOVED_MIN_NON_SPACE_COUNT to be about 6 (~ 2 characters on a 3 line block). That said this is all bikeshedding, feel free to ignore. Acked-by: Stefan Beller <sbeller@xxxxxxxxxx>