Re: Potential bug in --color-words output

Johannes Sixt <j6t@xxxxxxxx> · Tue, 8 Nov 2022 08:27:38 +0100

On 28.10.22 at 23:08, Simeon Krastnikov wrote:
> Hello,
> 
> Given an initial file with the contents "not to be", which I then change
> to "to be", the output of 'git diff --color-words' is
> 
>   notto be
> 
> with the first three letters colored red. To me this seems incorrect as
> it implies, or at least misleadingly suggests, that there was no space
> between "not" and "to" in the original file. (Even though in that case
> the output is actually "nottoto be" with the "notto" in red and "to" in
> green.)
> 
> If instead I start with a file with contents "to be", which I then
> change to "not to be", then the output is as expected:
> 
>   not to be
> 
> (First three letters colored green.)
> 
> Am I correct in seeing this as a bug? If so, any tips on what parts of
> diff.c to look at when starting a patch?

Well, not really. When you have a file with

   Line one.
   Line two.

then change it to

   Line ONE.
   Line TWO.

then --color-words currently prints it as

   Line one.ONE.
   Line two.TWO.

because it does not print the whitespace (here, the newline) after[*] a
sequence of deleted words. If that whitespace were printed, we would see

   Line one.
   ONE.
   Line two.
   TWO.

That is considered inferior; hence, it isn't printed.
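
To make the rule concrete, here is a rough Python sketch of it. This is
not the code in diff.c; it is a simplified model built on difflib, and
the name word_diff is made up for illustration. It assumes that words
are runs of non-whitespace, that all whitespace in the output is taken
from the new file, and that deleted words are emitted without the
whitespace that surrounded them in the old file. Instead of colors it
uses the [-...-] and {+...+} markers of --word-diff=plain.

```python
import difflib
import re


def word_diff(old: str, new: str) -> str:
    """Simplified model of word-diff output (illustrative, not Git's code)."""
    # Assumption: a "word" is a maximal run of non-whitespace characters.
    old_spans = [m.span() for m in re.finditer(r"\S+", old)]
    new_spans = [m.span() for m in re.finditer(r"\S+", new)]
    old_words = [old[s:e] for s, e in old_spans]
    new_words = [new[s:e] for s, e in new_spans]

    out = []
    cur = 0  # how much of the new text has been emitted so far
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words are emitted lazily as context
        if j1 < j2:
            # Added words: the change spans from the first to the last of them.
            plus_begin, plus_end = new_spans[j1][0], new_spans[j2 - 1][1]
        else:
            # Pure deletion: anchor right after the preceding new word.
            plus_begin = plus_end = new_spans[j1 - 1][1] if j1 > 0 else 0
        # Context, including its whitespace, comes from the new file only.
        out.append(new[cur:plus_begin])
        if i1 < i2:
            # Deleted words, without the whitespace that followed them.
            out.append("[-" + old[old_spans[i1][0]:old_spans[i2 - 1][1]] + "-]")
        if j1 < j2:
            out.append("{+" + new[plus_begin:plus_end] + "+}")
        cur = plus_end
    out.append(new[cur:])
    return "".join(out)
```

With this model, word_diff("not to be", "to be") gives "[-not-]to be",
while word_diff("to be", "not to be") gives "{+not+} to be", and the
two-line example above comes out as "Line [-one.-]{+ONE.+}" and
"Line [-two.-]{+TWO.+}" on separate lines.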

The current algorithm produces sensible output in the vast majority of
cases while also being fairly straightforward. To make it work "better"
(for some definition of that word) in the borderline cases, the
algorithm would have to be made considerably more sophisticated.

[*] It might be whitespace before a sequence of words, but that does not
change the gist of the argument.

-- Hannes