Re: [PATCH] --color-words: Make the word characters configurable

Junio C Hamano <gitster@xxxxxxxxx> · Sat, 03 May 2008 10:43:22 -0700

Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:

> Now, you can specify which characters are to be interpreted as word 
> characters with "--color-words=A-Za-z", or by setting the config variable 
> diff.wordCharacters.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
> ---
>
> 	I would have preferred an approach like this.

Hmmm...

> diff --git a/README b/README
> index 548142c..0e325e2 100644
> --- a/README
> +++ b/README
> @@ -4,7 +4,7 @@
>  
>  ////////////////////////////////////////////////////////////////
>  
> -"git" can mean anything, depending on your mood.
> +"git" cann mean anything, depending on your mood.

Heh.

> @@ -456,7 +514,7 @@ static void diff_words_show(struct diff_words_data *diff_words)
>  	plus.ptr = xmalloc(plus.size);
>  	memcpy(plus.ptr, diff_words->plus.text.ptr, plus.size);
>  	for (i = 0; i < plus.size; i++)
> -		if (isspace(plus.ptr[i]))
> +		if (!word_character[(unsigned char)plus.ptr[i]])
>  			plus.ptr[i] = '\n';
>  	diff_words->plus.current = 0;

I do not think there is much difference between specifying the set of word
characters and the set of non-word characters, especially as long as your
definition of "character" is limited to 8-bit bytes.  By enumerating word
characters, your patch lets the user specify the non-word characters
implicitly, as the remainder of the 256-element set.  By the way, I think
you meant to do the same for the "minus" side a few lines above this hunk.
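
To make the "remainder of the 256-element set" point concrete, here is a
rough sketch of how such a table could be filled.  Only the name
word_character comes from the hunk above; set_word_characters() and the
"x-y" range syntax (guessed from "--color-words=A-Za-z") are assumptions of
mine, not what the patch actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero means "this byte is a word character"; every other byte
 * is implicitly a non-word character. */
static unsigned char word_character[256];

/*
 * Hypothetical parser for a spec such as "A-Za-z".
 * "x-y" is taken as an inclusive byte range; anything else is a
 * literal, including a '-' that appears first or last.
 */
static void set_word_characters(const char *spec)
{
	const unsigned char *s = (const unsigned char *)spec;
	size_t i = 0, len;

	assert(spec != NULL);
	len = strlen(spec);
	memset(word_character, 0, sizeof(word_character));
	while (i < len) {
		unsigned lo = s[i], hi = lo;

		if (i + 2 < len && s[i + 1] == '-') {
			hi = s[i + 2];
			i += 3;
		} else {
			i++;
		}
		/* A reversed range such as "z-a" selects nothing. */
		for (; lo <= hi; lo++)
			word_character[lo] = 1;
	}
}
```

With a table like this, a "non-word characters" option would amount to
inverting every entry, which is why the two ways of specifying the set are
equivalent for 8-bit bytes.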

I commented on the patch from Ping earlier about a quite different issue.
I was wondering if we can avoid losing the non-word character information.
The original code replaces any isspace() byte with LF, but a whitespace is
a whitespace is a whitespace, so not much information is lost.  Making the
above isspace() configurable, however, means that you are now going to
drop non-space non-word characters from the output.
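
A small illustration of that loss, as a sketch only: is_word_byte() below
is a hypothetical stand-in for whatever the configured test ends up being
(here simply isalnum()), and same_after_rewrite() is a helper made up for
this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption: alphanumerics are the configured word characters. */
static int is_word_byte(unsigned char c)
{
	return isalnum(c) != 0;
}

/* Same shape as the loop in the hunk: overwrite non-word bytes with LF. */
static void break_at_non_words(char *ptr, size_t size)
{
	size_t i;

	assert(ptr != NULL || size == 0);
	for (i = 0; i < size; i++)
		if (!is_word_byte((unsigned char)ptr[i]))
			ptr[i] = '\n';
}

/* Returns 1 when two inputs become indistinguishable after the rewrite. */
static int same_after_rewrite(const char *a, const char *b)
{
	char x[64], y[64];
	size_t la = strlen(a), lb = strlen(b);

	if (la >= sizeof(x) || lb >= sizeof(y))
		return 0;
	memcpy(x, a, la + 1);
	memcpy(y, b, lb + 1);
	break_at_non_words(x, la);
	break_at_non_words(y, lb);
	return strcmp(x, y) == 0;
}
```

Under these assumptions same_after_rewrite("x+y", "x-y") returns 1: both
turn into "x\ny", so the change from '+' to '-' is invisible to the word
diff.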

Instead of dropping the original character and replacing it with LF,
I thought a more sensible approach would be to _insert_ a line break
between runs of word characters and non-word characters (while probably
dropping an LF in the original).  That is, instead of what the current
implementation of the above loop does to "ab  c d" (i.e. rewrite it to
"ab\n\nc\nd"), rewrite it to "ab\n  \nc\n \nd".  That feels more
consistent with the way \b should work.
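
Something along these lines, as an untested-in-git sketch.  The names
split_runs() and is_word_byte() are made up for illustration, the word test
is simply "not whitespace", and treating an LF in the input as a plain run
boundary is my guess at what "dropping an LF" should mean.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: a word character is any non-whitespace byte. */
static int is_word_byte(unsigned char c)
{
	return !isspace(c);
}

/*
 * Return a newly allocated copy of text with an LF inserted at every
 * boundary between a run of word bytes and a run of non-word bytes.
 * An LF already present in the input is not kept as a run of its own;
 * it only ends the current run.  The caller frees the result.
 */
static char *split_runs(const char *text)
{
	size_t size, i, n = 0;
	int prev = -1;	/* -1: no run in progress */
	char *out;

	assert(text != NULL);
	size = strlen(text);
	out = malloc(2 * size + 1);	/* at most one LF per input byte */
	if (!out)
		return NULL;
	for (i = 0; i < size; i++) {
		unsigned char c = (unsigned char)text[i];
		int cur;

		if (c == '\n') {
			if (n > 0 && out[n - 1] != '\n')
				out[n++] = '\n';
			prev = -1;
			continue;
		}
		cur = is_word_byte(c);
		if (prev != -1 && cur != prev)
			out[n++] = '\n';
		out[n++] = (char)c;
		prev = cur;
	}
	out[n] = '\0';
	return out;
}
```

With this, split_runs("ab  c d") yields "ab\n  \nc\n \nd": every byte of
the input other than LF survives, so the non-word runs can still be shown
in the output.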