Re: [PATCH] --color-words: Make the word characters configurable

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Sun, 4 May 2008 10:25:39 +0100 (BST)

Hi,

On Sat, 3 May 2008, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:
> 
> > Now, you can specify which characters are to be interpreted as word 
> > characters with "--color-words=A-Za-z", or by setting the config variable 
> > diff.wordCharacters.
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
> > ---
> >
> > 	I would have preferred an approach like this.
> 
> Hmmm...

Just to clarify: by "an approach like this" I mean specifying word 
characters, and allowing sets (as can be specified for tr(1)).
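
For illustration only, here is a minimal sketch of how a tr(1)-like set 
such as "A-Za-z" could be expanded into a 256-entry lookup table.  This is 
not the code from the patch: parse_word_characters() is a hypothetical 
name, and the rule that a leading or trailing '-' is taken literally is an 
assumption.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): fill a 256-entry table
 * from a tr(1)-like spec such as "A-Za-z_".
 */
static void parse_word_characters(const char *spec, unsigned char table[256])
{
	const unsigned char *p = (const unsigned char *)spec;

	assert(spec && table);
	memset(table, 0, 256);
	while (*p) {
		unsigned int lo = *p, hi = *p;

		/* "a-z" is a range; a leading or trailing '-' is a literal */
		if (p[1] == '-' && p[2]) {
			hi = p[2];
			p += 3;
		} else
			p++;
		/* a reversed range such as "z-a" selects nothing */
		for (; lo <= hi; lo++)
			table[lo] = 1;
	}
}
```

The resulting table can then be indexed with (unsigned char)c, the same 
way word_character[] is indexed in the hunk quoted further down.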

> > diff --git a/README b/README
> > index 548142c..0e325e2 100644
> > --- a/README
> > +++ b/README
> > @@ -4,7 +4,7 @@
> >  
> >  ////////////////////////////////////////////////////////////////
> >  
> > -"git" can mean anything, depending on your mood.
> > +"git" cann mean anything, depending on your mood.
> 
> Heh.

Yeah, I already said I am a moron.  I can repeat it if it makes you 
happier ;-)

> > @@ -456,7 +514,7 @@ static void diff_words_show(struct diff_words_data *diff_words)
> >  	plus.ptr = xmalloc(plus.size);
> >  	memcpy(plus.ptr, diff_words->plus.text.ptr, plus.size);
> >  	for (i = 0; i < plus.size; i++)
> > -		if (isspace(plus.ptr[i]))
> > +		if (!word_character[(unsigned char)plus.ptr[i]])
> >  			plus.ptr[i] = '\n';
> >  	diff_words->plus.current = 0;
> 
> I do not think there is much difference between specifying the set of 
> word characters and the set of non-word characters, especially as long 
> as your definition of "character" is limited to 8-bit bytes.  By 
> enumerating word characters, your patch is letting the user specify the 
> non-word characters as the remainder of the 256-element set.  By the 
> way, I think you meant to do the same for the "minus" side a few lines 
> above this hunk.

I just imitated Ping's patch, but you're right, I forgot that.
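
One way to keep the two sides from drifting apart would be to move the 
loop into a small helper and call it for both buffers.  A minimal sketch, 
assuming a 256-entry word_character[] table as in the hunk above; 
split_at_non_words() is a hypothetical name, not something in the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: overwrite every non-word byte with LF, in place,
 * so that libxdiff sees one word per line.
 */
static void split_at_non_words(char *buf, size_t size,
			       const unsigned char word_character[256])
{
	size_t i;

	assert(buf || !size);
	for (i = 0; i < size; i++)
		if (!word_character[(unsigned char)buf[i]])
			buf[i] = '\n';
}
```

It would then be called once with minus.ptr and minus.size and once with 
plus.ptr and plus.size.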

> I commented on the patch from Ping earlier about a quite different issue. 
> I was wondering if we can avoid losing the non-word character 
> information. The original code replaces any isspace byte with LF, but a 
> whitespace is a whitespace is a whitespace so there won't be much loss 
> of information, but making the above isspace() configurable means that 
> now you are going to drop non-space non-word characters from the output 
> set.
> 
> Instead of dropping the original character and replacing it with LF, I 
> thought a more sensible approach would be to _insert_ a line break 
> between runs of word characters and non-word characters (while probably 
> dropping a LF in the original).  That is, instead of what the current 
> implementation of the above loop does to "ab c d" (i.e. rewrite it to 
> "ab\n\nc\nd"), rewrite it to "ab\n \nc\n \nd".  Which feels more 
> consistent with the way \b should work.

The conversion to "\n" is done only because of limitations in libxdiff 
(did I not just rant about artificial limitations in another mail?), 
because it is married to the notion that LF ends a line.

Now, there are two options:

- try to reconstruct the original text from what libxdiff returns.  This 
  is potentially memory-efficient, but tricky, and therefore easy to get 
  wrong.

- go with your approach.  You will have to duplicate all the text, so this 
  is quite heavy on memory consumption.  You also have to do something 
  special for _real_ LFs so that they are not stripped away when 
  displaying the result.

I like your idea (I was trying to come up with something sensible for the 
first option, but as I said, it is too tricky).

But the LF issue is a real one.
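
To make the second option concrete, here is a minimal sketch of the 
"insert a line break between runs" idea.  It is an illustration under 
assumptions, not a proposed implementation: split_runs() is a hypothetical 
name, plain malloc() stands in for xmalloc(), and real LFs are simply 
copied through, so the sketch does _not_ solve the LF issue.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical sketch: copy src into a new buffer, inserting LF wherever
 * a run of word characters meets a run of non-word characters.
 * Returns a malloc()ed buffer (NULL on failure); *out_size gets its length.
 * Real LFs in src are copied as-is and cannot be told apart from the
 * inserted ones afterwards; that part is left open here.
 */
static char *split_runs(const char *src, size_t size,
			const unsigned char word_character[256],
			size_t *out_size)
{
	/* worst case: a boundary after every byte but the last */
	char *dst = malloc(size ? 2 * size : 1);
	size_t i, j = 0;
	int prev = 0;

	assert(out_size);
	if (!dst)
		return NULL;
	for (i = 0; i < size; i++) {
		int cur = !!word_character[(unsigned char)src[i]];

		if (i && cur != prev)
			dst[j++] = '\n';
		dst[j++] = src[i];
		prev = cur;
	}
	*out_size = j;
	return dst;
}
```

With only letters marked as word characters, "ab c d" becomes 
"ab\n \nc\n \nd", at the cost of a second buffer of up to twice the size.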

Ciao,
Dscho
