Hi,

On Sat, 3 May 2008, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:
>
> > Now, you can specify which characters are to be interpreted as word
> > characters with "--color-words=A-Za-z", or by setting the config variable
> > diff.wordCharacters.
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
> > ---
> >
> > 	I would have preferred an approach like this.
>
> Hmmm...

Just to clarify: specifying word characters, and allowing sets (as
specifiable for tr(1)).

> > diff --git a/README b/README
> > index 548142c..0e325e2 100644
> > --- a/README
> > +++ b/README
> > @@ -4,7 +4,7 @@
> >
> >  ////////////////////////////////////////////////////////////////
> >
> > -"git" can mean anything, depending on your mood.
> > +"git" cann mean anything, depending on your mood.
>
> Heh.

Yeah, I already said I am a moron.  I can repeat it if it makes you
happier ;-)

> > @@ -456,7 +514,7 @@ static void diff_words_show(struct diff_words_data *diff_words)
> >  	plus.ptr = xmalloc(plus.size);
> >  	memcpy(plus.ptr, diff_words->plus.text.ptr, plus.size);
> >  	for (i = 0; i < plus.size; i++)
> > -		if (isspace(plus.ptr[i]))
> > +		if (!word_character[(unsigned char)plus.ptr[i]])
> >  			plus.ptr[i] = '\n';
> >  	diff_words->plus.current = 0;
>
> I do not think there is much difference between specifying the set of
> word characters and the set of non-word characters, especially as long
> as your definition of "character" is limited to 8-bit bytes.  By
> enumerating word characters, your patch is letting the user specify non
> word characters that are remainder from the 256-element set.  By the
> way, I think you meant to do the same for the "minus" side a few lines
> above this hunk.

I just imitated Ping's patch, but you're right, I forgot that.

> I commented on the patch from Ping earlier about a quite different issue.
> I was wondering if we can avoid losing the non-word character
> information.  The original code replaces any isspace byte with LF, but a
> whitespace is a whitespace is a whitespace so there won't be much loss
> of information, but making the above isspace() configurable means that
> now you are going to drop non-space non-word characters from the output
> set.
>
> Instead of dropping the original character and replacing it with LF, I
> thought a more sensible approach would be to _insert_ a line break
> between runs of word characters and non-word characters (while probably
> dropping a LF in the original).  That is, instead of what the current
> implementation of the above loop does to "ab c d" (i.e. rewrite it to
> "ab\n\nc\nd"), rewrite it to "ab\n \nc\n \nd".  Which feels more
> consistent with the way how \b should work.

The conversion to "\n" is done only because of limitations in libxdiff
(did I not just rant about artificial limitations in another mail?),
because it is married to the notion that LF ends a line.

Now, there are two options:

- try to reconstruct the original text from what libxdiff returns.  This
  is potentially memory-efficient, but tricky, and therefore easy to get
  wrong.

- go with your approach.  You will have to duplicate all the text, so this
  is something quite heavy on memory consumption.  But you have to do
  something special for _real_ LFs so that they are not stripped away when
  displaying the result.

I like your idea (I was trying to come up with something sensible for the
first option, but as I said, it is too tricky).  But the LF issue is a
real one.

Ciao,
Dscho
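
P.S. To make the difference between the two rewrites concrete, here is a
rough standalone sketch.  It is not the code from the patch: is_word() is a
hypothetical stand-in for the word_character[] table, and split_replace()
and split_insert() are names made up for this illustration.  Real LFs in
the input are treated like any other non-word byte, so the LF issue
mentioned above is deliberately left unsolved.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the patch's word_character[] lookup table. */
static int is_word(unsigned char c)
{
	return isalnum(c) || c == '_';
}

/*
 * Replace approach: overwrite every non-word byte with LF, in place.
 * The original non-word bytes are lost.
 */
static void split_replace(char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (!is_word((unsigned char)buf[i]))
			buf[i] = '\n';
}

/*
 * Insert approach: copy the text and add an LF wherever a run of word
 * bytes meets a run of non-word bytes.  No input byte is dropped.
 * Returns a malloc'd, NUL-terminated string that the caller must free,
 * or NULL if the allocation fails.
 */
static char *split_insert(const char *buf, size_t len)
{
	/* Worst case: one LF between every pair of bytes, plus the NUL. */
	char *out = malloc(len ? 2 * len : 1);
	size_t i, n = 0;

	if (!out)
		return NULL;
	for (i = 0; i < len; i++) {
		if (i && is_word((unsigned char)buf[i]) !=
			 is_word((unsigned char)buf[i - 1]))
			out[n++] = '\n';
		out[n++] = buf[i];
	}
	out[n] = '\0';
	return out;
}
```

With these definitions, split_replace() turns "foo(bar)" into
"foo\nbar\n", losing both parentheses, while split_insert() yields
"foo\n(\nbar\n)".  The copy is what makes the second variant heavier on
memory: it can need up to twice the size of the input.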