Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Tay Ray Chuan <rctay89@xxxxxxxxx> · Wed, 18 Jan 2012 15:32:29 +0800

On Thu, Jan 12, 2012 at 5:22 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
> [snip]
> Case in point, consider my patch sent out yesterday
>
>  http://article.gmane.org/gmane.comp.version-control.git/188391
>
> It consists of a one-hunk doc update.  word-diff is not brilliant:
>
>  -k::
>          Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
>          header line to extract the title line for the commit log
>          [-message,-]
>  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
>  [-      whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
>  [-      then prepends "[PATCH] ".-]{+message.+}  This [-flag forbids-]{+option prevents+} this munging, and is most
>          useful when used to read back 'git format-patch -k' output.
> [snip the rest as it's only {+}]
>
> But character-diff tries too hard to find common subsequences:
>
>  $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel
>[snip]
>  w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[
>
> is just line noise?  The colors don't even help as most of it is removed
> (red).

You missed the '+' quantifier, as in

  [^[:space:]]+

Using that regex, that abomination of a word-diff that you mentioned
disappears, like this:

-k::
	Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
	header line to extract the title line for the commit log
	[-message,-]
[-	among which (1) remove 'Re:' or 're:', (2) leading-]
[-	whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
[-	then prepends "[PATCH] ".-]{+message.+}  This [-flag
forbids-]{+option prevents+} this munging, and is most
	useful when used to read back 'git format-patch -k' output.

> [snip]
> That being said, I can see some arguments for changing the default to
> split punctuation into a separate word.  That is, whereas the current
> default is semantically equivalent to a wordRegex of
>
>  [^[:space:]]*
>
> (but has a faster code path)

Oh right, there *is* a sensible default implemented in. Somehow I was
under the impression that there wasn't.

I wonder which is faster, using the non-whitespace regex, or the
isspace() calls...

> and your proposal is equivalent to
>
>  [^[:space:]]|UTF_8_GUARD
>
> I think there is a case to be made for a default of
>
>  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+
>
> or some such.  There's a lot of bikeshedding lurking in the (non)extent
> of the [[:alnum:]] here, however.

Care to explain further? Not to sure what you mean here.

-- 
Cheers,
Ray Chuan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html