Re: [PATCH v3 2/4] word diff: customizable word splits

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 11 Jan 2009 14:20:07 -0800

Thomas Rast <trast@xxxxxxxxxxxxxxx> writes:

> Allows for user-configurable word splits when using --color-words.
> This can make the diff more readable if the regex is configured
> according to the language of the file.
>
> Each non-overlapping match of the regex is a word; everything in
> between is whitespace.

What happens if the input "language" does not have any inter-word spacing
but its words can still be expressed by regexp patterns?

ImagineALanguageThatAllowsYouToWriteSomethingLikeThis.  Does the mechanism
help users who want to word-diff files written in such a language, by
outputting:

	ImagineALanguage<red>That</red><green>Which</green>AllowsYou...

when '[A-Z][a-z]*' is given as the word pattern?
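
To spell out the behaviour I am asking about, here is a rough sketch in
Python.  It is not the patch's code: split_words and changed_words are
made-up helpers, Python's re module only stands in for the POSIX matcher,
and difflib stands in for the real diff machinery.  It treats every
non-empty, non-overlapping match as a word and reports which words differ.

```python
import difflib
import re

def split_words(text, pattern):
    """Return every non-empty, non-overlapping match of pattern, in order."""
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]

def changed_words(old, new, pattern):
    """Return (removed, added) word lists between two strings."""
    a, b = split_words(old, pattern), split_words(new, pattern)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(a[i1:i2])
            added.extend(b[j1:j2])
    return removed, added

pattern = "[A-Z][a-z]*"
words = split_words("ImagineALanguageThatAllowsYou", pattern)
removed, added = changed_words("ImagineALanguageThatAllowsYou",
                               "ImagineALanguageWhichAllowsYou", pattern)
print(words)
print(removed, added)
```

If the patch behaves like this sketch, only the one differing word would
be coloured and the surrounding words would stay as context.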

> We disallow matching the empty string (because
> it results in an endless loop) or a newline (breaks color escapes and
> interacts badly with the input coming from the usual line diff).  To
> help the user, we set REG_NEWLINE so that [^...] and . do not match
> newlines.

AndImagineALanguageWhoseWordStruc
tureDoesNotCareAboutLineBreak

Can you help users with such payload?
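
The concern can be made concrete with a small Python sketch.  This is not
the patch's code: split_words is a hypothetical helper and Python's re
module only stands in for the POSIX matcher.  With the '[A-Z][a-z]*'
pattern, a word that is wrapped in the middle loses its tail, because the
continuation does not begin a match and so falls into the "whitespace"
between words.

```python
import re

def split_words(text, pattern):
    """Return every non-empty, non-overlapping match of pattern, in order."""
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]

pattern = "[A-Z][a-z]*"
wrapped = "AndImagineALanguageWhoseWordStruc\ntureDoesNotCareAboutLineBreak"
# Joining the lines first is only an illustration of what the payload means,
# not something the patch is claimed to do.
unwrapped = wrapped.replace("\n", "")

wrapped_words = split_words(wrapped, pattern)
unwrapped_words = split_words(unwrapped, pattern)
print(wrapped_words)
print(unwrapped_words)
```

Rewrapping the same text at a different column would therefore change the
word list even though, to a reader of such a language, nothing changed.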

	Side note.  Yes, I am coming from a Japanese background.

	Side note 2.  No, I am not saying your code must support both of
	the above to be acceptable.  I am just gauging the design
	assumptions and limitations.

> Insertion of spaces is somewhat subtle.  We echo a "context" space
> twice (once on each side of the diff) if it follows directly after a
> word.  While this loses a tiny bit of accuracy, it runs together long
> sequences of changed word into one removed and one added block, making
> the diff much more readable.

I guess this part can later be made more precise, so that it keeps the
original context space more faithfully (i.e. it does not lose two
consecutive spaces in the original occidental script, and does not insert
any extra space into the oriental script), if we were to support the
second example I gave above in a follow-up patch.

> +--color-words[=<regex>]::
> +	Show colored word diff, i.e., color words which have changed.
> +	By default, a new word only starts at whitespace, so that a
> +	'word' is defined as a maximal sequence of non-whitespace
> +	characters.  The optional argument <regex> can be used to
> +	configure this.
> ++
> +The <regex> must be an (extended) regular expression.  When set, every
> +non-overlapping match of the <regex> is considered a word.  (Regular
> +expression semantics ensure that quantifiers grab a maximal sequence
> +of characters.)  Anything between these matches is considered
> +whitespace and ignored for the purposes of finding differences.  You
> +may want to append `|\S` to your regular expression to make sure that
> +it matches all non-whitespace characters.

Whose regexp library do we assume here?  Traditionally we limited
ourselves to POSIX BRE, and I do not think anybody minds using POSIX ERE
here, but we need to be clear.  In either case, \S is a PCRE extension
that is outside POSIX.
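
What the suggested fallback alternative buys the user can be sketched in
Python, whose re module happens to accept \S.  Under POSIX ERE the same
fallback would have to be spelled as a bracket expression such as
'[^[:space:]]', which Python's re does not understand, so the sketch only
shows the effect and not the portable spelling.  split_words and the
identifier pattern are hypothetical and not taken from the patch.

```python
import re

def split_words(text, pattern):
    """Return every non-empty, non-overlapping match of pattern, in order."""
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]

identifiers = "[A-Za-z_][A-Za-z0-9_]*"
# Fallback: any single non-whitespace character that the first
# alternative did not already consume becomes a word of its own.
with_fallback = identifiers + r"|\S"

old, new = "a = b + c", "a = b - c"

same_without = split_words(old, identifiers) == split_words(new, identifiers)
same_with = split_words(old, with_fallback) == split_words(new, with_fallback)
print(same_without, same_with)
```

Without the fallback the operators fall between matches and are ignored,
so the two lines compare equal word for word; with it, the changed
operator shows up as a changed word.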

The rest I only skimmed but did not spot anything glaringly wrong; thanks.
