Thomas Rast <trast@xxxxxxxxxxxxxxx> writes:

> Allows for user-configurable word splits when using --color-words.
> This can make the diff more readable if the regex is configured
> according to the language of the file.
>
> Each non-overlapping match of the regex is a word; everything in
> between is whitespace.

What happens if the input "language" does not have any inter-word spacing
but its words can still be expressed by regexp patterns?

    ImagineALanguageThatAllowsYouToWriteSomethingLikeThis.

Does the mechanism help users who want to do word-diff files written in
such a language by outputting:

    ImagineALanguage<red>That</red><green>Which</green>AllowsYou...

when '[A-Z][a-z]*' is given by the word pattern?

> We disallow matching the empty string (because
> it results in an endless loop) or a newline (breaks color escapes and
> interacts badly with the input coming from the usual line diff).  To
> help the user, we set REG_NEWLINE so that [^...] and . do not match
> newlines.

    AndImagineALanguageWhoseWordStruc
    tureDoesNotCareAboutLineBreak

Can you help users with such payload?

    Side note.  Yes, I am coming from Japanese background.

    Side note 2.  No, I am not saying your code must support both of the
    above to be acceptable.  I am just gauging the design assumptions and
    limitations.

> Insertion of spaces is somewhat subtle.  We echo a "context" space
> twice (once on each side of the diff) if it follows directly after a
> word.  While this loses a tiny bit of accuracy, it runs together long
> sequences of changed word into one removed and one added block, making
> the diff much more readable.

I guess this part can be later enhanced to be more precise, so that it
keeps the original context space more faithfully (i.e. does not lose two
consecutive spaces in the original occidental script, and does not insert
any extra space to the oriental script), if we were to support the second
example I gave above in the future as a follow-up patch.
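
To make the two questions above concrete, here is a rough sketch of the
"every non-overlapping match is a word" rule applied to the space-less
examples.  It is not the patch's code: Python's `re` and `difflib` stand in
for the POSIX regex library and the diff machinery, and `split_words` and
`changed_words` are hypothetical helper names.  Scanning line by line is an
assumption that models "a word never contains a newline".

```python
import difflib
import re


def split_words(text, pattern):
    """Return every non-overlapping match of `pattern` in `text` as a word."""
    regex = re.compile(pattern)
    words = []
    # Assumption: scan line by line, so a word can never contain a newline.
    for line in text.split("\n"):
        for match in regex.finditer(line):
            if match.group(0) == "":
                # Mirrors the rule that the pattern must not match "".
                raise ValueError("word pattern must not match the empty string")
            words.append(match.group(0))
    return words


def changed_words(old_text, new_text, pattern):
    """Return (removed, added) word lists for each differing region."""
    old = split_words(old_text, pattern)
    new = split_words(new_text, pattern)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


PATTERN = "[A-Z][a-z]*"

print(split_words("ImagineALanguageThatAllowsYou", PATTERN))
# ['Imagine', 'A', 'Language', 'That', 'Allows', 'You']

print(changed_words("ImagineALanguageThatAllowsYou",
                    "ImagineALanguageWhichAllowsYou", PATTERN))
# [(['That'], ['Which'])]

# A word broken across lines: "ture" matches nothing and is treated as
# inter-word filler, so a change inside it would go unnoticed.
print(split_words("WordStruc\ntureDoes", PATTERN))
# ['Word', 'Struc', 'Does']
```

Under these assumptions the first example splits the way the question
hopes, while the line-wrapped one does not: the second half of the broken
word falls outside every match.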

> +--color-words[=<regex>]::
> +	Show colored word diff, i.e., color words which have changed.
> +	By default, a new word only starts at whitespace, so that a
> +	'word' is defined as a maximal sequence of non-whitespace
> +	characters.  The optional argument <regex> can be used to
> +	configure this.
> ++
> +The <regex> must be an (extended) regular expression.  When set, every
> +non-overlapping match of the <regex> is considered a word.  (Regular
> +expression semantics ensure that quantifiers grab a maximal sequence
> +of characters.)  Anything between these matches is considered
> +whitespace and ignored for the purposes of finding differences.  You
> +may want to append `|\S` to your regular expression to make sure that
> +it matches all non-whitespace characters.

Whose regexp library do we assume here?  Traditionally we limited
ourselves to POSIX BRE, and I do not think anybody minds using POSIX ERE
here, but we need to be clear.  In either case \S is a pcre outside POSIX.

The rest I only skimmed but did not spot anything glaringly wrong; thanks.
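
As an illustration of what the suggested `|\S` fallback buys, the sketch
below compares a word pattern with and without it.  It uses Python's `re`,
which does understand `\S`; that is an assumption of the sketch, not a
statement about the library the patch uses.  In POSIX ERE the portable
spelling of the same fallback would be `|[^[:space:]]`, which Python's `re`
in turn does not support.  `words` is a hypothetical helper name.

```python
import re


def words(text, pattern):
    """Return every non-overlapping match of `pattern` in `text`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


text = "foo+bar (baz)"

# Without the fallback, punctuation falls between matches and is ignored,
# so a change from "+" to "-" would not show up as a difference.
print(words(text, r"[a-z]+"))
# ['foo', 'bar', 'baz']

# With the fallback, every non-whitespace character belongs to some word.
print(words(text, r"[a-z]+|\S"))
# ['foo', '+', 'bar', '(', 'baz', ')']
```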