Re: [PATCH 3/4] color-words: refactor to allow for 0-character word boundaries

Thomas Rast <trast@xxxxxxxxxxxxxxx> · Mon, 12 Jan 2009 09:47:11 +0100

As a side remark, this patch makes a good use-case for --patience, and
is not isomorphic to the other edit-and-move examples; rather it's a
delete-and-edit.

Johannes Schindelin wrote:
> Subject: [PATCH 3/4] color-words: refactor to allow for 0-character word boundaries

I do not think the term "refactor" is accurate.  Wikipedia roughly
defines it as a code change that preserves all external semantics by
some standard method, and lists methods such as variable renaming,
common code extraction, etc.  You are actually completely replacing
the algorithm "under the hood" with a new one, so no such standard
method applies.

And there is also a tiny semantic change: compare

  A: a b  c
  B: x y  z
        ^^

The old version implicitly generated an empty line at the double
spaces (marked ^^), which subsequently became context and caused the
words to be printed as follows, where <..> is old and [..] is new:

  <a b >[x y ] <c>[z]

Your patched version does not generate empty lines for any space
whatsoever, not even for newlines.  Thus the result is

  <a b  c>[x y  z]

I think this is actually a good change, since it results in longer
chunks for "entirely rewritten" parts of the diff.  It also answers
Junio's question in the other thread:

Junio C Hamano wrote:
>> 
>> What happens if the input "language" does not have any inter-word spacing
>> but its words can still be expressed by regexp patterns?
>> 
>> ImagineALanguageThatAllowsYouToWriteSomethingLikeThis.  Does the mechanism
>> help users who want to do word-diff files written in such a language by
>> outputting:
>> 
>> 	ImagineALanguage<red>That</red><green>Which</green>AllowsYou...
>> 
>> when '[A-Z][a-z]*' is given by the word pattern?

Your patch handles this as a side-effect *even if the lines are
indented*, since no sequence of spaces whatsoever is special.  (Mine
would have given hard-to-predict results based on the number of
newlines between them, and xdiff's decision whether the newlines or
the words are more valuable as context.)

So I think this is actually an improvement, but the commit message
should point out the change in semantics.

> +static void fn_out_diff_words_aux(void *priv, char *line, unsigned long len)
>  {
> +	if (line[0] != '@' || parse_hunk_header(line, len,
> +			&minus_first, &minus_len, &plus_first, &plus_len))

It would be nice to have a comment here that points out that this
method crucially relies on having context length 0 (just as the old
one crucially relied on having the full text in a single hunk).

> +	for (i = 0; i < buffer->text.size; i++) {
> +		if (isspace(buffer->text.ptr[i]))
> +			continue;

I think it is this coupling of the loops to find a word, and to find a
word _beginning_, that comes back to haunt you in 4/4.  If the outer
loop was strictly about the words, you could use the regex match info
to find the beginning in the regex case.  This is probably cleaner
than attempting to force an anchored match, since at least the 'grep'
on my system takes '^^foo' to mean 'a "^foo" at the beginning of a
line', so you cannot just unconditionally insert a ^.  (Conditionally
inserting one seems even harder.)

These remarks aside (and the last one is the only one of relevance to
the code), this patch would be a vast improvement of the code even if
we weren't discussing it in the context of the regex feature.  So FWIW

  Acked-by: Thomas Rast <trast@xxxxxxxxxxxxxxx>

up to here.  I hope we can agree on some sane regex semantics for
4/4...

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

Attachment:
signature.asc

Description: This is a digitally signed message part.