Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Tay Ray Chuan <rctay89@xxxxxxxxx> writes:

> On Thu, Jan 12, 2012 at 4:05 AM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
>> Tay Ray Chuan <rctay89@xxxxxxxxx> writes:
>>
>>> Factor out the comprehensive non-whitespace regex in use by PATTERNS and
>>> IPATTERN and use it as the word-diff regex for the default diff driver.
>>
>> Why?

Sorry for distracting you with the performance argument; it was mostly
the first thing that came to my mind that I could use to ask for the
motivation, and evaluation of tradeoffs, that both were missing from the
proposed commit message.

> But I think it's worthwhile to trade-off performance for a sensible
> default. Something like
>
>   matrix[a,b,c]
>   matrix[d,b,c]
>
> gives
>
>   matrix[[-a-]{+d+},b,c]
>
> and when we have
>
>   ImagineALanguageLikeFoo
>   ImagineALanguageLikeBar
>
> we get
>
>   ImagineALanguageLike[-Foo-]{+Bar+}

In that case (and I should have read the original patch), I am
definitely against this change.  It turns the default word-diff into
character-diff, which is something entirely different, and frequently
useless precisely for the reason you state:

> (But I cheated. Foo and Bar have no common characters in common; if
> they did, the word diff would be messy.)

Case in point, consider my patch sent out yesterday

  http://article.gmane.org/gmane.comp.version-control.git/188391

It consists of a one-hunk doc update.  word-diff is not brilliant:

  -k::
          Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
          header line to extract the title line for the commit log
          [-message,-]
  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
  [-      whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
  [-      then prepends "[PATCH] ".-]{+message.+}  This [-flag forbids-]{+option prevents+} this munging, and is most
          useful when used to read back 'git format-patch -k' output.
[snip the rest as it's only {+}]

But character-diff tries too hard to find common subsequences:

  $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel
  -k::
          Usually the program [-'cl-]{+remov+}e[-an-]s {+email cr+}u[-p'-]{+ft from+} the Subject:
          header line to extract the title line for the commit log
          message[-,-]
  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
  [-      w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[-ically '[PATCH]', and-]t[-he-]{+io+}n pre[-p-]{+v+}en[-ds "[PATCH] ".  This flag forbid-]{+t+}s this munging, and is most
          useful when used to read back 'git format-patch -k' output.
[snip]

Wouldn't you agree that

  w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[

is just line noise?  The colors don't even help as most of it is removed
(red).

Regarding your examples

> [1] http://article.gmane.org/gmane.comp.version-control.git/105896
> [2] http://article.gmane.org/gmane.comp.version-control.git/105237

first please notice that both of them were written before (and actually
discussing) the introduction of the wordRegex feature.  At this point,
we were trying to make up our minds w.r.t. how powerful the feature
needs to be.  Nowadays (or in fact, starting a few days after those
emails) the user can easily achieve everything discussed here by setting
the wordRegex to taste.

That being said, I can see some arguments for changing the default to
split punctuation into a separate word.  That is, whereas the current
default is semantically equivalent to a wordRegex of

  [^[:space:]]*

(but has a faster code path) and your proposal is equivalent to

  [^[:space:]]|UTF_8_GUARD

I think there is a case to be made for a default of

  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+

or some such.  There's a lot of bikeshedding lurking in the (non)extent
of the [[:alnum:]] here, however.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]