Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Tay Ray Chuan <rctay89@xxxxxxxxx> · Fri, 20 Jan 2012 09:14:51 +0800

On Thu, Jan 19, 2012 at 11:53 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
>[snip]
> Under [^[:space:]]+ neither of the examples would work.  Actually,
> [^[:space:]]+ is the same as today's default, the [^[:space:]]* I
> mentioned later is (strictly speaking) broken as it allows for a
> 0-length match.  (It doesn't really matter because IIRC the engine
> ignores 0-length words.)

My bad.
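
For the record, the 0-length issue is easy to see in a minimal sketch.
This is not git's code: Python's re module stands in for the POSIX
engine, and the whitespace class is spelled out under the assumption
that [:space:] means the six C-locale whitespace characters.

```python
import re

# Assumption: [:space:] in the C locale is space, \t, \n, \r, \f, \v.
NON_SPACE = r"[^ \t\n\r\f\v]"


def matches(quantifier, text):
    """Return every match of NON_SPACE + quantifier, empty ones included."""
    return [m.group() for m in re.finditer(NON_SPACE + quantifier, text)]


text = "foo  bar"
plus_words = matches("+", text)  # '+' needs at least one character
star_words = matches("*", text)  # '*' can also match the empty string
```

Dropping the empty strings from star_words leaves the same list as
plus_words, which is the sense in which the 0-length matches would not
matter if the engine ignores them.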

>[snip]
> I tried measuring it across a few commits, but it mostly gets drowned
> out by the diff effort.  For a commit with stat
>
>  exercises/cgal/cover/cover.cpp  |    5 +-
>  exercises/cgal/cover/cover.in1  |27014 +++++++++++++++-----
>  exercises/cgal/cover/cover.in2  |48996 +++++++++++++++++++++++------------
>  exercises/cgal/cover/cover.in3  |55041 +++++++++++++++++++++++++--------------
>  exercises/cgal/cover/cover.in4  |47600 ++++++++++++++++++++--------------
>  exercises/cgal/cover/cover.int  |43491 ++++++++++++++++++++++---------
>  exercises/cgal/cover/cover.out1 |   53 +-
>  exercises/cgal/cover/cover.out2 |   24 +-
>  exercises/cgal/cover/cover.out3 |   11 +-
>  exercises/cgal/cover/cover.out4 |    2 +-
>  exercises/cgal/cover/cover.outt |   23 +-
>  exercises/cgal/cover/gen        |   39 +-
>  exercises/cgal/cover/gen-1.cpp  |    4 +-
>  exercises/cgal/cover/gen-2.cpp  |    6 +-
>  exercises/cgal/cover/gen-3.cpp  |    6 +-
>
> (sorry, can't share as those testcases are secret) I get best-of-5
> timings
>
>  --word-diff-regex='[^[:space:]]+'    0:07.50real 7.40user 0.07system
>  --word-diff                          0:07.47real 7.41user 0.03system
>
> In conclusion, "meh".  I think ripping out the isspace() part would make
> for a nice code reduction.

Thanks for the numbers. Well, that agrees with the intuition that a
regex is slower than isspace(), since you have to run it through the
regex engine, though going by the real times the difference is marginal.
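
To make the comparison concrete, here is a rough sketch of the two ways
of splitting words: a hand-written isspace()-style scan and the
equivalent regex. Both function names are made up for illustration, the
whitespace set is an assumption about the C locale, and none of this is
taken from git's implementation.

```python
import re

# Assumption: the bytes isspace() accepts in the C locale.
ASCII_SPACE = b" \t\n\r\f\v"
WORD_RE = re.compile(rb"[^ \t\n\r\f\v]+")


def words_isspace(data):
    """Split bytes into words by scanning for whitespace by hand."""
    words, i, n = [], 0, len(data)
    while i < n:
        # Skip the run of whitespace before the next word.
        while i < n and data[i] in ASCII_SPACE:
            i += 1
        start = i
        # Consume the run of non-whitespace bytes.
        while i < n and data[i] not in ASCII_SPACE:
            i += 1
        if i > start:
            words.append(data[start:i])
    return words


def words_regex(data):
    """Split bytes into words with the non-whitespace regex."""
    return WORD_RE.findall(data)
```

Both functions should return the same word list for any input, so any
difference between them is purely one of cost, not of output.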

>>> and your proposal is equivalent to
>>>
>>>  [^[:space:]]|UTF_8_GUARD
>>>
>>> I think there is a case to be made for a default of
>>>
>>>  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+
>>>
>>> or some such.  There's a lot of bikeshedding lurking in the (non)extent
>>> of the [[:alnum:]] here, however.
>>
>> Care to explain further? Not too sure what you mean here.
>
> For natural language, it may or may not make sense to match numbers as
> part of a word.
>
> For typical use in e.g. emails, a lot of punctuation has a double role;
> breaking words in
>
>  http://article.gmane.org/gmane.comp.version-control.git/188391
>
> may or may not make sense.
>
> For some uses, especially source code, it would be better to match an
> underscore _ as part of a complete word, too.
>
> For some programming languages, say lisp, a dash - would also belong in
> the same category.
>
> There's no real reason other than ease of implementation why the pattern
> handles ASCII non-alphanumerics separately, but non-ASCII UTF-8
> non-alnums (like, say, Unicode NO-BREAK SPACE which would show as \xc2
> \xa0) always go into a word.  But if you were to make UTF-8 sequences
> a single word, text in (say) many European languages would become
> chunked at accented letters.
>
> I'm sure you can find more items for this list.  It's a grey area.

Thanks.
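
To get a feel for the grey area, the candidate default and a few of the
variations from your list can be tried out with a small sketch. The
assumptions are mine: Python's re on bytes stands in for the POSIX
engine, the classes are ASCII-only stand-ins, UTF_8_GUARD is read as "a
lead byte followed by continuation bytes", and word_regex() and
split_words() are hypothetical helpers, not anything in git.

```python
import re

# Assumptions: ASCII-only stand-ins for [:space:] and [:alnum:], and
# UTF_8_GUARD taken to mean "lead byte plus continuation bytes".
SPACE = rb" \t\n\r\f\v"
ALNUM = rb"A-Za-z0-9"
UTF_8_GUARD = rb"[\xc0-\xff][\x80-\xbf]+"


def word_regex(extra_word_chars=b""):
    """Sketch of [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+ with optional
    extra ASCII characters (e.g. b"_" or b"_-") counted as word characters.

    Python's re takes the first alternative that matches, whereas POSIX
    takes the longest, so the multi-character alternative is listed first.
    """
    cls = b"[" + ALNUM + re.escape(extra_word_chars) + b"]"
    return re.compile(
        b"(?:" + cls + b"|" + UTF_8_GUARD + b")+|[^" + SPACE + b"]"
    )


# Variation where each UTF-8 sequence is a word of its own.
CHUNKED = re.compile(
    b"[" + ALNUM + b"]+|" + UTF_8_GUARD + b"|[^" + SPACE + b"]"
)


def split_words(pattern, text):
    """Return the words that pattern finds in the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return [m.group().decode("utf-8") for m in pattern.finditer(data)]


default = word_regex()
print(split_words(default, "foo_bar = 42"))            # underscore splits
print(split_words(word_regex(b"_"), "foo_bar = 42"))   # underscore joins
print(split_words(word_regex(b"_-"), "(make-list 3)")) # lisp-style names
print(split_words(default, "caf\u00e9 au lait"))       # accents stay in word
print(split_words(CHUNKED, "caf\u00e9 au lait"))       # chunked at the accent
print(split_words(default, "a\u00a0b"))                # NO-BREAK SPACE joins
```

Each tweak fixes one case and changes another, which is about as far as
a sketch like this can go towards picking a default.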

-- 
Cheers,
Ray Chuan