On Thu, Jan 19, 2012 at 11:53 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
>[snip]
> Under [^[:space:]]+ neither of the examples would work.  Actually,
> [^[:space:]]+ is the same as today's default, the [^[:space:]]* I
> mentioned later is (strictly speaking) broken as it allows for a
> 0-length match.  (It doesn't really matter because IIRC the engine
> ignores 0-length words.)

My bad.

>[snip]
> I tried measuring it across a few commits, but it mostly gets drowned
> out by the diff effort.  For a commit with stat
>
>  exercises/cgal/cover/cover.cpp  |    5 +-
>  exercises/cgal/cover/cover.in1  |27014 +++++++++++++++-----
>  exercises/cgal/cover/cover.in2  |48996 +++++++++++++++++++++++------------
>  exercises/cgal/cover/cover.in3  |55041 +++++++++++++++++++++++++--------------
>  exercises/cgal/cover/cover.in4  |47600 ++++++++++++++++++++----------
>  exercises/cgal/cover/cover.int  |43491 ++++++++++++++++++++++---------
>  exercises/cgal/cover/cover.out1 |   53 +-
>  exercises/cgal/cover/cover.out2 |   24 +-
>  exercises/cgal/cover/cover.out3 |   11 +-
>  exercises/cgal/cover/cover.out4 |    2 +-
>  exercises/cgal/cover/cover.outt |   23 +-
>  exercises/cgal/cover/gen        |   39 +-
>  exercises/cgal/cover/gen-1.cpp  |    4 +-
>  exercises/cgal/cover/gen-2.cpp  |    6 +-
>  exercises/cgal/cover/gen-3.cpp  |    6 +-
>
> (sorry, can't share as those testcases are secret) I get best-of-5
> timings
>
>   --word-diff-regex='[^[:space:]]+'   0:07.50real 7.40user 0.07system
>   --word-diff                         0:07.47real 7.41user 0.03system
>
> In conclusion, "meh".  I think ripping out the isspace() part would make
> for a nice code reduction.

Thanks for the numbers. That agrees, if only marginally (0:07.50 vs
0:07.47 real), with the intuition that the regex path is slower than
isspace(), since every word has to go through the regex engine.

>>> and your proposal is equivalent to
>>>
>>>   [^[:space:]]|UTF_8_GUARD
>>>
>>> I think there is a case to be made for a default of
>>>
>>>   [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+
>>>
>>> or some such.  There's a lot of bikeshedding lurking in the (non)extent
>>> of the [[:alnum:]] here, however.
>>
>> Care to explain further? Not too sure what you mean here.
>
> For natural language, it may or may not make sense to match numbers as
> part of a word.
>
> For typical use in e.g. emails, a lot of punctuation has a double role;
> breaking words in
>
>   http://article.gmane.org/gmane.comp.version-control.git/188391
>
> may or may not make sense.
>
> For some uses, especially source code, it would be better to match an
> underscore _ as part of a complete word, too.
>
> For some programming languages, say lisp, a dash - would also belong in
> the same category.
>
> There's no real reason other than ease of implementation why the pattern
> handles ASCII non-alphanumerics separately, but non-ASCII UTF-8
> non-alnums (like, say, Unicode NO-BREAK SPACE which would show as \xc2
> \xa0) always go into a word.  But if you were to make UTF-8 sequences
> a single word, text in (say) many European languages would become
> chunked at accented letters.
>
> I'm sure you can find more items for this list.  It's a grey area.

Thanks.

-- 
Cheers,
Ray Chuan