Thomas Rast wrote:

> I just took the laziest (and most obvious) approach possible when I
> wrote the original patterns.  I think the second most laziest one
> would be to observe that bit patterns for leading characters are
> always 11.., while those for continuation chars are 10..
>
> So that gives
>
>   |[\xc0-\xff][\x80-\xbf]+

Yes, that's what I was thinking of.  v2 will be a two-part series
starting with that.

BTW, the perl token matcher is pretty half-hearted.  In part this is
because "only perl can parse perl" [1] terrifies me and in part it
is because I am too lazy to write down the state machine implied by
PPI/Token/*.pm.  If some tokenization wizard would like to work on
it, something like the following might produce more pleasant word
diffs:

	"[%&$][[:space:]]*[0-9]+" /* $1 */
	"|[%&$][[:space:]]*([[:alpha:]_']|::)([[:alnum:]_']|::)*" /* $var1 */
	"|[%&$][[:space:]]*\\$([[:alnum:]_]|::)([[:alnum:]_']|::)*" /* $$var1 */
	"|[%&$][[:space:]]*\\$\\{" /* $${ introducing complicated expression */
	"|[%&$][[:space:]]*\\$\\$" /* $$$ introducing complicated expression */
	"|[%&$][[:space:]]*[^[:alnum:]_:'^$]" /* $! */
	"|[%&$][[:space:]]*\\^[][A-Z\\^_?]" /* $^A */
	"|[%&$][[:space:]]*\\{\\^[][A-Z\\^_?]\\}" /* ${^A} */
	"|[%&$][[:space:]]*\\{\\^[][A-Z\\^_?][[:alnum:]_]*\\}" /* ${^Foo} */
	/* ${var} */
	"|[%&$][[:space:]]*\\{[[:space:]]*([[:alpha:]_']|::)[[:alnum:]_:]*[[:space:]]*\\}"
	"|[%&$][[:space:]]*\\{" /* ${ introducing complicated expression */

... though it is an unmaintainable mess. :)

[1] perl::toke.c and http://www.perlmonks.org/?node_id=44722
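
For anyone who wants to convince themselves of what the quoted [\xc0-\xff][\x80-\xbf]+ alternative accepts, here is a minimal sketch that hand-codes the same test with bit masks instead of going through regcomp(). The helper names (is_lead, is_cont, utf8_word_len) are made up for illustration and are not part of any existing code.

```c
#include <assert.h>
#include <stddef.h>

/* Leading byte of a multi-byte sequence: top two bits are 11 (0xc0-0xff). */
static int is_lead(unsigned char c)
{
	return (c & 0xc0) == 0xc0;
}

/* Continuation byte: top two bits are 10 (0x80-0xbf). */
static int is_cont(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}

/*
 * Hypothetical helper: length in bytes of what
 * [\xc0-\xff][\x80-\xbf]+ would match at the start of s,
 * or 0 if the pattern does not match there.
 */
static size_t utf8_word_len(const char *s, size_t len)
{
	size_t i = 1;

	assert(s != NULL);
	/* Need one leading byte followed by at least one continuation byte. */
	if (len < 2 || !is_lead((unsigned char)s[0]) ||
	    !is_cont((unsigned char)s[1]))
		return 0;
	/* Greedily consume the remaining continuation bytes. */
	while (i < len && is_cont((unsigned char)s[i]))
		i++;
	return i;
}
```

For example, utf8_word_len("\xc3\xa9", 2) covers both bytes of a two-byte character, while a plain ASCII byte or a stray continuation byte gives 0. Note that this is looser than real UTF-8 validation: it does not check that the number of continuation bytes agrees with the leading byte, and it accepts leading bytes such as 0xff that never occur in valid UTF-8. That seems acceptable for splitting words, but it is not a validator.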