Re: [PATCH v2 4/5] Make boundary characters for --color-words configurable

"Ping Yin" <pkufranky@xxxxxxxxx> · Mon, 5 May 2008 20:10:11 +0800

On Mon, May 5, 2008 at 1:00 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> "Ping Yin" <pkufranky@xxxxxxxxx> writes:
>
>  > For this example, both "/if/while/ (i />/>=/ /1/0/)" and "/if/while/
>  > (i >//=/ /1/0/)" are fine to me.
>
>  For the particular example, both are Ok, but for this other example:
>
>         -if (i > 1...
>         +if ((i > 1...
>
>  it probably is better to treat each non-word character as a separate
>  token, that is, it would be easier to read if we said "( stayed intact,
>  and another ( was added", instead of saying "( is changed to ((".
>
>  So "a run of punct chars" rule only sometimes produces better output but
>  otherwise worse output, and to make it produce better output consistently,
>  we would need to know the syntax of the target language for tokenization,
>  i.e. ">=" and ">" are comparison operators, while "(" is a token and "(("
>  is better split into two open-paren tokens.
>
>  So as a much longer-term subproject, we may want to teach the mechanism
>  language specific tokenization rules, just like we can specify the hunk
>  header pattern via gitattributes(5) to the diff output layer.
>
>  Of course, I do not expect you to do that during this round --- and if we
>  choose to keep the rule simple, I think it is probably better to use
>  one-char-one-token rule for now.
>
>
>  > And when designing, i think it's better to take multi-byte characters
>  > into account. For multi-byte characters (especially CJK), every
>  > character should be considered as a token.
>
>  If we take an idealistic view for the longer term, we should be tokenizing
>  even CJK sensibly, but unlike Occidental scripts, we cannot even use
>  inter-word spacing as a tokenization hint, so unless we are willing to learn
>  morphological analysis (which we are not for now), the best we can do is
>  to use one-char-one-token rule.
>
>         Side Note.  For Japanese we could cheat and often do a slightly
>         better job than simple one-char-one-token without having full
>         morphological analysis by splicing between Kanji and Kana
>         boundaries, but I'd prefer not to go there and keep the rules we
>         would use to the minimum.
>
>  I should stress that I said "character" in the above "punct" and "CJK"
>  discussions, not "byte".
>

The one-char-one-token and multi-char-one-token rules may raise
different implementation issues, and I think the multi-char-one-token
rule may be the more representative of the two. So for now, I prefer to
treat each run of word characters, and each single non-word character,
as a token.
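
To make the rule concrete, here is a rough sketch in Python. It is only
an illustration, not the patch's code. The name tokenize() and the choice
of ASCII letters, digits and underscore as "word characters" are my
assumptions here; the real set of boundary characters is meant to be
configurable. It works on characters rather than bytes, so each CJK
character ends up as its own token.

```python
import re

# Assumption (illustration only): a "word character" is an ASCII letter,
# digit or underscore. Every other non-whitespace character, including
# punctuation and CJK characters, is a token on its own.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|\S")


def tokenize(line):
    """Split a line into runs of word characters and single non-word characters.

    Whitespace separates tokens and is not returned.
    """
    return TOKEN_RE.findall(line)


old_tokens = tokenize("if (i > 1")
new_tokens = tokenize("if ((i > 1")
```

With this rule, old_tokens has one "(" token and new_tokens has two, so a
token-level diff can report "one ( was added" rather than "( changed to
((". Likewise ">=" splits into ">" and "=".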

-- 
Ping Yin