Re: [PATCH v2 4/5] Make boundary characters for --color-words configurable

"Ping Yin" <pkufranky@xxxxxxxxx> writes:

> For this example, both "/if/while/ (i />/>=/ /1/0/)" and "/if/while/
> (i >//=/ /1/0/)" are fine to me.

For that particular example, both are OK, but for this other example:

	-if (i > 1...
	+if ((i > 1...

it probably is better to treat each non-word character as a separate
token, that is, it would be easier to read if we said "( stayed intact,
and another ( was added", instead of saying "( is changed to ((".

So the "a run of punct chars" rule produces better output only some of
the time and worse output otherwise. To make it produce better output
consistently, we would need to know the syntax of the target language for
tokenization, i.e. that ">=" and ">" are comparison operators, while "("
is a token and "((" is better split into two open-paren tokens.

So as a much longer-term subproject, we may want to teach the mechanism
language-specific tokenization rules, just like we can specify the hunk
header pattern via gitattributes(5) to the diff output layer.

Of course, I do not expect you to do that during this round. If we
choose to keep the rule simple, I think it is probably better to use the
one-char-one-token rule for now.
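
To make the difference between the two rules concrete, here is a minimal
sketch in Python. It is not taken from the patch series: the function
names, the use of difflib as the diff engine, and the [-...-] / {+...+}
markers are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def tokens_by_run(line):
    # "Run of punct chars" rule: a word, a run of whitespace, or a
    # run of punctuation each form one token.
    return re.findall(r"\w+|\s+|[^\w\s]+", line)


def tokens_by_char(line):
    # One-char-one-token rule: words and whitespace runs stay whole,
    # but every punctuation character is a token of its own.
    return re.findall(r"\w+|\s+|[^\w\s]", line)


def word_diff(old, new, tokenize):
    # Diff the two token lists and mark removed text as [-...-]
    # and added text as {+...+}.
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i1 != i2:
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if j1 != j2:
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


if __name__ == "__main__":
    old, new = "if (i > 1", "if ((i > 1"
    print(word_diff(old, new, tokens_by_run))
    print(word_diff(old, new, tokens_by_char))
```

With the run rule, the sketch reports "(" as replaced by "((". With the
one-char-one-token rule, it reports a single "(" as added and leaves the
other one intact.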

> And when designing, I think it's better to take multi-byte characters
> into account. For multi-byte characters (especially CJK), every
> character should be considered as a token.

If we take an idealistic view for the longer term, we should be tokenizing
even CJK sensibly, but unlike Occidental scripts, we cannot even use
inter-word spacing as a tokenizing hint, so unless we are willing to learn
morphological analysis (which we are not for now), the best we can do is
to use the one-char-one-token rule.
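
As a rough sketch of what one-char-one-token could look like for CJK
text (in Python, for illustration only): the code point ranges below are
an assumption covering Hiragana, Katakana and the most common ideograph
blocks, not a complete definition of "CJK", and the helper names are
hypothetical.

```python
def is_cjk(ch):
    # Assumed ranges: Hiragana and Katakana, CJK Extension A, and
    # CJK Unified Ideographs. Other CJK blocks are not covered.
    cp = ord(ch)
    return (0x3040 <= cp <= 0x30FF
            or 0x3400 <= cp <= 0x4DBF
            or 0x4E00 <= cp <= 0x9FFF)


def tokenize(text):
    # Non-CJK word characters are grouped into words; every other
    # character, including each CJK character, is its own token.
    tokens = []
    word = ""
    for ch in text:
        if (ch.isalnum() or ch == "_") and not is_cjk(ch):
            word += ch
            continue
        if word:
            tokens.append(word)
            word = ""
        tokens.append(ch)
    if word:
        tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(tokenize("git は速い"))
```

The loop walks over characters (code points), so each CJK character
becomes exactly one token no matter how many bytes its encoding takes.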

	Side Note.  For Japanese we could cheat and often do a slightly
	better job than simple one-char-one-token without having full
	morphological analysis by splicing between Kanji and Kana
	boundaries, but I'd prefer not to go there and keep the rules we
	would use to the minimum.
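
For reference only, the Kanji/Kana cheat mentioned in the side note could
be sketched like this in Python. The script ranges and function names are
assumptions made for illustration, and this is not something proposed for
the patch.

```python
from itertools import groupby


def script_of(ch):
    # Classify a character by Unicode block (assumed ranges).
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def split_on_script_boundaries(text):
    # Start a new token whenever the script class changes.
    return ["".join(group) for _, group in groupby(text, key=script_of)]


if __name__ == "__main__":
    print(split_on_script_boundaries("東京に行く"))
```

Everything classified as "other" is lumped into one run here, so a real
tokenizer would still need separate rules for ASCII words and punctuation.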

I should stress that I said "character" in the above "punct" and "CJK"
discussions, not "byte".
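
The distinction matters because one character can span several bytes. A
small Python sketch, assuming UTF-8 encoded input (the helper names are
hypothetical), shows what goes wrong when splitting by byte:

```python
def split_chars(text):
    # One token per character (code point).
    return list(text)


def split_bytes(text):
    # One token per byte of the UTF-8 encoding.
    return [bytes([b]) for b in text.encode("utf-8")]


def is_valid_utf8(chunk):
    # True if the byte string decodes as UTF-8 on its own.
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    text = "速い"
    print(len(split_chars(text)), len(split_bytes(text)))
    print([is_valid_utf8(chunk) for chunk in split_bytes(text)])
```

Each of the two characters in the sample takes three bytes in UTF-8, and
none of the single-byte pieces is valid text by itself, so a byte-based
tokenizer would emit fragments that cannot be displayed.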
