"Ping Yin" <pkufranky@xxxxxxxxx> writes: > For this example,both "/if/while/ (i />/>=/ /1/0/)" and "/if/while/ > (i >//=/ /1/0/)" are fine to me. For the particular example, both are Ok, but for this other example: -if (i > 1... +if ((i > 1... it probably is better to treat each non-word character as a separate token, that is, it would be easier to read if we said "( stayed intact, and another ( was added", instead of saying "( is changed to ((". So "a run of punct chars" rule only sometimes produces better output but otherwise worse output, and to make it produce better output consistently, we would need to know the syntax of the target language for tokenization, i.e. ">=" and ">" are comparison operators, while "(" is a token and "((" is better split into two open-paren tokens. So as a very longer term subproject, we may want to teach the mechanism language specific tokenization rules, just like we can specify the hunk header pattern via gitattributes(5) to the diff output layer. Of course, I do not expect you to do that during this round --- and if we choose to keep the rule simple, I think it is probably better to use one-char-one-token rule for now. > And when designing, i think it's better to take multi-byte characters > into account. For multi-byte characters (especially CJK), every > character should be considered as a token. If we take an idealistic view for the longer term, we should be tokenizing even CJK sensibly, but unlike Occidental scripts, we cannot even use inter-word spacing for tokenizing hint, so unless we are willing to learn morphological analysis (which we are not for now), the best we can do is to use one-char-one-token rule. Side Note. For Japanese we could cheat and often do a slightly better job than simple one-char-one-token without having full morphological analysis by splicing between Kanji and Kana boundaries, but I'd prefer not to go there and keep the rules we would use to the minimum. 
I should stress that I said "character" in the above "punct" and "CJK" discussions, not "byte".