On Mon, May 5, 2008 at 1:00 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> "Ping Yin" <pkufranky@xxxxxxxxx> writes:
>
> > For this example, both "/if/while/ (i />/>=/ /1/0/)" and "/if/while/
> > (i >//=/ /1/0/)" are fine to me.
>
> For the particular example, both are Ok, but for this other example:
>
>     -if (i > 1...
>     +if ((i > 1...
>
> it probably is better to treat each non-word character as a separate
> token, that is, it would be easier to read if we said "( stayed intact,
> and another ( was added", instead of saying "( is changed to ((".
>
> So "a run of punct chars" rule only sometimes produces better output but
> otherwise worse output, and to make it produce better output consistently,
> we would need to know the syntax of the target language for tokenization,
> i.e. ">=" and ">" are comparison operators, while "(" is a token and "(("
> is better split into two open-paren tokens.
>
> So as a very longer term subproject, we may want to teach the mechanism
> language specific tokenization rules, just like we can specify the hunk
> header pattern via gitattributes(5) to the diff output layer.
>
> Of course, I do not expect you to do that during this round --- and if we
> choose to keep the rule simple, I think it is probably better to use
> one-char-one-token rule for now.
>
> > And when designing, I think it's better to take multi-byte characters
> > into account. For multi-byte characters (especially CJK), every
> > character should be considered as a token.
>
> If we take an idealistic view for the longer term, we should be tokenizing
> even CJK sensibly, but unlike Occidental scripts, we cannot even use
> inter-word spacing for tokenizing hint, so unless we are willing to learn
> morphological analysis (which we are not for now), the best we can do is
> to use one-char-one-token rule.
>
>     Side Note.  For Japanese we could cheat and often do a slightly
>     better job than simple one-char-one-token without having full
>     morphological analysis by splicing between Kanji and Kana
>     boundaries, but I'd prefer not to go there and keep the rules we
>     would use to the minimum.
>
> I should stress that I said "character" in the above "punct" and "CJK"
> discussions, not "byte".
>

The one-char-one-token and multi-char-one-token rules may have
different implementation issues, and I think the multi-char-one-token
rule may be the more representative of the two. So for now, I prefer
to treat both a run of word characters and a single non-word character
as a token.

--
Ping Yin
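
For illustration only, here is a minimal sketch of the rule preferred
above: a run of word characters is one token, and every single non-word
character (punctuation or whitespace) is a token of its own. It is
written in Python rather than C and is not git's implementation. The
name `tokenize`, the regular expression, and the CJK code point ranges
are all assumptions made for this example. Giving each CJK character its
own token follows the suggestion quoted earlier in the thread, and the
input is a decoded string, so "character" never means "byte".

```python
import re

# Assumed, incomplete set of CJK ranges (kana, CJK ideographs, Hangul).
# Hypothetical choice for this sketch; a real implementation would need
# a more careful definition.
_CJK = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af"

_TOKEN = re.compile(
    "[" + _CJK + "]"           # one CJK character per token
    "|[^\\W" + _CJK + "]+"     # a run of (non-CJK) word characters
    "|\\W"                     # any single non-word character
)


def tokenize(text):
    """Split text into tokens; joining the tokens gives back the text."""
    return _TOKEN.findall(text)


old = tokenize("if (i > 1")
new = tokenize("if ((i > 1")
# old: ['if', ' ', '(', 'i', ' ', '>', ' ', '1']
# new: ['if', ' ', '(', '(', 'i', ' ', '>', ' ', '1']
```

With these token lists, a token-level diff can report that one "("
stayed intact and another "(" was added, rather than "(" changing to
"((". The cost is that ">=" is split into ">" and "=", which is the
trade-off discussed in the quoted text.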