On 5/5/08, Junio C Hamano <junio@xxxxxxxxx> wrote:
>
> So the overall algorithm I think should be is:
>
>  - make the input into stream of tokens, where a token is either a run of
>    word characters only, non-word punct characters only, or whitespaces
>    only;
>
>  - compute the diff over the stream of tokens;
>
>  - emit common tokens in white, deleted in red and added in green.
>
> Notice that you do not have to special case LF in any way if you go this
> route.
>
> You could do this with only two classes, and use a different tokenization
> rule: a token is either a run of word characters only, or each byte of non
> word character becomes individual token.  This however would yield a
> suboptimal result:
>
>         -if (i > 1)
>         +while (i >= 0)
>
>         preimage        postimage       word-diff
>         6966                            -6966           if
>                         7768696c65      +7768696c65     while
>         20              20               20             ' '
>         28              28               28             (
>         69              69               69             i
>         20              20               20             ' '
>         3e              3e               3e             >
>                         3d              +3d             =
>         20              20               20             ' '
>         31                              -31             1
>                         30              +30             0
>         29              29               29             )
>
> This would give "/if/while/ (i >//=/ /1/0/)".  A logical unit ">=" is
> chomped into two tokens, which is suboptimal for the same reason why the
> output "H/ello/i,/" from the original char-diff based one was suboptimal.
>

For this example, both "/if/while/ (i />/>=/ /1/0/)" and
"/if/while/ (i >//=/ /1/0/)" are fine to me.

However, a run of non-word characters shouldn't always be treated as a
single token. For example:

- **************
+ ************

If just a '*' is removed, surely "************/*//" is better.

When designing this, I think it's better to take multi-byte characters
into account. For multi-byte characters (especially CJK), every character
should be its own token. If a token is always either a run of word
characters or a run of non-word characters, there is no way to make
each character a token. From this viewpoint, would it be better to define
a class of "single-token characters" (or something similar) instead of
relying on the non-word class?
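To make the two tokenization rules easier to compare, here is a rough
Python sketch. It is not git's implementation; the names RUN_TOKENS,
SINGLE_TOKENS, tokenize and word_diff are made up for illustration, and
difflib stands in for the real diff machinery. The "single-token" rule
assumes ASCII word characters only, so that everything else, including
CJK, becomes one token per character.

```python
import re
from difflib import SequenceMatcher

# Three-class rule from the quoted proposal: a token is a run of word
# characters, a run of non-word punctuation, or a run of whitespace.
RUN_TOKENS = re.compile(r"\w+|[^\w\s]+|\s+")

# Hypothetical "single-token character" rule: ASCII word runs and ASCII
# whitespace runs stay together; every other character (punctuation,
# CJK, ...) is a token by itself. re.ASCII restricts \w and \s to ASCII.
SINGLE_TOKENS = re.compile(r"\w+|\s+|[^\w\s]", re.ASCII)


def tokenize(text, pattern):
    """Split text into tokens; joining the tokens gives back the text."""
    return pattern.findall(text)


def word_diff(old, new, pattern):
    """Render a token-level diff as common text plus /removed/added/ groups."""
    a, b = tokenize(old, pattern), tokenize(new, pattern)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        else:
            # replace, delete and insert all fit the /old/new/ notation
            out.append("/%s/%s/" % ("".join(a[i1:i2]), "".join(b[j1:j2])))
    return "".join(out)


print(word_diff("if (i > 1)", "while (i >= 0)", RUN_TOKENS))
# /if/while/ (i />/>=/ /1/0/)
print(word_diff("if (i > 1)", "while (i >= 0)", SINGLE_TOKENS))
# /if/while/ (i >//=/ /1/0/)
print(tokenize("中文 test", SINGLE_TOKENS))
# ['中', '文', ' ', 'test']
```

With RUN_TOKENS the whole run of asterisks is one token, so any change
to it replaces the entire run; with SINGLE_TOKENS only the removed
asterisks are marked.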
Another consideration: whitespace information also matters to me when
using --color-words. However, I can't distinguish removed spaces from
added spaces in the current implementation. So how about using a
red/green background color for removed/added spaces?

--
Ping Yin