Re: [PATCH v2 4/5] Make boundary characters for --color-words configurable

"Ping Yin" <pkufranky@xxxxxxxxx> · Mon, 5 May 2008 09:40:47 +0800

On 5/5/08, Junio C Hamano <junio@xxxxxxxxx> wrote:
>
> So the overall algorithm I think should be is:
>
>  - make the input into stream of tokens, where a token is either a run of
>   word characters only, non-word punct characters only, or whitespaces
>   only;
>
>  - compute the diff over the stream of tokens;
>
>  - emit common tokens in white, deleted in red and added in green.
>
> Notice that you do not have to special case LF in any way if you go this
> route.
>
> You could do this with only two classes, and use a different tokenization
> rule: a token is either a run of word characters only, or each byte of non
> word character becomes individual token.  This however would yield a
> suboptimal result:
>
>    -if (i > 1)
>    +while (i >= 0)
>
>    preimage       postimage        word-diff
>    6966                            -6966       if
>                   7768696c65       +7768696c65 while
>    20             20                20         ' '
>    28             28                28         (
>    69             69                69         i
>    20             20                20         ' '
>    3e             3e                3e         >
>                   3d               +3d         =
>    20             20                20         ' '
>    31                              -31         1
>                   30               +30         0
>    29             29                29         )
>
> This would give "/if/while/ (i >//=/ /1/0/)".  A logical unit ">=" is
> chomped into two tokens, which is suboptimal for the same reason why the
> output "H/ello/i,/" from the original char-diff based one was suboptimal.
>

For this example, both "/if/while/ (i />/>=/ /1/0/)" and
"/if/while/ (i >//=/ /1/0/)" are fine to me. However, a run of non-word
characters shouldn't always be considered a single token.
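
To make the two tokenization rules concrete, here is a minimal Python
sketch (not git's C code). It assumes word characters are ASCII letters,
digits and underscore; the names RUN_TOKENS, CHAR_TOKENS and tokenize are
made up for illustration.

```python
import re

# Assumption: word characters are [A-Za-z0-9_]; the patch makes this configurable.
# Rule 1: a token is a run of word chars, a run of whitespace,
#         or a run of non-word punctuation.
RUN_TOKENS = re.compile(r"\w+|\s+|[^\w\s]+", re.ASCII)
# Rule 2: a token is a run of word chars, or any single non-word character.
CHAR_TOKENS = re.compile(r"\w+|\W", re.ASCII)

def tokenize(text, pattern):
    """Split text into tokens; joining the tokens gives back the input."""
    return pattern.findall(text)

line = "while (i >= 0)"
print(tokenize(line, RUN_TOKENS))   # ">=" stays one token
print(tokenize(line, CHAR_TOKENS))  # ">" and "=" are separate tokens
```

Under rule 1 the operator ">=" survives as one token; under rule 2 it is
split into ">" and "=".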

For example

  - **************
  + ************

If just a '*' is removed, surely "************/*//" is better.
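
A rough way to compare the two outcomes is to diff the token streams with
Python's difflib and print the result in the "/old/new/" notation used in
this thread. This is only a sketch: word_diff is a hypothetical helper,
difflib is standing in for git's own diff machinery, and the input is my
reading of the example above (a run of 14 asterisks shortened to 12).

```python
import difflib
import re

RUNS = r"\w+|\s+|[^\w\s]+"   # runs of word / whitespace / punctuation
CHARS = r"\w+|\W"            # word runs, every other character alone

def word_diff(old, new, pattern):
    """Diff two strings token by token, rendering changes as /old/new/."""
    a = re.findall(pattern, old)
    b = re.findall(pattern, new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        else:  # replace, delete or insert
            out.append("/%s/%s/" % ("".join(a[i1:i2]), "".join(b[j1:j2])))
    return "".join(out)

old, new = "*" * 14, "*" * 12
print(word_diff(old, new, RUNS))   # the whole run is replaced
print(word_diff(old, new, CHARS))  # only the trailing asterisks are removed
```

With run tokens the entire line shows up as changed; with one token per
character only the removed asterisks are marked.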

Also, I think the design should take multi-byte characters into
account. For multi-byte characters (especially CJK), every character
should be considered a token. If we treat either a run of word
characters or a run of non-word characters as a single token, there is
no way to make every character its own token.

So from this viewpoint, would it be better to define a class of
single-token characters (or something similar) instead of a class of
non-word characters?
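
One possible shape for the "single-token character" idea, sketched in
Python: ASCII word characters and whitespace form runs, and every other
character (punctuation as well as CJK) is a token by itself. The pattern
and the name tokenize_single are assumptions for illustration. Python
strings are sequences of code points, so the multi-byte decoding that C
code would need for UTF-8 is not shown here.

```python
import re

# Assumption: only ASCII letters, digits and underscore form multi-character
# word tokens; any other non-whitespace character is a token on its own.
SINGLE_TOKEN = re.compile(r"[A-Za-z0-9_]+|\s+|.", re.S)

def tokenize_single(text):
    """Word runs and whitespace runs stay together; the rest splits per character."""
    return SINGLE_TOKEN.findall(text)

print(tokenize_single("git 很好用!"))
print(tokenize_single("a >= b"))
```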

Another consideration: whitespace information is also important to me
when using --color-words. However, I can't distinguish between removed
and added spaces in the current implementation. So how about using a
red/green background color for removed/added spaces?
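
As a sketch of what I mean (Python, not the actual diff.c code): changed
whitespace tokens get a background color so they are visible, while other
changed tokens keep the usual foreground color. The function name paint
and the kind values are made up; the escape sequences are the standard
ANSI SGR codes.

```python
# Standard ANSI SGR escape sequences.
RED_FG, GREEN_FG = "\033[31m", "\033[32m"
RED_BG, GREEN_BG = "\033[41m", "\033[42m"
RESET = "\033[0m"

def paint(token, kind):
    """Color one token; kind is 'common', 'removed' or 'added'."""
    if kind == "common":
        return token
    removed = kind == "removed"
    if token.isspace():
        # A foreground color is invisible on spaces, so use the background.
        color = RED_BG if removed else GREEN_BG
    else:
        color = RED_FG if removed else GREEN_FG
    return color + token + RESET

print(paint("if", "removed") + paint("while", "added") + paint("  ", "added"))
```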

-- 
Ping Yin