Re: [PATCH v2 4/5] Make boundary characters for --color-words configurable

Junio C Hamano <junio@xxxxxxxxx> · Sun, 04 May 2008 13:16:47 -0700

Let's step back a bit and try to clarify the problem with a bit of
illustration.

The motivation behind "word diff" is that line-oriented diff is sometimes
unwieldy: a small change within a line shows up as the whole line changing.

    -Hello world.
    +Hi, world.

A naïve strategy to solve this would be to convert the input into one
character per line, representing each character by its codepoint, take
the diff between the results, and synthesize the output back, like this:

    preimage        postimage       char-diff
    48 H            48 H             48 H
    65 e                            -65 e
    6c l                            -6c l
    6c l                            -6c l
    6f o                            -6f o
                    69 i            +69 i
                    2c ,            +2c ,
    20 ' '          20 ' '           20 ' ' 
    77 w            77 w             77 w   
    6f o            6f o             6f o   
    72 r            72 r             72 r   
    6c l            6c l             6c l   
    64 d            64 d             64 d   
    2e .            2e .             2e .   
    0a '\n'         0a '\n'          0a '\n'

That would produce "H/ello/i,/ world.\n" (writing each change as
/deleted/added/), which is far from ideal for human consumption because
it chops the words "Hello" and "Hi" in the middle.
We can instead do this word by word (note that I am doing this as a
thought experiment, to illustrate what the problem is and what should
conceptually happen; I am not suggesting this particular implementation):

    preimage        postimage       word-diff
    48656c6c6f                      -48656c6c6f Hello
                    4869            +4869       Hi
                    2c              +2c         ,
    20              20               20         ' '
    776f726c64      776f726c64       776f726c64 world      
    2e              2e               2e         .
    0a              0a               0a         '\n'

Which would give you "/Hello/Hi,/ world.\n".
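
The thought experiment above can be tried out directly.  What follows is
only a minimal Python sketch, not anything from the patch: it encodes each
unit as one line of hex, diffs the lines with the standard library's
difflib.SequenceMatcher (standing in for a real line diff), and
synthesizes the result back in the /deleted/added/ notation.  The helper
names (hex_lines, diff_rows, synthesize) are made up for illustration, and
the word units are split by hand here.

```python
from difflib import SequenceMatcher

def hex_lines(units):
    """Encode each unit (a character or a word) as one line of hex bytes."""
    return [u.encode("utf-8").hex() for u in units]

def diff_rows(old_units, new_units):
    """Diff the hex lines; return (sign, hex, text) rows like the tables."""
    a, b = hex_lines(old_units), hex_lines(new_units)
    decode = lambda h: bytes.fromhex(h).decode("utf-8")
    rows = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            rows += [(" ", h, decode(h)) for h in a[i1:i2]]
        else:
            rows += [("-", h, decode(h)) for h in a[i1:i2]]
            rows += [("+", h, decode(h)) for h in b[j1:j2]]
    return rows

def synthesize(rows):
    """Turn rows back into text, writing each change as /deleted/added/."""
    out, old, new = [], [], []

    def flush():
        if old or new:
            out.append("/%s/%s/" % ("".join(old), "".join(new)))
            old.clear()
            new.clear()

    for sign, _hex, text in rows:
        if sign == " ":
            flush()
            out.append(text)
        elif sign == "-":
            old.append(text)
        else:
            new.append(text)
    flush()
    return "".join(out)

# One character per line.
chars = diff_rows(list("Hello world.\n"), list("Hi, world.\n"))
print(repr(synthesize(chars)))   # 'H/ello/i,/ world.\n'

# One word per line (units split by hand for this illustration).
words = diff_rows(["Hello", " ", "world", ".", "\n"],
                  ["Hi", ",", " ", "world", ".", "\n"])
print(repr(synthesize(words)))   # '/Hello/Hi,/ world.\n'
```

The only thing that differs between the two runs is how the input is cut
into units; the diff and the synthesis step are the same.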

Another favorite example of mine:

    -if (i > 1)
    +while (i >= 0)

    preimage       postimage        word-diff
    6966                            -6966       if
                   7768696c65       +7768696c65 while
    20             20                20         ' '
    28             28                28         (  
    69             69                69         i  
    20             20                20         ' '
    3e                              -3e         >
                   3e3d             +3e3d       >=
    20             20                20         ' '
    31                              -31         1  
                   30               +30         0  
    29             29                29         )

which should yield "/if/while/ (i />/>=/ /1/0/)".

So I think the overall algorithm should be:

 - turn the input into a stream of tokens, where a token is a run of
   word characters only, a run of non-word punctuation characters only,
   or a run of whitespace only;

 - compute the diff over the stream of tokens;

 - emit common tokens in white, deleted in red and added in green.

Notice that you do not have to special-case LF in any way if you go this
route; it is just another whitespace character.
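
The three steps above could be sketched in Python roughly as follows.
This is an illustration under stated assumptions, not the proposed
implementation: "word character" is taken to be whatever Python's \w
matches (the real boundary is what this patch series wants to make
configurable), difflib.SequenceMatcher stands in for the diff machinery,
common text is left unstyled rather than forced to white, and the names
tokenize, diff_tokens and render are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: \w defines "word characters".  Every character matches exactly
# one of the three alternatives, so joining the tokens restores the input.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")

RED, GREEN, RESET = "\033[31m", "\033[32m", "\033[0m"  # ANSI escapes

def tokenize(text):
    """Split text into runs of word, punctuation, or whitespace characters."""
    return TOKEN_RE.findall(text)

def diff_tokens(old, new):
    """Return (kind, text) pairs; kind is 'common', 'deleted' or 'added'."""
    a, b = tokenize(old), tokenize(new)
    result = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append(("common", "".join(a[i1:i2])))
            continue
        if i1 < i2:
            result.append(("deleted", "".join(a[i1:i2])))
        if j1 < j2:
            result.append(("added", "".join(b[j1:j2])))
    return result

def render(pieces):
    """Color deleted text red and added text green; leave common text as is."""
    style = {"common": "", "deleted": RED, "added": GREEN}
    return "".join(
        style[kind] + text + (RESET if style[kind] else "")
        for kind, text in pieces
    )

print(render(diff_tokens("if (i > 1)\n", "while (i >= 0)\n")))
```

LF needs no special handling here: it falls into the whitespace
alternative of the pattern like any other whitespace character.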

You could do this with only two classes and a different tokenization
rule: a token is either a run of word characters only, or a single
non-word byte, each of which becomes a token of its own.  This however
would yield a suboptimal result:

    -if (i > 1)
    +while (i >= 0)

    preimage       postimage        word-diff
    6966                            -6966       if
                   7768696c65       +7768696c65 while
    20             20                20         ' '
    28             28                28         (  
    69             69                69         i  
    20             20                20         ' '
    3e             3e                3e         >
                   3d               +3d         =
    20             20                20         ' '
    31                              -31         1  
                   30               +30         0  
    29             29                29         )

This would give "/if/while/ (i >//=/ /1/0/)".  A logical unit ">=" is
chopped into two tokens, which is suboptimal for the same reason that the
output "H/ello/i,/" from the original char-diff based one was suboptimal.
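
To compare the two tokenization rules side by side, here is a small
Python sketch.  It is an illustration only: the two regular expressions
are my assumed approximations of the rules described above (with \w
standing for "word character", and characters rather than bytes as the
unit), difflib.SequenceMatcher stands in for the diff, and word_diff is a
hypothetical helper that writes each change as /deleted/added/.

```python
import re
from difflib import SequenceMatcher

# Three classes: runs of word chars, runs of punctuation, runs of whitespace.
THREE_CLASS = re.compile(r"\w+|[^\w\s]+|\s+")
# Two classes: runs of word chars; every other character is its own token.
TWO_CLASS = re.compile(r"\w+|\W")

def word_diff(old, new, token_re):
    """Diff two strings token by token using the given tokenization rule."""
    a, b = token_re.findall(old), token_re.findall(new)
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        else:
            out.append("/%s/%s/" % ("".join(a[i1:i2]), "".join(b[j1:j2])))
    return "".join(out)

old, new = "if (i > 1)", "while (i >= 0)"
print(word_diff(old, new, THREE_CLASS))  # /if/while/ (i />/>=/ /1/0/)
print(word_diff(old, new, TWO_CLASS))    # /if/while/ (i >//=/ /1/0/)
```

With the two-class rule the ">" is reported as common and only "=" as
added, which is the split of ">=" described above.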