Let's step back a bit and try to clarify the problem with a bit of illustration.

The motivation behind "word diff" is that line-oriented diff is sometimes unwieldy.

    -Hello world.
    +Hi, world.

A naïve strategy to solve this would be to convert the input into one character per line while changing the representation of characters into their codepoints, take the diff between them, and synthesize the result back, like this:

    preimage    postimage   char-diff
    48 H        48 H         48 H
    65 e                    -65 e
    6c l                    -6c l
    6c l                    -6c l
    6f o                    -6f o
                69 i        +69 i
                2c ,        +2c ,
    20 ' '      20 ' '       20 ' '
    77 w        77 w         77 w
    6f o        6f o         6f o
    72 r        72 r         72 r
    6c l        6c l         6c l
    64 d        64 d         64 d
    2e .        2e .         2e .
    0a '\n'     0a '\n'      0a '\n'

That would produce "H/ello/i,/ world.\n", which is very suboptimal for human consumption because it chomps the words "Hello" and "Hi" in the middle.

We can instead do this word by word (note that I am doing this as a thought experiment, to illustrate what the problem is and what should conceptually happen, not suggesting this particular implementation):

    preimage    postimage   word-diff
    48656c6c6f              -48656c6c6f  Hello
                4869        +4869        Hi
                2c          +2c          ,
    20          20           20          ' '
    776f726c64  776f726c64   776f726c64  world
    2e          2e           2e          .
    0a          0a           0a          '\n'

Which would give you "/Hello/Hi,/ world.\n".

Another favorite example of mine:

    -if (i > 1)
    +while (i >= 0)

    preimage    postimage   word-diff
    6966                    -6966        if
                7768696c65  +7768696c65  while
    20          20           20          ' '
    28          28           28          (
    69          69           69          i
    20          20           20          ' '
    3e                      -3e          >
                3e3d        +3e3d        >=
    20          20           20          ' '
    31                      -31          1
                30          +30          0
    29          29           29          )

which should yield "/if/while/ (i />/>=/ /1/0/)".

So I think the overall algorithm should be:

 - make the input into a stream of tokens, where a token is either a run of word characters only, a run of non-word punctuation characters only, or a run of whitespace only;

 - compute the diff over the stream of tokens;

 - emit common tokens in white, deleted ones in red and added ones in green.

Notice that you do not have to special-case LF in any way if you go this route.
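To make the three steps above concrete, here is a rough sketch in Python. It is only an illustration of the idea, not a proposed implementation: the names TOKEN, tokenize and word_diff are made up for this example, Python's difflib.SequenceMatcher stands in for the real diff machinery, "word character" is assumed to mean whatever the regex class \w matches, and the output uses the "/deleted/added/" notation from the examples instead of colors.

```python
import re
from difflib import SequenceMatcher

# Assumption: a token is a run of word characters (\w), a run of
# whitespace, or a run of other (punctuation) characters.
TOKEN = re.compile(r"\w+|\s+|[^\w\s]+")


def tokenize(text):
    """Split text into word, whitespace and punctuation tokens."""
    return TOKEN.findall(text)


def word_diff(old, new):
    """Diff two strings token by token.

    Common tokens are emitted as-is; each changed region is emitted
    as "/deleted/added/", mirroring the notation used in the examples.
    """
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        else:
            # "replace", "delete" and "insert" all fit one pattern;
            # one of the two sides may simply be empty.
            out.append("/" + "".join(a[i1:i2]) + "/" + "".join(b[j1:j2]) + "/")
    return "".join(out)


print(repr(word_diff("Hello world.\n", "Hi, world.\n")))
# '/Hello/Hi,/ world.\n'
print(word_diff("if (i > 1)", "while (i >= 0)"))
# /if/while/ (i />/>=/ /1/0/)
```

The newline comes out of the tokenizer as an ordinary whitespace token, which is why no LF special case is needed.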
You could do this with only two classes, and use a different tokenization rule: a token is either a run of word characters only, or a single non-word byte, each of which becomes an individual token. This however would yield a suboptimal result:

    -if (i > 1)
    +while (i >= 0)

    preimage    postimage   word-diff
    6966                    -6966        if
                7768696c65  +7768696c65  while
    20          20           20          ' '
    28          28           28          (
    69          69           69          i
    20          20           20          ' '
    3e          3e           3e          >
                3d          +3d          =
    20          20           20          ' '
    31                      -31          1
                30          +30          0
    29          29           29          )

This would give "/if/while/ (i >//=/ /1/0/)". A logical unit ">=" is chomped into two tokens, which is suboptimal for the same reason that the output "H/ello/i,/" from the original char-diff based one was suboptimal.
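The two-class rule can be tried out with a small Python sketch as well. As before, this is only an illustration under assumptions: TWO_CLASS, tokenize_two_class and token_diff are hypothetical names, \w is assumed to define "word character", difflib.SequenceMatcher stands in for the diff engine, and changes are shown in the "/deleted/added/" notation.

```python
import re
from difflib import SequenceMatcher

# Assumption: a token is either a run of word characters (\w+) or
# exactly one non-word character (\W), whitespace included.
TWO_CLASS = re.compile(r"\w+|\W")


def tokenize_two_class(text):
    """Split text into word runs and single non-word characters."""
    return TWO_CLASS.findall(text)


def token_diff(old, new):
    """Diff the two token streams and render "/deleted/added/" hunks."""
    a, b = tokenize_two_class(old), tokenize_two_class(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        else:
            out.append("/" + "".join(a[i1:i2]) + "/" + "".join(b[j1:j2]) + "/")
    return "".join(out)


print(tokenize_two_class("i >= 0"))
# ['i', ' ', '>', '=', ' ', '0']
print(token_diff("if (i > 1)", "while (i >= 0)"))
# /if/while/ (i >//=/ /1/0/)
```

Because ">" and "=" are separate tokens here, the ">" is treated as common and only "=" shows up as added, which is exactly the split the three-class rule avoids.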