On Wed, May 07, 2008 at 01:02:54PM -0700, Junio C Hamano wrote:

> I suspect that heavily depends on the input text.  If you drop "different"
> in the example, the output becomes:
>
>   {-This|+Here} is {-a|+some} {-complete|+totally} {-sentence.|+text.}
>
> which is totally sensible.
>
> [...]
>
> which would yield on the output:
>
>   {-This |+Here }is {-a complete sentence.|+some totally different text.}

Sensible, perhaps, but I think the second one is much nicer for English
text (though the first is much nicer for code, I expect).

> It's all in diff_words_tokenize(), which I kept deliberately stupid so
> that people can tweak it to their liking.

OK; I haven't been following the thread too closely, and I wanted to
make sure this was a question of how the tokenizing works, and not a
fundamental problem with this approach. Thanks for the explanation.

-Peff
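
For anyone who wants to play with the tokenizing question outside of git,
here is a minimal Python sketch. It is not diff_words_tokenize() (that is C
code in the patch under discussion); the names tokenize_words, tokenize_code
and word_diff are made up for illustration, and Python's difflib stands in
for git's diff machinery, so the grouping of changed words will not
necessarily match what the patch prints.

```python
import difflib
import re


def tokenize_words(text):
    # Coarse tokens: split on whitespace only, so punctuation sticks to words.
    return text.split()


def tokenize_code(text):
    # Finer tokens: runs of word characters, or single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)


def word_diff(old, new, tokenize):
    # Diff two strings token by token and render changes as {-old|+new}.
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed, added = " ".join(a[i1:i2]), " ".join(b[j1:j2])
        if tag == "equal":
            out.append(removed)
        elif tag == "delete":
            out.append("{-" + removed + "}")
        elif tag == "insert":
            out.append("{+" + added + "}")
        else:  # "replace"
            out.append("{-" + removed + "|+" + added + "}")
    # Tokens are re-joined with single spaces, so original spacing is lost.
    return " ".join(out)


print(word_diff("foo(a, b)", "foo(a, c)", tokenize_words))
# foo(a, {-b)|+c)}
print(word_diff("foo(a, b)", "foo(a, c)", tokenize_code))
# foo ( a , {-b|+c} )
```

The same two inputs give a tighter diff with the finer tokenizer, which is
roughly why the preferred tokenization for code and for English prose can
differ. Only the tokenizer changes between the two calls; the diff step
itself is untouched.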