Hi folks,

I recently discovered --word-diff (or rather, --color-words; I found --word-diff when I started to hack on the git master version) and I had hoped it would make the unified diffs generated by git-diff more readable. More specifically, I had expected to get a normal unified diff, with colouring added to highlight the changes within the normal - and + lines (so you don't have to review the entire changed line to see that just a single word or character has changed). E.g., I would like to see:

    -a <r>b</r> c
    +a <g>x</g> c

Unfortunately, all --word-diff types currently depart from line-based - and + lines and show the new version of the file with the changed words (both old and new versions) inline, marked with coloring or a {- ... -} kind of syntax. E.g., with --word-diff=color, the above would look like:

    a <r>b</r><g>x</g> c

Personally, I think that the first example above is easier to read than the second one (at least for diffs of code).

I was planning to accompany this mail with a patch, so I've started hacking on this feature already. However, halfway through some cleanups and a prototype implementation of the above (breaking some of the other --word-diff formats in the process), I found that the current generalization of the different styles as stored in diff_words_styles[] does not apply cleanly enough to my intended output format. While trying to extend this generalization to something that would fit, I found that I don't actually understand the rationale behind --word-diff and its formats well enough to find a proper implementation (see the link at the bottom of this email for the unfinished code I hacked up until now).

So, here are some observations and questions about how --word-diff works or should work. Comments are welcome, both in general terms and in terms of the word-diff implementation (I know my way around there by now).

Intended use
------------
First of all, it seems that the main intended use of --word-diff is for LaTeX or HTML documents or similar, where blocks of running text might be hard-wrapped (and thus rewrapped after a small change). In these cases, a small change in wording could cause a lot of whitespace to shift, resulting in a big normal diff. The current word-diff implementation therefore simply does not show the whitespace (or rather, non-word) changes, since they're usually not relevant to LaTeX anyway.

Is this indeed the main use case, or are there others I'm missing?

Inexact output
--------------
Secondly, the --word-diff output currently never displays any changes to the non-word (whitespace) parts of a file. This makes sense for the LaTeX case, but sometimes you might want to get exact diff output instead.

At first glance this seems possible by specifying a word-regex of "." or something similar (i.e., make sure that the word regex matches everything). But this is problematic for newlines. The documentation states that stuff gets silently ignored if a newline ends up inside a word.

For the --word-diff=color format, this is probably an inherent limitation of the output format: you can't give a color to a newline (or a space, for that matter). Including a newline inside a {- ... -} block should not be a problem with the --word-diff=plain format, and something similar can be argued for the porcelain format.

An alternative approach would be to add a --word-diff-exact flag, which would cause the whitespace between two matches of the word regex to be treated as a word as well and have it included in the generated word-diff.

This still leaves an implementation problem: to generate the word-diff, the current code looks at one patch hunk at a time, collecting all the plus and minus lines. It then splits those lines into words and generates two new "files" containing one word per line. It then applies a diff to these two new files to get the word-diff.
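
As a rough illustration of the procedure just described, here is a small Python sketch (not git's actual C implementation). The names split_words and word_diff are made up for this example, the default word regex \S+ is only an assumption for the demo, Python's difflib stands in for git's own diff machinery, and the [-...-] / {+...+} markers merely imitate the plain style. It shows why re-wrapped text produces no word-level changes: everything between regex matches is dropped before the diff runs.

```python
import difflib
import re


def split_words(lines, word_regex=r"\S+"):
    """Collect every match of word_regex, in order.

    Text between matches (spaces, line breaks) is discarded, so the
    result corresponds to a "file" with one word per line.
    """
    words = []
    for line in lines:
        for match in re.finditer(word_regex, line):
            words.append(match.group(0))
    return words


def word_diff(minus_lines, plus_lines, word_regex=r"\S+"):
    """Diff the minus and plus lines of one hunk word by word."""
    old_words = split_words(minus_lines, word_regex)
    new_words = split_words(plus_lines, word_regex)

    # Each word is treated as one element, just like one line in a
    # normal line-based diff.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(new_words[j1:j2])
        if tag in ("delete", "replace"):
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")

    # The original whitespace is gone at this point, so the output is
    # simply re-joined with single spaces.
    return " ".join(pieces)
```

With these definitions, word_diff(["a b c"], ["a x c"]) gives "a [-b-] {+x+} c", while re-wrapping "foo bar" / "baz" into "foo" / "bar baz" yields "foo bar baz" with no markers at all.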

When a word contains a newline, the word would effectively be split into two words for the word-diff, which will probably screw up the output.

An obvious solution would be to use some escape sequence (e.g. \n) for a newline, though that might get messy and inefficient. An alternative that seems feasible is to use the empty word (i.e., an empty line in the word-diff "files") to mean a newline. This would mean that every newline always breaks a word into two, regardless of what the word regex is set to (but I guess that makes sense anyway?).

I also think this would allow complete diff output wrt whitespace and newlines, for output formats that support it: plain, (modified) porcelain and my proposed format.

Porcelain format
----------------
Lastly, the "porcelain" word-diff format seems a bit weird to me. Is the format specified somewhere, or are there any programs that use it currently? I couldn't find any users inside the git.git tree itself.

Looking at the format itself, it's a bit unclear to me what the ~ lines mean exactly. Commit 882749, which introduced the format, says they mean "newlines in the input", but I'm not sure if this means the old file, the new file or both.

In fact, it seems that this uncertainty makes the porcelain format ambiguous wrt newlines. For example, these two diff hunks:

    @@ -1,3 +1,2 @@
     a
    -b
     c

    @@ -1,3 +1,3 @@
     a
    -b
    +
     c

both look the same in porcelain format, except for the hunk header:

    @@ -1,3 +1,2 @@
     a
    ~
    -b
    ~
     c
    ~

    @@ -1,3 +1,3 @@
     a
    ~
    -b
    ~
     c
    ~

This is somewhat expected, of course, since the --word-diff formats are documented to show only changes to words, not to non-words/whitespace. So I guess it is expected that the output is ambiguous wrt whitespace, but if so, what is the use of this porcelain format? Wouldn't it make a lot more sense to make the format unambiguous and have it do a word-based diff at the same time?

I think this should be possible because of the explicit notation used for the newline.
For example, specifying the ~ lines to mean a newline in the old, new or both files, depending on the previous +, - or space prefixed line, is probably enough for this. By generating empty +, - or space prefixed lines when needed, every occurrence of ~ could be disambiguated.

The above two diff hunks would then become the following. The only difference is the near-empty line (just a space prefix) after -b in the second hunk.

    @@ -1,3 +1,2 @@
     a
    ~
    -b
    ~
     c
    ~

    @@ -1,3 +1,3 @@
     a
    ~
    -b
     
    ~
     c
    ~

So, these are some thoughts I've had while hacking on the code. As said, suggestions are welcome. I'd like my hacking to result in some useful patches, but right now I'm unsure what direction(s) I should be thinking/working in.

In case you're interested in the hacking I've done so far, I've put it up here:

http://git.stderr.nl/gitweb?p=matthijs/upstream/git.git;a=shortlog;h=refs/heads/word-diff

Most of it is broken or not properly tested, but it gives an idea of what kind of cleanup I've been doing.

Gr.

Matthijs