Re: Understanding and improving --word-diff

Thomas Rast <trast@xxxxxxxxxxxxxxx> · Mon, 8 Nov 2010 20:22:13 +0100

Matthijs Kooijman wrote:
> This is somewhat expected, of course, since the --word-diff formats are
> documented to show only changes to words, not to non-words/whitespace.
> So I guess it is expected that the output is ambigious wrt whitespace,
> but if so, what is the use of this porcelain format? Wouldn't it be make
> a lot more sense to make the format unambiguous and make it do
> word-based diff at the same time? I think this should be possible
> because of the explicit notation used for the newline.

It's not just the newlines.  You will note that when diffing e.g.

  a x  c
vs.
  a  b c

then because the word-diff engine doesn't bother with the non-word
parts, you will end up with (using ordinary --word-diff)

  a  [-x-]{+b+} c

or using the porcelain format

  $ git diff --no-index --word-diff=porcelain a b | cat -E
  diff --git 1/a 2/b$
  index 4f47486..b040ae0 100644$
  --- 1/a$
  +++ 2/b$
  @@ -1 +1 @@$
   a  $
  -x$
  +b$
    c$
  ~$

Note how the second space after 'x' disappeared.  The case that
Wincent explains in the email he linked (thanks; I didn't even see it
the first time around) shows a more drastic example.  So the logic
that inserts the space again is too simplistic.

What it does not is something like (haven't checked again) always
insert the postimage space that went with the preceding word.  What it
*would* have to do is actually compute differences in the space, while
at the same time ignoring them for the purposes of the LCS algorithm.
If it weren't for the customizable word-regex, it might be enough to
put the spaces in the words again but enable the internal equivalent
of -b.

> Looking at the format itself, it's a bit unclear to me what the ~ lines
> mean exactly. Commit 882749, which introduced the format says the mean
> "newlines in the input", but I'm not sure if this means the old file,
> new file or both.

As Matthieu already said, the history of --word-diff=porcelain is
just: I needed a way to pipe word-diff output to gitk for coloring,
without any messing around with the color strings.  The ~ is the
minimal required feature to at least let git communicate *some*
linebreaks.

You are probably right in that while the format leaves open the
possibility of showing addition and removal of space (although the
current engine does not really attempt to compute it), I did not think
far enough to let it show addition and removal of whitespace.

> For example, Specifying the ~ lines to mean a newline in the old, new or
> both files depending on the previous +, - or space prefixed line is
> probably enough for this. By generating empty +, - or space prefixed
> lines when needed, every occurence of ~ could be disambiguated.

AFAICS that breaks down if you couple a newline-change-aware consumer
with a non-aware producer (i.e., a new frontend with an old git),
since there is currently no guarantee that the line before a ~ is
context.

So you would have to introduce a new format (porcelain2?) anyway, and
if you're already going that route, then for simplicity I'd rather
have something like ~+ (or so) instead of requiring the parser to
track state.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html