Hi Tamo,

On Tue, 3 Sep 2024, 高橋全 (Tamo) wrote:

> What did you do before the bug happened? (Steps to reproduce your issue)
>
> mkdir test
> cd test
> git init
>
> cat >a.txt <<EOF
> NRZ /NZRQ/NBRQ/
> NRZ(C) /NZRCQ/
> NRZ(M) /NZRMQ/
> EOF
>
> git add a.txt
> git commit -m 1
>
> cat >a.txt <<EOF
> NRZ /NZRMQ/NZRCQ/NZRQ/NBRQ/
> EOF
>
> git diff --word-diff-regex=.
>
>
> What did you expect to happen? (Expected behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]{+ZRCQ/NZRQ/NBRQ/+}
> [-NRZ(C) /NZRCQ/-]
> [-NRZ(M) /NZRMQ/-]
>
> or anything whose hunk has three lines
>
>
> What happened instead? (Actual behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]
> [-NRZ(C) /N-]ZRCQ/N[-R-]Z[-(M) -]{+RQ+}/N[-Z-]{+B+}R[-M-]Q/
>
>
>
> What's different between what you expected and what actually happened?
>
> some newlines are ignored
> and the length of the hunk is wrong;
> git says "@@ -1,3 +1 @@" but the hunk has only 2 lines

The reason is the regular expression, which does not match newlines. See
https://github.com/git/git/blob/v2.46.0/diff.c#L2268-L2270, which shows
how the regular expression is compiled:

	if (regcomp(ecbdata->diff_words->word_regex,
		    o->word_regex,
		    REG_EXTENDED | REG_NEWLINE))

Note the flag `REG_NEWLINE`, described in detail at
https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html:

	If REG_NEWLINE is set, then <newline> shall be treated as an
	ordinary character except as follows:

	1. A <newline> in string shall not be matched by a <period>
	   outside a bracket expression or by any form of a non-matching
	   list (see XBD Regular Expressions).

You will note that you can see three lines in the output when using
`--word-diff-regex='[^ \t\n]+|[ \t\n]+'`:

	$ git diff --word-diff-regex='[^ \t\n]+|[ \t\n]+'
	diff --git a/a.txt b/a.txt
	index 278ea76..7e6f42f 100644
	--- a/a.txt
	+++ b/a.txt
	@@ -1,3 +1 @@
	NRZ [-/NZRQ/NBRQ/-]
	[-NRZ(C) /NZRCQ/-]
	[-NRZ(M) /NZRMQ/-]{+/NZRMQ/NZRCQ/NZRQ/NBRQ/+}

However, when including the slash in the boundary characters, the newlines
are suppressed again:

	$ git diff --word-diff-regex='[^/ \t\n]+|[/ \t\n]+'
	diff --git a/a.txt b/a.txt
	index 278ea76..7e6f42f 100644
	--- a/a.txt
	+++ b/a.txt
	@@ -1,3 +1 @@
	NRZ /[-NZRQ/NBRQ-]{+NZRMQ+}/[-NRZ(C) /-]NZRCQ/[-NRZ(M) /NZRMQ-]{+NZRQ/NBRQ+}/

I am fairly convinced that the reason for this behavior is that the word
diff machinery special-cases newlines and _never_ makes them part of the
"words", see https://github.com/git/git/blob/v2.46.0/diff.c#L2072-L2074
for the code implementing that logic.

Now, is this a bug? I can't really say. From my perspective, it is not:
When I implemented the original version of the word diff code, my use case
was LaTeX-formatted scientific articles, which traditionally do not
contain newline characters within paragraphs. I still have a hard time
wrapping my head around use cases where any pattern that includes a
newline would match what is considered a word.

I do remember how I struggled (and punted) on the question of how to
display newlines in word diffs. There just is no good way to do it that
would address all valid scenarios.

Ciao,
Johannes