Re: word-diff-regex=. sometimes ignores newlines

Hi Tamo,

On Tue, 3 Sep 2024, 高橋全 (Tamo) wrote:

> What did you do before the bug happened? (Steps to reproduce your issue)
>
> mkdir test
> cd test
> git init
>
> cat >a.txt <<EOF
> NRZ /NZRQ/NBRQ/
> NRZ(C) /NZRCQ/
> NRZ(M) /NZRMQ/
> EOF
>
> git add a.txt
> git commit -m 1
>
> cat >a.txt <<EOF
> NRZ /NZRMQ/NZRCQ/NZRQ/NBRQ/
> EOF
>
> git diff --word-diff-regex=.
>
>
> What did you expect to happen? (Expected behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]{+ZRCQ/NZRQ/NBRQ/+}
> [-NRZ(C) /NZRCQ/-]
> [-NRZ(M) /NZRMQ/-]
>
> or anything whose hunk has three lines
>
>
> What happened instead? (Actual behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]
> [-NRZ(C) /N-]ZRCQ/N[-R-]Z[-(M) -]{+RQ+}/N[-Z-]{+B+}R[-M-]Q/
>
>
>
> What's different between what you expected and what actually happened?
>
> some newlines are ignored
> and the length of the hunk is wrong;
> git says "@@ -1,3 +1 @@" but the hunk has only 2 lines

The reason is that the regular expression `.` does not match newlines. See
https://github.com/git/git/blob/v2.46.0/diff.c#L2268-L2270, which shows
how the regular expression is compiled:

		if (regcomp(ecbdata->diff_words->word_regex,
			    o->word_regex,
			    REG_EXTENDED | REG_NEWLINE))

Note the flag `REG_NEWLINE`, described in detail at
https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html:

	If REG_NEWLINE is set, then <newline> shall be treated as an
	ordinary character except as follows:

	1. A <newline> in string shall not be matched by a <period>
	   outside a bracket expression or by any form of a non-matching
	   list (see XBD Regular Expressions).
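
To see the effect of that flag in isolation, here is a minimal sketch
using plain POSIX `regcomp()`/`regexec()`. The helper `first_match_len()`
is a hypothetical name made up for this illustration; it is not part of
Git:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Git): compile `pattern` with `cflags`
 * and return the length of the first match in `text`, -1 if nothing
 * matches, or -2 if the pattern does not compile.
 */
int first_match_len(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	regmatch_t m;
	int len = -1;

	assert(pattern != NULL && text != NULL);

	if (regcomp(&re, pattern, cflags) != 0)
		return -2;

	/* One regmatch_t slot suffices: only the overall match is needed. */
	if (regexec(&re, text, 1, &m, 0) == 0)
		len = (int)(m.rm_eo - m.rm_so);

	regfree(&re);
	return len;
}
```

With `REG_EXTENDED | REG_NEWLINE`, calling `first_match_len(".", "\n", ...)`
should report no match, and `".+"` applied to `"ab\ncd"` should stop after
`ab`. Without `REG_NEWLINE`, the same `.+` runs across the newline and
covers all five bytes.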

You do get three lines in the output when using
`--word-diff-regex='[^ \t\n]+|[ \t\n]+'`:

	$ git diff --word-diff-regex='[^ \t\n]+|[ \t\n]+'
	diff --git a/a.txt b/a.txt
	index 278ea76..7e6f42f 100644
	--- a/a.txt
	+++ b/a.txt
	@@ -1,3 +1 @@
	NRZ [-/NZRQ/NBRQ/-]
	[-NRZ(C) /NZRCQ/-]
	[-NRZ(M) /NZRMQ/-]{+/NZRMQ/NZRCQ/NZRQ/NBRQ/+}

However, when including the slash in the boundary characters, the newlines
are suppressed again:

	$ git diff --word-diff-regex='[^/ \t\n]+|[/ \t\n]+'
	diff --git a/a.txt b/a.txt
	index 278ea76..7e6f42f 100644
	--- a/a.txt
	+++ b/a.txt
	@@ -1,3 +1 @@
	NRZ /[-NZRQ/NBRQ-]{+NZRMQ+}/[-NRZ(C) /-]NZRCQ/[-NRZ(M) /NZRMQ-]{+NZRQ/NBRQ+}/

I am fairly convinced that the reason for this behavior is that the word
diff machinery special-cases newlines and _never_ makes them part of the
"words", see https://github.com/git/git/blob/v2.46.0/diff.c#L2072-L2074
for the code implementing that logic.
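
As a rough illustration of that special-casing, here is a simplified
sketch and not Git's actual code: the function name `count_words()`, its
signature and the way it skips empty matches are all assumptions made for
this example. It repeatedly applies the word regex and cuts every match
off at the first newline, so a newline can never end up inside a word:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch (not Git's implementation): count the "words" that
 * `pattern` finds in `text`, truncating each match at the first newline
 * it contains. Returns -1 if the pattern does not compile.
 */
int count_words(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	size_t pos = 0, len;
	int count = 0;

	assert(pattern != NULL && text != NULL);
	len = strlen(text);

	if (regcomp(&re, pattern, cflags) != 0)
		return -1;

	while (pos < len) {
		regmatch_t m;
		size_t begin, end;
		const char *nl;

		if (regexec(&re, text + pos, 1, &m, 0) != 0)
			break;	/* no further words */

		begin = pos + (size_t)m.rm_so;
		end = pos + (size_t)m.rm_eo;

		/* Cut the word off at the first newline inside the match. */
		nl = memchr(text + begin, '\n', end - begin);
		if (nl)
			end = (size_t)(nl - text);

		if (end > begin) {
			count++;
			pos = end;
		} else {
			pos = begin + 1;	/* empty word: skip one byte */
		}
	}

	regfree(&re);
	return count;
}
```

Under these assumptions, even a pattern compiled without `REG_NEWLINE`,
such as `.+`, yields two words for `"ab\ncd"` rather than one word
spanning the line break.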

Now, is this a bug? I can't really say. From my perspective, it is not:
When I implemented the original version of the word diff code, my use case
was LaTeX-formatted scientific articles, which traditionally do not
contain newline characters within paragraphs. I still have a hard time
wrapping my head around use cases where any pattern that includes a
newline would match what is considered a word.

I do remember how I struggled (and punted) on the question of how to
display newlines in word diffs. There just is no good way to do it that would
address all valid scenarios.

Ciao,
Johannes
