Re: git grep: ^$ false match at end of file

On Fri, Jan 10, 2025 at 01:59:18PM +0100, Andreas Schwab wrote:

> On Jan 10 2025, Jeff King wrote:
> 
> > but it is weird to me that patmatch() will match "^$" to the end of the
> > buffer at all. It is just calling regexec_buf() behind the scenes, so I
> > guess this is just a weird special case there, and may even depend on
> > the regex implementation.
> 
> Shouldn't the matcher be called with REG_NOTEOL in that case?

Perhaps. If regexec_buf() is assuming we are feeding lines, then without
REG_NOTEOL it thinks the end of the buffer is the end of a line. Which
makes sense, but trips up this case because we are not feeding lines,
but rather a whole buffer. So the position after the final newline is
not the start of an empty line, but the true end of the buffer.

But what if the buffer doesn't end in a newline? In the example, the
file is something like "content\n".  But what if it was just "content"?
Then the end of the buffer really is the end of a line, isn't it? And
REG_NOTEOL would not be appropriate.

So without REG_NOTEOL:

  [this is wrong, per the report]
  $ echo content >file.txt
  $ git grep --no-index -n '^$' file.txt
  file.txt:2:

  [this is right]
  $ printf content >file.txt
  $ git grep --no-index -n '^$' file.txt
  $ echo $?
  1

and with it, like this patch:

diff --git a/grep.c b/grep.c
index 4e155ee9e6..7e3b6d9474 100644
--- a/grep.c
+++ b/grep.c
@@ -1467,7 +1467,7 @@ static int look_ahead(struct grep_opt *opt,
 		int hit;
 		regmatch_t m;
 
-		hit = patmatch(p, bol, bol + *left_p, &m, 0);
+		hit = patmatch(p, bol, bol + *left_p, &m, REG_NOTEOL);
 		if (hit < 0)
 			return -1;
 		if (!hit || m.rm_so < 0 || m.rm_eo < 0)

we get:

  [this is now right]
  $ echo content >file.txt
  $ git grep --no-index -n '^$' file.txt
  $ echo $?
  1

  [and this stays right]
  $ printf content >file.txt
  $ git grep --no-index -n '^$' file.txt
  $ echo $?
  1

but:

  [without REG_NOTEOL, this matches]
  $ printf content >file.txt
  $ git grep --no-index -n 't$' file.txt
  file.txt:1:content

  [but with that flag, it no longer does]
  $ printf content >file.txt
  $ git grep --no-index -n 't$' file.txt
  $ echo $?
  1
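
The same trade-off can be reproduced outside of git with plain POSIX
regcomp()/regexec(). This is only a minimal sketch: buf_matches() and
check_noteol_cases() are hypothetical helpers, not code from grep.c, and
the sketch feeds NUL-terminated strings rather than going through
regexec_buf(). Compiling with REG_NEWLINE is an assumption meant to
mirror how grep compiles its patterns. The "^$" match at the end of
"content\n" without REG_NOTEOL is deliberately not asserted, since that
may depend on the regex implementation.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of git: return 1 if "pattern" matches
 * anywhere in the NUL-terminated "buf", 0 if regexec() reports anything
 * other than a match, and -1 if the pattern fails to compile.
 */
static int buf_matches(const char *pattern, const char *buf, int eflags)
{
	regex_t re;
	int ret;

	/* REG_NEWLINE: "^" and "$" also anchor around embedded newlines. */
	if (regcomp(&re, pattern, REG_NEWLINE | REG_NOSUB))
		return -1;
	ret = regexec(&re, buf, 0, NULL, eflags);
	regfree(&re);
	return ret == 0;
}

/* Self-checks for the cases discussed above; call this from a main(). */
static void check_noteol_cases(void)
{
	/* No trailing newline: the end of the buffer is a real end of line. */
	assert(buf_matches("t$", "content", 0) == 1);
	/* REG_NOTEOL hides that end of line, so the match is lost. */
	assert(buf_matches("t$", "content", REG_NOTEOL) == 0);
	/* "$" right before a newline still matches under REG_NEWLINE. */
	assert(buf_matches("t$", "content\n", REG_NOTEOL) == 1);
	/* The phantom empty line after the final newline is suppressed. */
	assert(buf_matches("^$", "content\n", REG_NOTEOL) == 0);
}
```

The second assertion is the regression: REG_NOTEOL fixes "^$" but breaks
"t$" for a file with no trailing newline.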

So I do think "\n" at the end of the buffer is a special case. Perhaps
we should always omit it, and then leave REG_NOTEOL unset, making the
end of the buffer consistently the end of the final line. Like this,
which no longer matches "^$" but does match "t$":

diff --git a/grep.c b/grep.c
index 4e155ee9e6..c4bb9f1081 100644
--- a/grep.c
+++ b/grep.c
@@ -1646,6 +1646,8 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 
 	bol = gs->buf;
 	left = gs->size;
+	if (left && gs->buf[left-1] == '\n')
+		left--;
 	while (left) {
 		const char *eol;
 		int hit;

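The effect of dropping the final newline can also be checked in
isolation. Again just a sketch under assumptions: searchable_size() and
trimmed_buf_matches() are hypothetical names, and the buffer is copied
into a NUL-terminated string for plain regexec() instead of using
regexec_buf(), so it only approximates what grep_source_1() does.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of bytes to search, minus one trailing "\n". */
static size_t searchable_size(const char *buf, size_t size)
{
	if (size && buf[size - 1] == '\n')
		size--;
	return size;
}

/*
 * Hypothetical helper: match "pattern" against "buf" with its final
 * newline (if any) omitted and no REG_NOTEOL, so the end of the buffer
 * is always the end of the last line. Returns 1 on a match, 0 on no
 * match, -1 on allocation or compile failure.
 */
static int trimmed_buf_matches(const char *pattern, const char *buf)
{
	regex_t re;
	size_t len = searchable_size(buf, strlen(buf));
	char *copy;
	int ret;

	if (!len)
		return 0; /* nothing left to search, like "while (left)" */
	copy = malloc(len + 1);
	if (!copy)
		return -1;
	memcpy(copy, buf, len);
	copy[len] = '\0';
	if (regcomp(&re, pattern, REG_NEWLINE | REG_NOSUB)) {
		free(copy);
		return -1;
	}
	ret = regexec(&re, copy, 0, NULL, 0);
	regfree(&re);
	free(copy);
	return ret == 0;
}

/* Self-checks; call this from a main(). */
static void check_trimmed_cases(void)
{
	/* No phantom empty line after the final newline. */
	assert(trimmed_buf_matches("^$", "content\n") == 0);
	/* "t$" matches with or without a trailing newline in the file. */
	assert(trimmed_buf_matches("t$", "content\n") == 1);
	assert(trimmed_buf_matches("t$", "content") == 1);
	/* A genuinely empty line in the middle is still found. */
	assert(trimmed_buf_matches("^$", "a\n\nb\n") == 1);
}
```
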
-Peff