Re: [PATCH] t4062: stop using repetition in regex

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 08 Aug 2017 15:09:01 -0700

René Scharfe <l.s.r@xxxxxx> writes:

> Am 08.08.2017 um 16:49 schrieb Johannes Schindelin:
>> Hi René,
>> 
>> On Tue, 8 Aug 2017, René Scharfe wrote:
>> 
>>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
>>> That's the minimum acceptable value according to POSIX.  In t4062 we use
>>> 4096 repetitions in the test "-G matches", though, causing it to fail.
>>>
>>> Do the same as the test "-S --pickaxe-regex" in the same file and search
>>> for a single zero instead.  That still suffices to trigger the buffer
>>> overrun in older versions (checked with b7d36ffca02^ and --valgrind on
>>> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.
>> 
>> I am afraid not. The 4096 is precisely the page size required to trigger
>> the bug on Windows against which this regression test tries to safeguard.
>
> Checked with b7d36ffca02^ on MinGW now as well and found that it
> segfaults with the proposed change ten out of ten times.

That is a strange result but I can believe it.

The reason why I find it strange is that the test wants to run
diff_grep() in diffcore-pickaxe.c with one == NULL (because we are
looking at difference between an initial empty commit and the
current commit that adds 4096-zeroes.txt file), which makes the
current blob (i.e. a page of '0' that may be mmap(2)ed without
trailing NUL to terminate it) scanned via regexec() to look for the
search string.

I can understand why Dscho originally did "^0{4096}$"; it is to
ensure that the whole page is scanned for 4096 zeroes and then the
code would somehow make sure that there is no more byte until the
end of line, which will force regexec (but not regexec_buf that knows
where the buffer ends) to look at the 4097th byte that does not exist.

If you change the pattern to just "0" that is not anchored, I'd expect
regexec() that does not know how big the haystack is to just find "0"
at the first byte and happily return without segfaulting (because it
does not even have to scan the remainder of the buffer).

So I find Dscho's concern quite valid, even though I do believe you
when you say the code somehow segfaults.  I just can not tell
how/why it would segfault, though---it is possible that regexec()
implementation is stupid and does not realize that it can return early
reporting success without looking at the rest of the buffer, but
somehow I find it unlikely.

Puzzled.

> You get different results?  How is that possible?  The search string is
> NUL-terminated in each case, while the point of the test is that the
> file contents isn't, right?