Re: [PATCH] t4062: stop using repetition in regex

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Wed, 9 Aug 2017 23:49:46 +0200 (CEST)

Hi all,

On Wed, 9 Aug 2017, David Coppa wrote:

> On Wed, Aug 9, 2017 at 4:15 PM, René Scharfe <l.s.r@xxxxxx> wrote:
> > Am 09.08.2017 um 08:15 schrieb René Scharfe:
> >> Am 09.08.2017 um 07:29 schrieb Junio C Hamano:
> >>> René Scharfe <l.s.r@xxxxxx> writes:
> >>>
> >>>> Am 09.08.2017 um 00:26 schrieb Junio C Hamano:
> >>>>> ... but in the meantime, I think replacing the test with "0$" to
> >>>>> force the scanner to find either the end of line or the end of the
> >>>>> buffer may be a good workaround.  We do not have to care how many of
> >>>>> random bytes are in front of the last "0" in order to ensure that
> >>>>> the regexec_buf() does not overstep to 4097th byte, while seeing
> >>>>> that regexec() that does not know how long the haystack is has to do
> >>>>> so, no?
> >>>>
> >>>> Our regexec() calls strlen() (see my other reply).
> >>>>
> >>>> Using "0$" looks like the best option to me.
> >>>
> >>> Yeah, it seems that way.  If we want to be close/faithful to the
> >>> original, we could do "^0*$", but the part that is essential to
> >>> trigger the old bug is not the "we have many zeroes" (or "we have
> >>> 4096 zeroes") part, but "zero is at the end of the string" part, so
> >>> "0$" would be the minimal pattern that also would work for OBSD.
> >>
> >> Thought about it a bit more.
> >>
> >> "^0{4096}$" checks if the byte after the buffer is \n or \0 in the
> >> hope of triggering a segfault.  On Linux I can access that byte just
> >> fine; perhaps there is no guard page.  Also there is a 2 in 256
> >> chance of the byte being \n or \0 (provided its value is random),
> >> which would cause the test to falsely report success.
> >>
> >> "0$" effectively looks for "0\n" or "0\0", which can only occur
> >> after the buffer.  If that string is found close enough then we
> >> may not trigger a segfault and report a false positive.
> >>
> >> In the face of unreliable segfaults we need to reverse our strategy,
> >> I think.  Searching for something not in the buffer (e.g. "1") and
> >> considering matches and segfaults as confirmation that the bug is
> >> still present should avoid any false positives.  Right?
> >
> > And that's not it either. *sigh*
> >
> > If the 4097th byte is NUL or LF then we can only hope its access
> > triggers a segfault -- there is no other way to distinguish the
> > result from a legitimate match when limiting with REG_STARTEND.  So
> > we have to accept this flakiness.
> >
> > We can check the value of that byte with [^0] and interpret a
> > match as failure, but that adds negation and makes the test more
> > complex.
> >
> > ^0*$ would falsely match if the 4097th byte (and possibly later
> > ones) is 0.  We need to make sure we check for end-of-line after
> > the 4096th byte, not later.
> >
> > Sorry, Dscho, I thought we could take a shortcut here, but -- as
> > you wrote all along -- we can't.
> >
> > So how about this?
> >
> > -- >8 --
> > Subject: [PATCH] t4062: use less than 256 repetitions in regex
> >
> > OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
> > That's the minimum acceptable value according to POSIX.  In t4062 we use
> > 4096 repetitions in the test "-G matches", though, causing it to fail.
> > Combine two repetition operators, both less than 256, to arrive at 4096
> > zeros instead of using a single one, to fix the test on OpenBSD.
> >
> > Original-patch-by: David Coppa <dcoppa@xxxxxxxxxxx>
> > Signed-off-by: Rene Scharfe <l.s.r@xxxxxx>
> > ---
> >  t/t4062-diff-pickaxe.sh | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh
> > index 7c4903f497..1130c8019b 100755
> > --- a/t/t4062-diff-pickaxe.sh
> > +++ b/t/t4062-diff-pickaxe.sh
> > @@ -14,8 +14,10 @@ test_expect_success setup '
> >         test_tick &&
> >         git commit -m "A 4k file"
> >  '
> > +
> > +# OpenBSD only supports up to 255 repetitions, so repeat twice for 64*64=4096.
> >  test_expect_success '-G matches' '
> > -       git diff --name-only -G "^0{4096}$" HEAD^ >out &&
> > +       git diff --name-only -G "^(0{64}){64}$" HEAD^ >out &&
> >         test 4096-zeroes.txt = "$(cat out)"
> >  '
> >
> > --
> > 2.14.0
> 
> I think this should work w/o problems on OpenBSD:
> 
> $ uname -a
> OpenBSD open.bsdgeek.it 6.1 GENERIC#54 amd64
> $ git diff --name-only -G "^0{4096}$" HEAD^ 1>/dev/null
> fatal: invalid regex: invalid repetition count(s)
> $ git diff --name-only -G "^(0{64}){64}$" HEAD^ 1>/dev/null
> $

Perfect!

Thank y'all for being so thorough,
Dscho