René Scharfe <l.s.r@xxxxxx> writes:

>> In the face of unreliable segfaults we need to reverse our strategy,
>> I think.  Searching for something not in the buffer (e.g. "1") and
>> considering matches and segfaults as confirmation that the bug is
>> still present should avoid any false positives.  Right?
>
> And that's not it either. *sigh*
>
> If the 4097th byte is NUL or LF then we can only hope its access
> triggers a segfault -- there is no other way to distinguish the
> result from a legitimate match when limiting with REG_STARTEND.  So
> we have to accept this flakiness.
>
> We can check the value of that byte with [^0] and interpret a
> match as failure, but that adds negation and makes the test more
> complex.
>
> ^0*$ would falsely match if the 4097th byte (and possibly later
> ones) is 0.  We need to make sure we check for end-of-line after
> the 4096th byte, not later.

I do not have a strong objection against "^0{64}{64}$", but let me
make sure that I understand your rationale correctly.

We assume that we do not have an unmapped guard page to reliably
trap us.  So "^0{64}{64}$" will report a false success when that
4097th byte is either NUL or LF.  There is a 2/256 probability of
that happening, but we accept it.

A pattern "0$" will give us a false success in the same case, but
the reason it is worse is that, in addition, we get a false success
if that second page begins with "0\0" or "0\n".  The chance of that
happening is an additional 2/256 * 1/256.  Worse yet, the page
could start with "00\0" or "00\n", which adds another
2/256 * 1/65536.  We could have yet another "0" at the beginning of
that second page, which only increases the chance of us getting a
false success.

We consider that 2/256 * (1/256 + 1/256^2 + 1/256^3 + ...) is too
big an additional chance of getting a false success to be
acceptable, on top of the 2/256 chance of false success that we
have to and are willing to accept.
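The case analysis above can be sketched as a tiny model (purely
illustrative, not meant for the test suite; the helper names are
mine): the first page is 4096 '0' bytes, and `tail` stands for
whatever happens to follow it in memory.

```python
def full_false_success(tail: bytes) -> bool:
    # "^0{64}{64}$" falsely matches only if the byte right after the
    # page ends the line (NUL terminates the string, LF the line).
    return tail[0] in (0x00, 0x0A)

def dollar_false_success(tail: bytes) -> bool:
    # "0$" falsely matches if the run of '0's continues through the
    # tail until a NUL or LF is reached.
    for b in tail:
        if b in (0x00, 0x0A):
            return True
        if b != ord("0"):
            return False
    return False

# Exhaustively enumerate all two-byte tails, i.e. truncate the
# geometric series after its second term.
full = dollar = 0
for hi in range(256):
    for lo in range(256):
        tail = bytes((hi, lo))
        full += full_false_success(tail)
        dollar += dollar_false_success(tail)

print(full, dollar)  # 512 514 -- "0$" picks up the "0\0"/"0\n" cases
```

Out of 65536 two-byte tails, both patterns falsely match for the 512
tails whose first byte is NUL or LF; "0$" additionally matches the
two tails "0\0" and "0\n", which is exactly the 2/256 * 1/256 term
above.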
Now I am not good at doing math on the keyboard and text terminal,
but

    X = 1/256 + 1/256^2 + ...
    256X = 1 + X
    X = 1/255

so that additional rate of false success is 2/256 * 1/255.  So we
are saying that we accept a ~1/128 false-success rate, but an
additional ~1/32640 is unacceptable.

I do not know if I buy that argument, but I do think that the
additional false-success rate, even if it is minuscule, is
unnecessary.  So as long as everybody's regexp library is happy
with "^0{64}{64}$", let's use that.

Thanks.