Re: [PATCH] t4062: stop using repetition in regex

René Scharfe <l.s.r@xxxxxx> · Wed, 9 Aug 2017 19:20:43 +0200

Am 09.08.2017 um 18:07 schrieb Junio C Hamano:
> René Scharfe <l.s.r@xxxxxx> writes:
> 
>>> In the face of unreliable segfaults we need to reverse our strategy,
>>> I think.  Searching for something not in the buffer (e.g. "1") and
>>> considering matches and segfaults as confirmation that the bug is
>>> still present should avoid any false positives.  Right?
>>
>> And that's not it either. *sigh*
>>
>> If the 4097th byte is NUL or LF then we can only hope its access
>> triggers a segfault -- there is no other way to distinguish the
>> result from a legitimate match when limiting with REG_STARTEND.  So
>> we have to accept this flakiness.
> 
>> We can check the value of that byte with [^0] and interpret a
>> match as failure, but that adds negation and makes the test more
>> complex.
>>
>> ^0*$ would falsely match if the 4097th byte (and possibly later
>> ones) is 0.  We need to make sure we check for end-of-line after
>> the 4096th byte, not later.
> 
> I do not have a strong objection against "^0{64}{64}$", but let me
> make sure that I understand your rationale correctly.
> 
> We assume that we do not have an unmapped guard page to reliably
> trap us.  So "^0{64}{64}$" will report a false success when that
> 4097th byte is either NUL or LF.  There is 2/256 probability of that
> happening but we accept it.
> 
> A pattern "0$" will give us a false success in the same case, but
> the reason why it is worse is because in addition, we get a false
> success if that second page begins with "0\0" or "0\n".  The chance
> of that happening is additional 2/256 * 1/256.  Worse yet, the page
> could start with "00\0" or "00\n", which is additional 2/256 *
> 1/65536.  We could have yet another "0" at the beginning of that
> second page, which only increases the chance of us getting a false
> sucess.

I demonstrated a lack of logical thinking and now you bring on
probabilities!? ;-)

There could be any characters except NUL and LF between the 4096 zeros
and "0$" for the latter to match wrongly, no?  So there are 4095
opportunities for the misleading pattern in a page, with probabilities
like this:

  0$                          1/256 * 2/256
  .0$         254/256       * 1/256 * 2/256
  ..0$       (254/256)^2    * 1/256 * 2/256
  .{3}0$     (254/256)^3    * 1/256 * 2/256

  .{4094}0$  (254/256)^4094 * 1/256 * 2/256

That sums up to ca. 1/256 (did that numerically).  Does that make
sense?

> So we are saying that we accept ~1/100 false success rate, but
> additional ~1/30000 is unacceptable.
> 
> I do not know if I buy that argument, but I do think that additional
> false success rate, even if it is miniscule, is unnecessary.  So as
> long as everybody's regexp library is happy with "^0{64}{64}$",
> let's use that.

The parentheses are necessary ("^(0{64}){64}$"), at least on OpenBSD.
It doesn't accept consecutive repetition operators without them.  We
could use "^(00000000000000000000000000000000){128}$" instead if
there's a problem with that.

René