On 09.08.2017 at 18:07, Junio C Hamano wrote:
> René Scharfe <l.s.r@xxxxxx> writes:
>
>>> In the face of unreliable segfaults we need to reverse our strategy,
>>> I think. Searching for something not in the buffer (e.g. "1") and
>>> considering matches and segfaults as confirmation that the bug is
>>> still present should avoid any false positives. Right?
>>
>> And that's not it either. *sigh*
>>
>> If the 4097th byte is NUL or LF then we can only hope its access
>> triggers a segfault -- there is no other way to distinguish the
>> result from a legitimate match when limiting with REG_STARTEND. So
>> we have to accept this flakiness.
>>
>> We can check the value of that byte with [^0] and interpret a
>> match as failure, but that adds negation and makes the test more
>> complex.
>>
>> ^0*$ would falsely match if the 4097th byte (and possibly later
>> ones) is 0. We need to make sure we check for end-of-line after
>> the 4096th byte, not later.
>
> I do not have a strong objection against "^0{64}{64}$", but let me
> make sure that I understand your rationale correctly.
>
> We assume that we do not have an unmapped guard page to reliably
> trap us. So "^0{64}{64}$" will report a false success when that
> 4097th byte is either NUL or LF. There is 2/256 probability of that
> happening but we accept it.
>
> A pattern "0$" will give us a false success in the same case, but
> the reason why it is worse is because in addition, we get a false
> success if that second page begins with "0\0" or "0\n". The chance
> of that happening is additional 2/256 * 1/256. Worse yet, the page
> could start with "00\0" or "00\n", which is additional 2/256 *
> 1/65536. We could have yet another "0" at the beginning of that
> second page, which only increases the chance of us getting a false
> success.

I demonstrated a lack of logical thinking and now you bring up
probabilities!? ;-)

There could be any characters except NUL and LF between the 4096
zeros and the final "0" for "0$" to match wrongly, no? So there are
4095 opportunities for the misleading pattern in a page, with
probabilities like this:

   0$          1/256 * 2/256
   .0$         254/256 * 1/256 * 2/256
   ..0$        (254/256)^2 * 1/256 * 2/256
   .{3}0$      (254/256)^3 * 1/256 * 2/256
   ...
   .{4094}0$   (254/256)^4094 * 1/256 * 2/256

That sums up to ca. 1/256 (did that numerically). Does that make
sense?

> So we are saying that we accept ~1/100 false success rate, but
> additional ~1/30000 is unacceptable.
>
> I do not know if I buy that argument, but I do think that additional
> false success rate, even if it is minuscule, is unnecessary. So as
> long as everybody's regexp library is happy with "^0{64}{64}$",
> let's use that.

The parentheses are necessary ("^(0{64}){64}$"), at least on OpenBSD.
It doesn't accept consecutive repetition operators without them.

We could use "^(00000000000000000000000000000000){128}$" instead if
there's a problem with that.

René