On Thu, Feb 04 2021, Junio C Hamano wrote: > René Scharfe <l.s.r@xxxxxx> writes: > >>> - while (sz && >>> - !regexec_buf(regexp, data, sz, 1, ®match, flags)) { >>> + while (!regexec_buf(regexp, data, sz, 1, ®match, flags)) { >> >> This will loop forever for regexes that match an empty string. An >> example would be /$/. Silly, perhaps, but still I understand this check >> less as an optimization and more as a correctness/robustness thing. >> >>> flags |= REG_NOTBOL; >>> data += regmatch.rm_eo; >>> sz -= regmatch.rm_eo; >>> - if (sz && regmatch.rm_so == regmatch.rm_eo) { >>> + if (regmatch.rm_so == regmatch.rm_eo) { >>> data++; >>> sz--; >>> } >> >> Before, if the match was an empty string and there was more data after >> it, then the code would consume a character anyway, in order to avoid >> matching the same empty string again. With the patch, that character >> is consumed even if there is no more data. This leaves 'data' >> pointing beyond the buffer and 'sz' rolls over to ULONG_MAX. Oops. :( > > While I do not care too much about NUL in the haystack, I do not > mind [13/25] either. But this is bad. > > This whole thing reminds me of f53c5de2 (pickaxe: fix segfault with > '-S<...> --pickaxe-regex', 2017-03-18), by the way. René: Well spotted, thanks, and Oops. I've just sent a separate series with 01-10 of this one. I'm sitting on the diffcore-pickaxe patches for a while. I've got local fixes for a lot of issues in it, will fix this one too. I've optimized the PCRE v2 codepath a lot more in my local version. Current results are: 4209.1: git log -S'int main' <limit-rev>.. 0.38(0.36+0.01) 0.37(0.33+0.04) -2.6% 4209.2: git log -S'æ' <limit-rev>.. 0.51(0.47+0.04) 0.32(0.27+0.05) -37.3% 4209.3: git log --pickaxe-regex -S'(int|void|null)' <limit-rev>.. 0.72(0.68+0.03) 0.57(0.54+0.03) -20.8% 4209.4: git log --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>.. 0.60(0.55+0.02) 0.39(0.34+0.05) -35.0% 4209.5: git log --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.43(0.40+0.03) 0.50(0.44+0.06) +16.3% 4209.6: git log -G'(int|void|null)' <limit-rev>.. 0.64(0.55+0.09) 0.63(0.56+0.05) -1.6% 4209.7: git log -G'if *\([^ ]+ & ' <limit-rev>.. 0.64(0.59+0.05) 0.63(0.56+0.06) -1.6% 4209.8: git log -G'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.63(0.54+0.08) 0.62(0.55+0.06) -1.6% 4209.9: git log -i -S'int main' <limit-rev>.. 0.39(0.35+0.03) 0.38(0.35+0.02) -2.6% 4209.10: git log -i -S'æ' <limit-rev>.. 0.39(0.33+0.06) 0.32(0.28+0.04) -17.9% 4209.11: git log -i --pickaxe-regex -S'(int|void|null)' <limit-rev>.. 0.90(0.84+0.05) 0.58(0.53+0.04) -35.6% 4209.12: git log -i --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>.. 0.71(0.64+0.06) 0.40(0.37+0.03) -43.7% 4209.13: git log -i --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.43(0.40+0.03) 0.50(0.46+0.04) +16.3% 4209.14: git log -i -G'(int|void|null)' <limit-rev>.. 0.64(0.57+0.06) 0.62(0.56+0.05) -3.1% 4209.15: git log -i -G'if *\([^ ]+ & ' <limit-rev>.. 0.65(0.59+0.06) 0.63(0.54+0.08) -3.1% 4209.16: git log -i -G'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.63(0.55+0.08) 0.62(0.56+0.05) -1.6% The main optimization was just moving the compilation of the pattern up the stack into the diff_options struct, the current version in this thread re-compiles it every time.