Re: [PATCH 14/25] pickaxe -S: remove redundant "sz" check in while-loop

René Scharfe <l.s.r@xxxxxx> · Thu, 4 Feb 2021 17:16:29 +0100

Am 03.02.21 um 04:28 schrieb Ævar Arnfjörð Bjarmason:
> If we walk to the end of the string we just won't match the rest of
> the regex. This removes an optimization for simplicity's sake. In
> subsequent commits we'll alter this code more, and not having to think
> about this condition makes it easier to read.
>
> If we look at the context of what we're doing here the last thing we
> need to be worried about is one extra regex match. The real problem is
> that we keep matching after it's clear that the number of contains()
> for "A" and "B" is different. So we could be much smarter here.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
> ---
>  diffcore-pickaxe.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 208177bb40..8df76afb6e 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -82,12 +82,11 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
>  		regmatch_t regmatch;
>  		int flags = 0;
>
> -		while (sz &&
> -		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
> +		while (!regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {

This will loop forever for regexes that match an empty string, because
such a match still succeeds once 'sz' has reached 0.  An example would
be /$/.  Silly, perhaps, but still I understand this check less as an
optimization and more as a correctness/robustness thing.
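
A rough standalone sketch of that failure mode, not the actual git
code: it uses plain regexec() on a NUL-terminated string instead of
regexec_buf() on a sized buffer.  The count_matches() helper and its
guard_sz and max_iter parameters are made up for illustration, and the
cap exists only so that the runaway case terminates.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Count matches roughly the way the loop in contains() does.
 * Assumption: plain regexec() on a NUL-terminated string stands in
 * for regexec_buf().  "guard_sz" selects whether the while condition
 * keeps the "sz &&" check; "max_iter" (hypothetical) caps the loop.
 * Returns -1 if the pattern does not compile.
 */
static long count_matches(const char *pattern, const char *text,
			  int guard_sz, long max_iter)
{
	regex_t re;
	regmatch_t m;
	const char *data = text;
	size_t sz = strlen(text);
	int flags = 0;
	long cnt = 0;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NEWLINE))
		return -1;

	while ((!guard_sz || sz) && cnt < max_iter &&
	       !regexec(&re, data, 1, &m, flags)) {
		flags |= REG_NOTBOL;
		data += m.rm_eo;
		sz -= m.rm_eo;
		/* Skip one character after an empty match, if one is left. */
		if (sz && m.rm_so == m.rm_eo) {
			data++;
			sz--;
		}
		cnt++;
	}
	regfree(&re);
	return cnt;
}
```

With guard_sz set, /$/ against "abc" is counted once and the loop
stops.  Without it, the empty match at the end keeps succeeding and
only the cap ends the loop.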

>  			flags |= REG_NOTBOL;
>  			data += regmatch.rm_eo;
>  			sz -= regmatch.rm_eo;
> -			if (sz && regmatch.rm_so == regmatch.rm_eo) {
> +			if (regmatch.rm_so == regmatch.rm_eo) {
>  				data++;
>  				sz--;
>  			}

Before, if the match was an empty string and there was more data after
it, then the code would consume a character anyway, in order to avoid
matching the same empty string again.  With the patch, that character
is consumed even if there is no more data.  This leaves 'data'
pointing beyond the buffer and 'sz' rolls over to ULONG_MAX.  Oops. :(
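
The arithmetic can be sketched in isolation like this.  It is only an
illustration: struct cursor and advance() are hypothetical names, and an
offset is tracked instead of the real 'data' pointer so that the sketch
itself never forms an out-of-bounds pointer.

```c
#include <limits.h>
#include <stddef.h>

struct cursor {
	size_t off;		/* offset of 'data' from the buffer start */
	unsigned long sz;	/* bytes left in the buffer */
};

/*
 * Advance past a match spanning [so, eo).  With check_sz set this
 * behaves like the original "if (sz && ...)"; without it, like the
 * patched "if (...)".
 */
static void advance(struct cursor *c, size_t so, size_t eo, int check_sz)
{
	c->off += eo;
	c->sz -= eo;
	if ((!check_sz || c->sz) && so == eo) {
		c->off++;
		c->sz--;	/* wraps to ULONG_MAX if sz was already 0 */
	}
}
```

For an empty match at the very end of a three-byte buffer (so == eo ==
3), the checked variant leaves off == 3 and sz == 0, while the unchecked
one ends up with off == 4 and sz == ULONG_MAX.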

René