Re: [PATCH v4 2/2] grep/pcre2: better support invalid UTF-8 haystacks

On 24/01/2021 11:48, Ævar Arnfjörð Bjarmason wrote:
> Improve the support for invalid UTF-8 haystacks given a non-ASCII
> needle when using the PCREv2 backend.
> 
> This is a more complete fix for a bug I started to fix in
> 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26), now that PCREv2 has the PCRE2_MATCH_INVALID_UTF mode we
> can make use of it.
> 
> This fixes the sort of case described in 8a5999838e (grep: stess test
> PCRE v2 on invalid UTF-8 data, 2019-07-26), i.e.:
> 
>     - The subject string is non-ASCII (e.g. "ævar")
>     - We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
>     - We are using --ignore-case, or we're a non-fixed pattern
> 
> If those conditions were satisfied and we found invalid UTF-8 data
> in the haystack, PCREv2 might bark on it. In practice this only
> happened under the JIT backend (turned on by default on most platforms).
> 
> Ultimately this fixes a "regression" in b65abcafc7 ("grep: use PCRE v2
> for optimized fixed-string search", 2019-07-01). I'm putting that in
> scare quotes because before then we wouldn't properly support these
> complex case-folding, locale etc. cases either; it just broke in
> different ways.
> 
> There was a related bug in PCREv2, fixed in 10.36, that can be
> worked around by setting the PCRE2_NO_START_OPTIMIZE flag. Let's do
> that in those cases, and add tests for the bug.
> 
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
> ---
>  Makefile                        |  1 +
>  grep.c                          |  8 +++++-
>  grep.h                          |  4 +++
>  t/helper/test-pcre2-config.c    | 12 +++++++++
>  t/helper/test-tool.c            |  1 +
>  t/helper/test-tool.h            |  1 +
>  t/t7812-grep-icase-non-ascii.sh | 46 ++++++++++++++++++++++++++++++++-
>  7 files changed, 71 insertions(+), 2 deletions(-)
>  create mode 100644 t/helper/test-pcre2-config.c
> 
> diff --git a/Makefile b/Makefile
> index 4edfda3e00..42a7ed96e2 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -722,6 +722,7 @@ TEST_BUILTINS_OBJS += test-online-cpus.o
>  TEST_BUILTINS_OBJS += test-parse-options.o
>  TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
>  TEST_BUILTINS_OBJS += test-path-utils.o
> +TEST_BUILTINS_OBJS += test-pcre2-config.o
>  TEST_BUILTINS_OBJS += test-pkt-line.o
>  TEST_BUILTINS_OBJS += test-prio-queue.o
>  TEST_BUILTINS_OBJS += test-proc-receive.o
> diff --git a/grep.c b/grep.c
> index efeb6dc58d..e329f19877 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -492,7 +492,13 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  	}
>  	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> -		options |= PCRE2_UTF;
> +		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +
> +	if (PCRE2_MATCH_INVALID_UTF &&
> +	    options & (PCRE2_UTF | PCRE2_CASELESS) &&
> +	    !(PCRE2_MAJOR >= 10 && PCRE2_MAJOR >= 36))
                                   ^^^^^^^^^^^^^^^^^^
I assume that this should be s/_MAJOR/_MINOR/. ;-)
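
To make the difference concrete, here is a rough sketch (not part of the
patch) that takes the version numbers as plain arguments instead of the
PCRE2_MAJOR and PCRE2_MINOR macros from pcre2.h. All three function names
are hypothetical and exist only for illustration. As posted, the second
comparison tests the major version against 36, so the workaround would be
enabled on every current release, including 10.36 itself:

```c
/* The check as posted, with the version passed in as arguments. */
static int workaround_as_posted(int major, int minor)
{
	(void)minor; /* the posted check never looks at the minor version */
	return !(major >= 10 && major >= 36);
}

/* The check after s/_MAJOR/_MINOR/ on the second comparison. */
static int workaround_as_suggested(int major, int minor)
{
	return !(major >= 10 && minor >= 36);
}

/*
 * Hypothetical helper: returns 1 when major.minor is older than
 * want_major.want_minor, comparing the major version first.
 */
static int pcre2_version_older_than(int major, int minor,
				    int want_major, int want_minor)
{
	if (major != want_major)
		return major < want_major;
	return minor < want_minor;
}
```

The substituted form is enough for any 10.x release. It would still turn
the workaround on for a hypothetical 11.0, which is the only case where
the third variant behaves differently.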

> +		/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> +		options |= PCRE2_NO_START_OPTIMIZE;
>  
>  	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>  					 p->patternlen, options, &error, &erroffset,
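
As an aside, the condition guarding PCRE2_UTF in the hunk above can be
read as a small predicate. The sketch below is only an illustration under
assumptions: want_utf_mode() and pattern_has_non_ascii() are hypothetical
names, the latter is a simplified stand-in for git's has_non_ascii(), and
the flags are plain ints rather than fields of struct grep_opt and struct
grep_pat (the single "fixed" argument stands for p->fixed || p->is_fixed).

```c
/* Hypothetical stand-in for has_non_ascii(): is any byte >= 0x80? */
static int pattern_has_non_ascii(const char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 1;
	return 0;
}

/* Returns 1 when the pattern would be compiled in UTF mode. */
static int want_utf_mode(int ignore_locale, int utf8_locale,
			 int ignore_case, int fixed, const char *pattern)
{
	if (ignore_locale || !utf8_locale)
		return 0;
	if (!pattern_has_non_ascii(pattern))
		return 0;
	/* A case-sensitive fixed string is excluded from UTF mode. */
	if (!ignore_case && fixed)
		return 0;
	return 1;
}
```

This matches the three bullet points in the commit message: a non-ASCII
pattern, a UTF-8 locale, and either --ignore-case or a non-fixed pattern.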

ATB,
Ramsay Jones




