Re: [PATCH] grep: skip UTF8 checks explicitally

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 24 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Tue, 23 Jul 2019, Carlo Arenas wrote:
>
>> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
>> <Johannes.Schindelin@xxxxxx> wrote:
>> >
>> > So when PCRE2 complains about the top two bits not being 0x80, it fails
>> > to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
>> > indeed 0x80).
>>
>> the error is confusing but it is not coming from the pattern, but from
>> what PCRE2 calls
>> the subject.
>>
>> meaning that while going through the repository it found content that
>> it tried to match but
>> that it is not valid UTF-8, like all the png and a few txt files that
>> are not encoded as
>> UTF-8 (ex: t/t3900/ISO8859-1.txt).
>>
>> > Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
>> > does not happen here... But then, I don't need the `-I` option, and my
>> > output looks like this:
>>
>> -I was just an attempt to workaround the obvious binary files (like
>> PNG); I'll assume you
>> should be able to reproduce if using a non JIT enabled PCRE2,
>> regardless of version.
>>
>> my point was that unlike in your report, I didn't have any test cases
>> failing, because
>> AFAIK there are no test cases using broken UTF-8 (the ones with binary data are
>> actually valid zero terminated UTF-8 strings)
>
> Thank you for this explanation. I think it makes a total lot of sense.
>
> So your motivation for this patch is actually a different one than mine,
> and I would like to think that this actually strengthens the case _in
> favor_ of it. The patch kind of kills two birds with one stone.

This patch is really the wrong thing to do. Don't get me wrong, I'm
sympathetic to the *problem* and it should be solved, but this isn't the
solution.

The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
UTF-8 string so you, PCRE, don't need to re-check it". To quote
pcre2api(3):

    If you know that your pattern is a valid UTF string, and you want to
    skip this check for performance reasons, you can set the
    PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
    in‐ valid UTF string as a pattern is undefined. It may cause your
    program to crash or loop.

(Later it's discussed that "pattern" here is also "subject string" in
the context of pcre2_{jit_,}match()).

I know almost nothing about the internals of PCRE's engine, but much of
it's based on Perl's, which I know way better. Doing the equivalent of
this in perl (setting the UTF8 flag on a SV) *will* cause asserts to
fail and possibly segfaults.

It's likely through dumb luck that this is "working". I.e. yes the JIT
mode is less anal about these checks, so if you say grep for "Nguyễn
Thái" in UTF-8 mode and there's binary data you're satisfied not to find
anything in that binary data.

But if you are I'm willing to bet this ruins your day, e.g PCRE would
"skip ahead" a character 4-byte character because it sees a telltale
U+10000 through U+10FFFF start sequence, except that wasn't a character,
it was some arbitrary binary.

Now, what is the solution? I don't have any patches yet, but things I
intend to look at:

 1) We're oversupplying PCRE2_UTF now, and one such case is what's being
    reported here. I.e. there's no reason I can think of for why a
    fixed-string pattern should need PCRE2_UTF set when not combined
    with --ignore-case. We can just not do that, but maybe I'm missing
    something there.

 2) We can do "try utf8, and fallback". A more advanced version of this
    is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
    does. I was thinking something closer to just carrying two compiled
    patterns, and falling back on the ~PCRE2_UTF one if we get a
    PCRE2_ERROR_UTF8_* error.

One reason we can't "just" go back to the pre-ab/no-kwset behavior is
that one thing it does is fix a long-standing bug where we'd do the
wrong thing under locales && -i && UTF-8 string/pattern. More precisely
we'd punt it to the C library's matching function, which would probably
do the wrong thing.




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux