Re: [PATCH] grep: skip UTF8 checks explicitally

Carlo Arenas <carenas@xxxxxxxxx> · Tue, 23 Jul 2019 18:47:38 -0700

On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
>
> So when PCRE2 complains about the top two bits not being 0x80, it fails
> to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
> indeed 0x80).

The error is confusing, but it does not come from the pattern; it comes
from what PCRE2 calls the subject.

That means that while going through the repository it found content
that it tried to match but that is not valid UTF-8, like all the PNG
files and a few txt files that are not encoded as UTF-8
(e.g. t/t3900/ISO8859-1.txt).
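
To illustrate the pattern/subject distinction, here is a rough sketch
in Python (not PCRE2 itself; Python's strict UTF-8 decoder only stands
in for the validity check, and the sample byte strings are made up for
the example, not taken from the repository):

```python
def first_invalid_utf8_offset(data: bytes) -> int:
    """Return the offset of the first byte that breaks UTF-8 decoding, or -1."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return -1


# Hypothetical pattern: a valid two-byte UTF-8 sequence (0xc3 0xa9).
pattern = "\u00e9".encode("utf-8")

# Hypothetical subjects: ISO-8859-1 text and the 8-byte PNG signature.
latin1_subject = "caf\u00e9\n".encode("iso-8859-1")
png_subject = b"\x89PNG\r\n\x1a\n"

for name, data in [("pattern", pattern),
                   ("latin1 subject", latin1_subject),
                   ("png subject", png_subject)]:
    print(name, first_invalid_utf8_offset(data))
```

The pattern passes, while both subjects are rejected, which is the
situation described above: the complaint is about the file content
being searched, not about the regex.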

> Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
> does not happen here... But then, I don't need the `-I` option, and my
> output looks like this:

-I was just an attempt to work around the obvious binary files (like
PNG); I assume you should be able to reproduce it with a PCRE2 built
without JIT, regardless of version.

My point was that, unlike in your report, I didn't have any test cases
failing, because AFAIK there are no test cases using broken UTF-8 (the
ones with binary data are actually valid zero-terminated UTF-8 strings).
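
As a small sketch of why NUL bytes alone do not count as broken UTF-8
(again using Python's strict decoder as a stand-in for the check, with
made-up sample data rather than the actual test fixtures):

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical fixture: ASCII text with embedded NUL bytes.
# U+0000 is encoded as the single byte 0x00, so this is valid UTF-8.
nul_fixture = b"binary\x00file\x00"

# Hypothetical broken input: 0xff can never appear in UTF-8.
broken_fixture = b"binary\xfffile"

print(is_valid_utf8(nul_fixture), is_valid_utf8(broken_fixture))
```

A fixture like the first one looks binary to grep but would never trip
a UTF-8 validity check; only something like the second one would.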

Carlo