Re: [PATCH] grep: skip UTF8 checks explicitally

Carlo Arenas <carenas@xxxxxxxxx> · Fri, 26 Jul 2019 08:53:49 -0700

On Fri, Jul 26, 2019 at 8:15 AM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
> I'm not sure what a real fix for that is. Part of it is probably 8/8 in
> the series I mention below, but more generally we'd need to be more
> encoding aware at a much higher callsite than "grep". So e.g. we'd know
> that we match "binary" data as not-UTF-8. Now we just throw arbitrary
> bytes around and hope something sticks.

I haven't look yet at your proposed changes, but my gut feeling is that
the work to support invalid UTF in the yet unreleased PCRE version would
be needed as part of it, and therefore it might be better to keep PCRE
out of the main path until that gets released and can be relied upon.

kwset is not going away with this series anyway, regardless of the no-kwset
name on the branch.

> > If we're already deciding to paper over things, I'd much rather prefer
> > the simpler patch, i.e. Carlo's.
>
> As I noted upthread PCRE's own docs promise undefined behavior and fire
> and brimstone if that patch is applied. Those last two not
> guaranteed. So we need another solution.

in my original reply I mentioned I explicitly didn't do a test because of this
"undefined behavior", but I think it should be fair to mention that we are
already affected by that because using the JIT fast path does skip any
UTF-8 validations and is currently possible to get git into an infinite loop
or make it segfault when using PCRE.

in that line, I am not sure I understand the pushback against making that
explicit since it only makes both codepaths behave the same (bugs and
risks of burning alike)

Carlo