On Fri, Jul 26, 2019 at 8:15 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote: > I'm not sure what a real fix for that is. Part of it is probably 8/8 in > the series I mention below, but more generally we'd need to be more > encoding aware at a much higher callsite than "grep". So e.g. we'd know > that we match "binary" data as not-UTF-8. Now we just throw arbitrary > bytes around and hope something sticks. I haven't look yet at your proposed changes, but my gut feeling is that the work to support invalid UTF in the yet unreleased PCRE version would be needed as part of it, and therefore it might be better to keep PCRE out of the main path until that gets released and can be relied upon. kwset is not going away with this series anyway, regardless of the no-kwset name on the branch. > > If we're already deciding to paper over things, I'd much rather prefer > > the simpler patch, i.e. Carlo's. > > As I noted upthread PCRE's own docs promise undefined behavior and fire > and brimstone if that patch is applied. Those last two not > guaranteed. So we need another solution. in my original reply I mentioned I explicitly didn't do a test because of this "undefined behavior", but I think it should be fair to mention that we are already affected by that because using the JIT fast path does skip any UTF-8 validations and is currently possible to get git into an infinite loop or make it segfault when using PCRE. in that line, I am not sure I understand the pushback against making that explicit since it only makes both codepaths behave the same (bugs and risks of burning alike) Carlo