Hi Carlo,

On Tue, 23 Jul 2019, Carlo Arenas wrote:

> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
> <Johannes.Schindelin@xxxxxx> wrote:
> >
> > So when PCRE2 complains about the top two bits not being 0x80, it fails
> > to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
> > indeed 0x80).
>
> the error is confusing but it is not coming from the pattern, but from
> what PCRE2 calls the subject.
>
> meaning that while going through the repository it found content that
> it tried to match but that it is not valid UTF-8, like all the png and
> a few txt files that are not encoded as UTF-8
> (ex: t/t3900/ISO8859-1.txt).
>
> > Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
> > does not happen here... But then, I don't need the `-I` option, and my
> > output looks like this:
>
> -I was just an attempt to workaround the obvious binary files (like
> PNG); I'll assume you should be able to reproduce if using a non JIT
> enabled PCRE2, regardless of version.
>
> my point was that unlike in your report, I didn't have any test cases
> failing, because AFAIK there are no test cases using broken UTF-8 (the
> ones with binary data are actually valid zero terminated UTF-8 strings)

Thank you for this explanation. I think it makes a whole lot of sense.

So your motivation for this patch is actually a different one than mine,
and I would like to think that this actually strengthens the case _in
favor_ of it. The patch kind of kills two birds with one stone.

Thanks,
Dscho
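
P.S. For anyone following along, here is a minimal sketch of the kind of check discussed above: a UTF-8 continuation byte must have its top two bits set to 10, i.e. `(byte & 0xC0) == 0x80`. The helper names `is_continuation()` and `looks_like_utf8()` are made up for illustration; they are neither PCRE2 nor Git functions, and the check is structural only (PCRE2's real validation is stricter, as far as I know). It is meant to show why a Latin-1 subject trips such a check while a byte like 0xbb in the right position does not.

```c
#include <stddef.h>

/* Hypothetical helper: a continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper, structural check only: every lead byte must be
 * followed by the right number of continuation bytes. This sketch does
 * not reject overlong encodings or surrogates.
 * Returns 1 if buf[0..len) has that shape, 0 otherwise.
 */
static int looks_like_utf8(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char b = buf[i++];
		size_t need;

		if (b < 0x80)
			need = 0;	/* plain ASCII */
		else if ((b & 0xE0) == 0xC0)
			need = 1;	/* 110xxxxx: 2-byte sequence */
		else if ((b & 0xF0) == 0xE0)
			need = 2;	/* 1110xxxx: 3-byte sequence */
		else if ((b & 0xF8) == 0xF0)
			need = 3;	/* 11110xxx: 4-byte sequence */
		else
			return 0;	/* stray continuation or invalid lead byte */

		for (; need > 0; need--) {
			if (i >= len || !is_continuation(buf[i]))
				return 0;
			i++;
		}
	}
	return 1;
}
```

With this sketch, a Latin-1 byte such as 0xe9 followed by an ASCII character is rejected: 0xe9 looks like the lead byte of a 3-byte sequence, but the next byte is not a continuation byte.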