Hi Carlo, On Mon, 22 Jul 2019, Carlo Arenas wrote: > On Mon, Jul 22, 2019 at 12:42 PM Ævar Arnfjörð Bjarmason > <avarab@xxxxxxxxx> wrote: > > > > On Mon, Jul 22 2019, Johannes Schindelin wrote: > > > > > So I am fine with this patch. > > > > I'm not, not because I dislike the approach. I haven't made up my mind > > yet. > > my bad, I should had explained better the regression I was trying to > fix with this patch : > > $ git version > git version 2.22.0.709.g102302147b.dirty > $ git grep "Nguyễn Thái" > .mailmap:Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> > fatal: pcre2_match failed with error code -22: UTF-8 error: isolated > byte with 0x80 bit set > $ git grep -I "Nguyễn Thái" > .mailmap:Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> > po/TEAMS:Members: Nguyễn Thái Ngọc Duy <pclouds AT gmail.com> > po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>, 2012. > t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy > t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy > fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top > bits not 0x80 This is worrisome. The entire reason why we switch on Unicode mode is because it matters for regular expressions. Take for example this one: Nguyễ*n See how the ễ is supposed to be present an arbitrary number of times (including 0 times)? If PCRE2 was to interpret this as UTF-8, it would understand that the bytes 0xe1 0xbb 0x85 had to be repeated (or be missing), but if it interprets it _not_ as UTF-8, then it would handle _only the last_ byte, 0x85, that way, which will not work correctly. So when PCRE2 complains about the top two bits not being 0x80, it fails to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are indeed 0x80). Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this does not happen here... But then, I don't need the `-I` option, and my output looks like this: -- snip -- $ git grep --perl-regexp "Nguyễn Thái" .mailmap:Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> po/TEAMS:Members: Nguyễn Thái Ngọc Duy <pclouds AT gmail.com> po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>, 2012. t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy -- snap -- No error, you see? Or maybe it is a problem with the locale? Is your locale using an encoding different than UTF-8? > > I stopped paying attention to this error-with-not-JIT discussion when I > > heard that some other series going into next for Windows fixed that > > issue[1] > > > > But now we have it again in some form? My ab/no-kwset has a lot of tests > > for encodings & locales combined with grep, don't some of those trigger > > this? If so we should make any such failure a test & part of this patch. > > I don't have a windows environment to test, and I admit I wasn't > following clearly the whole conversation but at least in my case, I > never got any test to fail, and I haven't seen any test with broken > UTF-8. The problem was not Windows-specific. It was specific to PCRE2 versions compiled without JIT support (which was the case in Git for Windows' SDK, but I fixed this on July 5th). Your patch adequately addresses this problem: instead of erroring out in the non-JIT version and passing in the JIT version, it will let also the non-JIT version pass. But still, I think the issue you found merits some deeper investigation, methinks. > > As noted in [2] I'd be inclined to go the other way, if we indeed have > > some cases where PCRE skips its own checks does not dying actually give > > us anything useful? I'd think not, so just ignoring the issue seems like > > the wrong thing to do. > > we could reenable the checks by moving out of the JIT fast path in pcre so > that pcre2_match is used for all matches (with JIT compiled as an optimization) > and that might have the added benefit of solving the PCRE errors when JIT is > broken[1] > > $ git version > git version 2.22.0 > $ git grep "Nguyễn Thái" > .mailmap:Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> > po/TEAMS:Members: Nguyễn Thái Ngọc Duy <pclouds AT gmail.com> > po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>, 2012. > t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy > t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy > t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy > t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy > > important to note that at least on my system (macOS 10.14.6) the output of all > engines for grep is the same for a UTF-8 pattern match, even if using invalid > UTF-8 in the corpus, and I would expect that to be a better indicative > of correctness Right. It is _inconsistent_ at the moment, and with your patch it would at least be consistent again. So I still would prefer your patch over "going the other way". > FWIW GNU grep (with -P) also ignores UTF-8 errors using the same flag and > even PCRE seems to be going in the other direction with the recent addition > of PCRE2_MATCH_INVALID_UTF[1] Good point! Ciao, Dscho > > $ od -b int.p > 0000000 102 145 154 303 263 156 012 303 > $ pcre2grep -U 'Beló' int.p > Belón > > Carlo > > [1] https://public-inbox.org/git/20181209230024.43444-1-carenas@xxxxxxxxx/ > [2] https://lists.exim.org/lurker/message/20190528.141422.2af1e386.gl.html >