On Mon, Apr 3, 2023 at 2:39 PM Paul Eggert <eggert@xxxxxxxxxxx> wrote:
> I've recently done some bug-report maintenance about a set of GNU grep
> bug reports related to whether "grep -P '\d'" should match
> non-ASCII digits, and have some thoughts about coordinating GNU grep
> with git grep in this department.
>
> GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on
> Savannah's copy of GNU grep, and some sort of fix should appear in the
> next grep release. However, I'm leaving the GNU grep bug report open for
> now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly
> identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug
> found in latest stable release v3.10 of grep". I merged these related
> bug reports, and the oldest one, Bug#60690, is now the representative
> displayed in the GNU grep bug list[4].
>
> For this set of grep bug reports there's still a pending issue discussed
> in my recent email[5], which proposes a patch so I've tagged Bug#60690
> with "patch". The proposal is that GNU grep -P '\d' should revert to the
> grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match
> non-ASCII decimal digits.
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.
>
> When PCRE2 10.35 comes out,

Thanks for finding that. It's clearly a good idea to disable PCRE2_UCP
for anyone using those older, known-buggy versions of pcre2.
The latest is 10.42, per https://github.com/PCRE2Project/pcre2/releases

> it appears that 'git grep -P' will behave
> like 'grep -P' only if GNU grep adopts something like the solution
> proposed in [5].
>
> [1]: https://bugs.gnu.org/62605
> [2]: https://bugs.gnu.org/60690
> [3]: https://bugs.gnu.org/62552
> [4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> [5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg00004.html
> [6]: https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
> [7]: https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@xxxxxxxxxx/
> [8]: https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8

Thanks for all of the links.
However, have you seen justification (other than for compatibility with
some other tool or language) for allowing \d to match non-ASCII by
default, in spite of the risks?

IMHO, we have an obligation to retain compatibility with how grep -P
'\d' has worked since -P was added. I'd be happy to see an option to
enable the match-multibyte-digits behavior, but making it the default
seems too likely to introduce unwarranted risk.
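
To make the distinction concrete, here is a rough sketch using Python's re module as a stand-in. This is an analogy only: it is not PCRE2 and not grep, and the helper name matches_digit is made up for the illustration. Python's \d matches any Unicode decimal digit by default for str patterns and only [0-9] under re.ASCII, which mirrors the two behaviors being debated for grep -P '\d'.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside ASCII.
ARABIC_INDIC_THREE = "\u0663"


def matches_digit(ch: str, ascii_only: bool) -> bool:
    """Return True if the single character ch matches \\d.

    ascii_only=True restricts \\d to [0-9] (the long-standing behavior
    argued for above); ascii_only=False lets \\d match any Unicode
    decimal digit (the behavior proposed as the default).
    Hypothetical helper, for illustration only.
    """
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d", ch, flags) is not None


if __name__ == "__main__":
    for label, ch in (("ASCII 3", "3"), ("U+0663", ARABIC_INDIC_THREE)):
        print(label,
              "ascii-only:", matches_digit(ch, ascii_only=True),
              "unicode:", matches_digit(ch, ascii_only=False))
```

A script that validates input with \d and then hands it to something expecting [0-9] would behave differently under the two settings, which is the sort of risk I have in mind.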