Re: bug#60690: -P '\d' in GNU and git grep

Jim Meyering <jim@xxxxxxxxxxxx> · Mon, 3 Apr 2023 20:30:02 -0700

On Mon, Apr 3, 2023 at 2:39 PM Paul Eggert <eggert@xxxxxxxxxxx> wrote:
> I've recently done some bug-report maintenance about a set of GNU grep
> bug reports related to whether whether "grep -P '\d'" should match
> non-ASCII digits, and have some thoughts about coordinating GNU grep
> with git grep in this department.
>
> GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on
> Savannah's copy of GNU grep, and some sort of fix should appear in the
> next grep release. However, I'm leaving the GNU grep bug report open for
> now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly
> identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug
> found in latest stable release v3.10 of grep". I merged these related
> bug reports, and the oldest one, Bug#60690, is now the representative
> displayed in the GNU grep bug list[4].
>
> For this set of grep bug reports there's still a pending issue discussed
> in my recent email[5], which proposes a patch so I've tagged Bug#60690
> with "patch". The proposal is that GNU grep -P '\d' should revert to the
> grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match
> non-ASCII decimal digits.
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.
>
> When PCRE2 10.35 comes out,

Thanks for finding that.
It's clearly a good idea to disable PCRE2_UCP for those using those
older, known-buggy versions of pcre2.

The latest is 10.42, per https://github.com/PCRE2Project/pcre2/releases

> it appears that 'git grep -P' will behave
> like 'grep -P' only if GNU grep adopts something like the solution
> proposed in [5].
>
> [1]: https://bugs.gnu.org/62605
> [2]: https://bugs.gnu.org/60690
> [3]: https://bugs.gnu.org/62552
> [4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> [5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg00004.html
> [6]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
> [7]:
> https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@xxxxxxxxxx/
> [8]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8

Thanks for all of the links. However, have you seen justification
(other than for compatibility with some other tool or language) for
allowing \d to match non-ASCII by default, in spite of the risks?
IMHO, we have an obligation to retain compatibility with how grep -P
'\d' has worked since -P was added. I'd be happy to see an option to
enable the match-multibyte-digits behavior, but making it the default
seems too likely to introduce unwarranted risk.