Re: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P

On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:

You almost never want "everything Unicode considers a digit", and if you
do using e.g. \p{Nd} instead of \d would be better in terms of
expressing your intent.

For GNU grep, PCRE2_UCP is needed because of examples like the ones Gro-Tsen and Karl Petterssen supplied. If there's some disagreement about how \d should behave with UTF-8 data, the GNU grep hackers should let the Perl community decide that; that is, GNU grep can simply follow PCRE2's lead. But GNU grep does need PCRE2_UCP for \b etc.
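
As a rough illustration of why \b and \w care about this, here is a minimal sketch in Python. It uses Python's re module, not PCRE2, so it is only an analogy: I am assuming that re.ASCII behaves roughly like matching without PCRE2_UCP, and the default str behavior roughly like matching with it. The sample text and variable names are made up for the example.

```python
import re

text = "Ævar Arnfjörð"

# ASCII-only word characters (analogy for matching without PCRE2_UCP):
# non-ASCII letters such as "Æ", "ö" and "ð" are not word characters,
# so words get split at them.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Unicode-aware word characters (analogy for matching with PCRE2_UCP).
unicode_words = re.findall(r"\w+", text)

# \b sees a word boundary between "Æ" and "v" only in ASCII mode.
ascii_boundary = re.search(r"\bvar\b", text, re.ASCII)
unicode_boundary = re.search(r"\bvar\b", text)

print(ascii_words)                   # ['var', 'Arnfj', 'r']
print(unicode_words)                 # ['Ævar', 'Arnfjörð']
print(ascii_boundary is not None)    # True
print(unicode_boundary is not None)  # False
```

In the ASCII-only case, a search for the whole word "var" wrongly succeeds inside "Ævar", which is the kind of surprise the \b examples point at.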

	$ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
	53360a53361,53362
	> git-gui/po/ja.po:"- 第1行: 何をしたか、を1行で要約。\n"
	> git-gui/po/ja.po:"- 第2行: 空白\n"

Although I don't speak Japanese, I have dealt with quite a bit of Japanese text in a previous job, and personally I would prefer \d to match those two lines, as they do contain digits. So to me this particular case is not a good argument that git grep should not match those lines.

Of course other people might prefer differently, and there are cases where I want to match only ASCII digits. I've learned in the past to use [0-9] for that. I hope PCRE2 never changes [0-9] to match anything but ASCII digits when searching UTF-8 text.
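
The difference between \d and [0-9] can be sketched as follows. Again this is Python's re module rather than PCRE2, so treat it as an analogy under that assumption, not as a statement about what grep -P does. The sample string is hypothetical and uses U+FF11 (FULLWIDTH DIGIT ONE), a Unicode decimal digit that is not an ASCII digit.

```python
import re

# "\uff11" is FULLWIDTH DIGIT ONE (Unicode category Nd), not ASCII "1".
line = "第\uff11行"

unicode_digit = re.compile(r"\d")          # any Unicode decimal digit
ascii_range = re.compile(r"[0-9]")         # explicit ASCII range only
ascii_digit = re.compile(r"\d", re.ASCII)  # \d restricted to [0-9]

print(unicode_digit.search(line) is not None)  # True
print(ascii_range.search(line) is not None)    # False
print(ascii_digit.search(line) is not None)    # False
```

Writing [0-9] states the ASCII-only intent in the pattern itself, so it does not depend on how the engine is configured to interpret \d.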



