Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P

On Tue, Jan 17, 2023 at 7:19 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
> > To argue with myself here, I'm not so sure that just making this the
> > default isn't the right move, especially as the GNU grep maintainer
> > seems to be convinced that that's the right thing for grep(1).
>
> OK.

I think that is definitely the right thing to do for grep, because the
current behaviour can only be described as a bug (and a bad one at
that), but after all the pushback and performance testing, I am no
longer convinced it needs to be the default for git, because the
negatives outweigh the positives.

First there is the performance hit, which is inevitable because there
are just a lot more characters to match when UCP tables are being
used. Second, PCRE2_UCP itself might not be what you want when
matching code: for example, numbers in code are never going to use
digits outside what ASCII provides, and identifiers allow a narrower
set of valid characters than what you would expect from all written
human languages in history.
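
As a rough illustration of that second point (a sketch only: it uses
Python's re module as a stand-in, where re.ASCII approximates matching
without PCRE2_UCP and the default str behaviour approximates matching
with it; the sample string is made up and this is not PCRE2 or git
code):

```python
import re

# ASCII-only classes: an approximation of PCRE2 without PCRE2_UCP.
ascii_digit = re.compile(r"\d+", re.ASCII)
ascii_word = re.compile(r"\w+", re.ASCII)

# Unicode-aware classes (Python's default for str patterns): an
# approximation of PCRE2 with PCRE2_UCP.
unicode_digit = re.compile(r"\d+")
unicode_word = re.compile(r"\w+")

# Hypothetical sample line: ASCII digits, two ARABIC-INDIC digits
# (U+0663, U+0664) and a word containing a non-ASCII letter (U+00E9).
arabic_indic = "\u0663\u0664"
sample = "x = 12; y = " + arabic_indic + "; caf\u00e9 = 1"

ascii_digits = ascii_digit.findall(sample)
unicode_digits = unicode_digit.findall(sample)
ascii_words = ascii_word.findall(sample)
unicode_words = unicode_word.findall(sample)

print(ascii_digits)    # only the ASCII digit runs
print(unicode_digits)  # also includes the ARABIC-INDIC run
print(ascii_words)     # "caf" is cut off before the non-ASCII letter
print(unicode_words)   # the accented word is kept whole
```

The Unicode-aware \d picks up digits that would never be a numeric
literal in source code, which is the kind of mismatch described above.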

Lastly, even with PCRE2_UCP enabled, our current logic for word
matches is still broken, because the code still uses a definition of
"word" that was written outside of what the regex engines provide and
that roughly matches what you would expect of C identifiers in the
ASCII era.
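
To make that concrete, here is a minimal sketch of a byte-oriented
word check of the kind described (assumption: an ASCII
"letters, digits, underscore" test applied to the bytes around a
match; word_char and is_word_match are hypothetical names for
illustration, not git's actual implementation):

```python
import re

def word_char(byte: int) -> bool:
    """ASCII-only word test: [0-9A-Za-z_], as for a C identifier."""
    return (
        48 <= byte <= 57       # '0'..'9'
        or 65 <= byte <= 90    # 'A'..'Z'
        or 97 <= byte <= 122   # 'a'..'z'
        or byte == 95          # '_'
    )

def is_word_match(haystack: bytes, start: int, end: int) -> bool:
    """True if haystack[start:end] has no word byte on either side."""
    before_ok = start == 0 or not word_char(haystack[start - 1])
    after_ok = end == len(haystack) or not word_char(haystack[end])
    return before_ok and after_ok

text = "un caf\u00e9"
line = text.encode("utf-8")       # the last letter becomes two bytes
start = line.find(b"caf")
end = start + len(b"caf")

# The byte after "caf" is 0xC3, the first byte of the UTF-8 sequence
# for U+00E9, which the ASCII test does not treat as a word character.
byte_level_says_word = is_word_match(line, start, end)

# A Unicode-aware word boundary sees the accented letter as part of
# the word, so "caf" is not a whole word here.
unicode_says_word = re.search(r"\bcaf\b", text) is not None

print(byte_level_says_word, unicode_says_word)
```

The two answers disagree, so turning on Unicode properties in the
regex engine alone does not fix a word check done this way.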

> > Of course all of this is predicated on us wanting to leave this as an
> > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
> > entire question,
>
> Making it opt-out would also require a similar knob to turn the
> "flag" off, be it a configuration variable or a command line option,
> wouldn't it?  I tend to agree with you that it makes sense to make
> it a goal to take us closer to "grep -P" from GNU---do they have
> such an opt-out knob?  If not, let's make it simple by turning it
> always on, which would be the simplest ;-)

GNU grep -P has no such knob and will likely never have one.

So for now, I think we should acknowledge the bug, provide an option
for people who might need the fix, and fix all the other problems we
have, which will include changes in PCRE2 as well to better fit our
use case.

Carlo
