Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)

On Fri, 3 Feb 2023 at 18:01, Jeff King <peff@xxxxxxxx> wrote:
>
> On Thu, Feb 02, 2023 at 05:22:37PM +0100, demerphq wrote:
>
> > I've been lurking watching some of the regex discussion on the list
> > and personally I think it is asking for trouble to use "whatever regex
> > engine is traditional in a given environment" instead of just choosing
> > a good open source engine and using it consistently everywhere.  I
> > don't really buy the arguments I have seen to justify a policy of "use
> > the standard library version"; regex engines vary widely in
> > performance and implementation and feature set, and even the really
> > good ones do not entirely agree on every semantic[1], so if you don't
> > standardize you will be forever dealing with bugs related to those
> > differences.
>
> I think this is a perennial question for portable software: is it better
> to be consistent across platforms (by shipping our own regex engine), or
> consistent with other programs on the same platform (by using the system
> regex).

While this seems to be an impartial reading of the question at hand,
I think it frames the debate in a way that has the potential to
bias[1] the discussion in favour of a particular policy outcome[2]. It
implies that all other things are equal between the two options
presented, and frames the question as something that comes down to
personal preference between one form of consistency and another.

But I am not sure that all other things are equal here, at least as
far as regex engines go. I think there is evidence that suggests that
depending on the system regex engine introduces long term recurring
costs that would not be incurred if git chose to link to a specific
library everywhere. I think the framing ignores the implications for
the wider ecosystem and toolchains. For instance, if the behavior of
git grep differs by platform then scripts that use git grep to
automate git become less portable, or more expensive to maintain
because they have to work around the inconsistencies.  It overlooks
the costs of training
humans (arguably low) versus the costs of training computers (arguably
high). It also assumes that being consistent with other programs on
the same platform is inherently beneficial, when that doesn't seem to
be clearly established[3]. It also assumes that there are only two
options. Maybe there are more. Maybe there is a third or fourth option
as well. One would be to use a specific library for internal regexes,
and let the command line use the system library by default. Another
would be to make the default engine be one that ships with git, and
that users that want "platform compatibility" should use an option to
get it, much as they would with -P to enable PCRE. You mentioned that
there is already such an engine, but it isn't well tested. Maybe that
should be changed.

I think that if you look at other broadly ported projects there is
evidence that owning your own dependencies makes a project easier and
cheaper to port to new platforms. The more platforms you target the
more room there is for inconsistencies and the more costs there will
be to deal with them. If portability of git is a goal and minimizing
the cost of doing so is a secondary goal then I would say that using a
specific library will make achieving that goal easier and lower cost
than depending on the system libraries. There would be a high initial
cost to do the switch, and then a low cost in the long run.  As far as
I know Perl is more broadly ported than git, based on the fact that
when we migrated to git a number of the Perl maintainers could not use
git on their platforms (VAX comes to mind), and Perl definitely adopts
the view that it is better to own your dependencies, and wraps and
hides system inconsistencies as much as possible to make porting as
easy as possible. So that is one precedent to consider. No doubt there
are many other long running projects with precedents in this area.
What does Vi or Vim do? What does Emacs do? Etc.

Anyway, my opinion on these things doesn't matter that much, I am just
a git user who happens to have a passion for regular expressions and
regular expression engines and I am happily served by the PCRE support
in my git build. So I can't answer these questions for the project.
But I do think the questions that need to be answered are more complex
and nuanced than deciding which of two forms of consistency is more
important.

Thanks for hearing me out.

cheers,
yves

[1] PS I do not mean to imply that you are *intending* to bias the
discussion. I think your writing and measured approach indicates
strongly that you intend to be impartial, but nevertheless I think
this way of framing the question does bias the debate.

[2] My thinking about the framing of this question is probably pretty
strongly influenced by a recently released report from the BBC on
journalistic bias in presenting questions of economic debate. The
report presents some interesting perspectives on how framing questions
and data can bias the discussion even though the person presenting the
data or asking the question actually intended to be impartial.
Regardless of your position on what regex engine git should use I
think it is a good read. Especially in this day and age of austerity
politics and debt-ceiling debates around the world.
https://www.bbc.co.uk/aboutthebbc/documents/thematic-review-taxation-public-spending-govt-borrowing-debt.pdf

[3] Personally I use git grep with patterns and the -P flag far more
than I use grep with anything other than a constant string. These days
I barely ever use grep, so to me what it does is entirely irrelevant
to what git does.  I suspect I am not alone in this.
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


