Re: [PATCH v4 05/10] userdiff: add and use for_each_userdiff_driver()

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sat, 10 Apr 2021 14:30:10 +0200

On Sat, Apr 10 2021, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
>
>>> That plan involves an "git rm -r compat/regex" and a compat/pcre
>>> instead, I have some long-left-over patches for that.
>>
>> OK. I was more worried about platforms where it was cumbersome to
>> install pcre. If we are shipping it as a vendored library, then at least
>> the dependency management is on us, and not the user. I do worry a
>> little about running into complications with building or debugging it.
>
> Please don't.  I prefer not shipping compat/pcre ourselves; having
> to worry about version skew and picking up upstream security fixes
> in time, etc. for an external library that is well maintained is not
> what we need.

I'm not submitting patches for this now, I personally don't care much if
PCRE became a required dep (as in you need the library) and we didn't
ship it.

I just assumed if I'd propose that that we'd want "make" to work on more
bare-bones systems, since we e.g. carry a copy of
sha1collisiondetection/ partially to facilitate that.

But FWIW, and to the extent I've got time etc. my rough plan for this
larger PCRE conpsiary was:

 1. Get the current pickaxe-prep-for-PCRE patches to land.

 2. I've got patches on top of that which move over to PCRE, so as with
    the existing grep.c code we could drop kwset entirely (declare that
    if you need optimized fast search, you can install PCRE).

 3. git rm kwset.[ch]

 4. Migrate more things to some light wrapper "do a regex match" API
    (just small modifications to grep.c) that'll optimistically be able
    to do BRE/ERE/PCRE matches. If you've got PCRE then PCRE can handle
    them all (it's got a RX dialecttranslation interface).

 5. All our regcomp/regexec should be gone in favor if this simple
    replacement API.

Which is where this "PCRE everywhere" would come in, at this point we'll
be carrying a compat/regex/ which we could just as well replace with a
compat/pcre/.

The reason it would matter to have a hard prereq on PCRE is because it
would simplify and speed up a lot of code in various places to be able
to make an assumption that it's there. E.g. for things that do "match a
line" like pickaxe we could do clever anchoring and match the whole diff
at once, and also things like streaming the diff out as we consume it
into PCRE's stream matching API.

We also have a lot of things purely on the C level that would be
simpler, e.g. (just one thing I remember) grep.c has 40 lines of code to
emulate \b<word>\b, but we could ... just use \b.

To Jeff's comment upthread:

> And while it may build portably everywhere, that may involve extra
> work and not integrate with our build knobs (e.g., it looks like it
> uses autoconf)

I already did that work in an old WIP branch I have. I haven't run it
again so maybe this summary is inaccurate, but here's the commit:
https://github.com/avar/git/commit/79256ee4a1

So basically it won't build with autoconf or whatever, just with our own
build system after we'd wget the relevant PCREv2 source files into
compat/pcre/.

The only things PCRE really needed autoconf or its own build system for
was to define 20-ish macros which are either options we can switch, or
(looking at this now) 4x probes like "HAVE_STDINT_H" which we could add
to compat.mak.uname and friends.

If we wanted it to Just Work we could run that script and "git commit"
it, or just leave it at a make warning saying "you don't have PCRE,
either install it using your pkg system, or run this handy shellscript".

But yeah, if Junio's paranoid about the security aspect we could skip
that "./get-pcre.sh && git commit" step.

I'm getting some deja-vu writing this next part, I'm pretty sure I've
written some version of the same E-Mail before, in any case:

 A. This wouldn't be anything new. We have some old bitrotting copy of
    GNU awk's copy of glibc's regex engine in compat/regex. It's trivial
    to e.g. get it to exhaust all memory on your system.

    Sure, PCRE could have bugs, but at least we *could* update it. As
    you may remember I abandoned my last attempt to update compat/regex/

 B. Even if you assume some bug like that, I think realistically no user
    anywhere would be vulnerable to a copy we'd ship in compat/pcre/,
    any vendor would ignore it and build the prereq themselves. It's
    purely a developer aid.

    Of course it could have some CVE, but it wouldn't be mean a git
    point release.

 C. Even if the regex engine has an CVE/RCE we have no feature anywhere
    to execute arbitrary user regexes, anyone who could inject one can
    already run arbitrary git commands.

 D. The *one* exception to C) is gitweb's not-on-by-default "grep the
    source tree" feature.

    So first its users would be better off than they are now now. You
    can trivially DoS any box that has it exposed, secondly the union of
    gitweb users + those that would turn that on + build their own git
    for running a public-facing site is probably 0 people these days.

 E. Even if you assume some combination of B..D isn't true I genuinely
    don't know why this concern of PCRE being a special
    security/vulnerability concern comes from, it seems to be brought up
    any time the topic is discussed.

    The implicit context is that we already process regexes using a
    plethora of engines, and are using anything from AIXs, SunOS's,
    Window's, OSX's, GNU libc's etc. regex engines to do so.

    I haven't seen anything to suggest that PCRE's security record is
    notably worse than even one of those pieces of software, and right
    now we're exposing any bugs in the union of all those engines.

    It's quite the opposite, unlike those engines PCRE actually gives
    you really good features to limit resource consumption and regex
    features in the event that you don't trust your regex or your
    input.

    I'd actually trust for things like processing a "git grep on the
    web" with appropriate resource constraints.

    The two solutions in that space are basically to use DFA engine as
    e.g. Google's re2 does. That means living with a severely restricted
    set of regex features, or to have bells and whistles as PCRE's NFA
    does, but one that has tweakable resource limits.