On Sat, Apr 10 2021, Junio C Hamano wrote: > Jeff King <peff@xxxxxxxx> writes: > >>> That plan involves an "git rm -r compat/regex" and a compat/pcre >>> instead, I have some long-left-over patches for that. >> >> OK. I was more worried about platforms where it was cumbersome to >> install pcre. If we are shipping it as a vendored library, then at least >> the dependency management is on us, and not the user. I do worry a >> little about running into complications with building or debugging it. > > Please don't. I prefer not shipping compat/pcre ourselves; having > to worry about version skew and picking up upstream security fixes > in time, etc. for an external library that is well maintained is not > what we need. I'm not submitting patches for this now, I personally don't care much if PCRE became a required dep (as in you need the library) and we didn't ship it. I just assumed if I'd propose that that we'd want "make" to work on more bare-bones systems, since we e.g. carry a copy of sha1collisiondetection/ partially to facilitate that. But FWIW, and to the extent I've got time etc. my rough plan for this larger PCRE conpsiary was: 1. Get the current pickaxe-prep-for-PCRE patches to land. 2. I've got patches on top of that which move over to PCRE, so as with the existing grep.c code we could drop kwset entirely (declare that if you need optimized fast search, you can install PCRE). 3. git rm kwset.[ch] 4. Migrate more things to some light wrapper "do a regex match" API (just small modifications to grep.c) that'll optimistically be able to do BRE/ERE/PCRE matches. If you've got PCRE then PCRE can handle them all (it's got a RX dialecttranslation interface). 5. All our regcomp/regexec should be gone in favor if this simple replacement API. Which is where this "PCRE everywhere" would come in, at this point we'll be carrying a compat/regex/ which we could just as well replace with a compat/pcre/. The reason it would matter to have a hard prereq on PCRE is because it would simplify and speed up a lot of code in various places to be able to make an assumption that it's there. E.g. for things that do "match a line" like pickaxe we could do clever anchoring and match the whole diff at once, and also things like streaming the diff out as we consume it into PCRE's stream matching API. We also have a lot of things purely on the C level that would be simpler, e.g. (just one thing I remember) grep.c has 40 lines of code to emulate \b<word>\b, but we could ... just use \b. To Jeff's comment upthread: > And while it may build portably everywhere, that may involve extra > work and not integrate with our build knobs (e.g., it looks like it > uses autoconf) I already did that work in an old WIP branch I have. I haven't run it again so maybe this summary is inaccurate, but here's the commit: https://github.com/avar/git/commit/79256ee4a1 So basically it won't build with autoconf or whatever, just with our own build system after we'd wget the relevant PCREv2 source files into compat/pcre/. The only things PCRE really needed autoconf or its own build system for was to define 20-ish macros which are either options we can switch, or (looking at this now) 4x probes like "HAVE_STDINT_H" which we could add to compat.mak.uname and friends. If we wanted it to Just Work we could run that script and "git commit" it, or just leave it at a make warning saying "you don't have PCRE, either install it using your pkg system, or run this handy shellscript". But yeah, if Junio's paranoid about the security aspect we could skip that "./get-pcre.sh && git commit" step. I'm getting some deja-vu writing this next part, I'm pretty sure I've written some version of the same E-Mail before, in any case: A. This wouldn't be anything new. We have some old bitrotting copy of GNU awk's copy of glibc's regex engine in compat/regex. It's trivial to e.g. get it to exhaust all memory on your system. Sure, PCRE could have bugs, but at least we *could* update it. As you may remember I abandoned my last attempt to update compat/regex/ B. Even if you assume some bug like that, I think realistically no user anywhere would be vulnerable to a copy we'd ship in compat/pcre/, any vendor would ignore it and build the prereq themselves. It's purely a developer aid. Of course it could have some CVE, but it wouldn't be mean a git point release. C. Even if the regex engine has an CVE/RCE we have no feature anywhere to execute arbitrary user regexes, anyone who could inject one can already run arbitrary git commands. D. The *one* exception to C) is gitweb's not-on-by-default "grep the source tree" feature. So first its users would be better off than they are now now. You can trivially DoS any box that has it exposed, secondly the union of gitweb users + those that would turn that on + build their own git for running a public-facing site is probably 0 people these days. E. Even if you assume some combination of B..D isn't true I genuinely don't know why this concern of PCRE being a special security/vulnerability concern comes from, it seems to be brought up any time the topic is discussed. The implicit context is that we already process regexes using a plethora of engines, and are using anything from AIXs, SunOS's, Window's, OSX's, GNU libc's etc. regex engines to do so. I haven't seen anything to suggest that PCRE's security record is notably worse than even one of those pieces of software, and right now we're exposing any bugs in the union of all those engines. It's quite the opposite, unlike those engines PCRE actually gives you really good features to limit resource consumption and regex features in the event that you don't trust your regex or your input. I'd actually trust for things like processing a "git grep on the web" with appropriate resource constraints. The two solutions in that space are basically to use DFA engine as e.g. Google's re2 does. That means living with a severely restricted set of regex features, or to have bells and whistles as PCRE's NFA does, but one that has tweakable resource limits.