Re: [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 25 Apr 2017 00:45:27 +0200

On Mon, Apr 24, 2017 at 11:34 PM, Brandon Williams <bmwill@xxxxxxxxxx> wrote:
> On 04/24, Ęvar Arnfjörš Bjarmason wrote:
>> Russ Cox just published an article about how various glob()
>> implementations suffer from pathological performance when fed certain
>> pathological patterns like "a*a*a*a*b" given a file like "aaaaaaa...":
>> https://research.swtch.com/glob
>>
>> I was curious to see if this impacted git. It turns out it does. This
>> used to be a per-platform issue with git, since globbing was
>> implemented via fnmatch() by default before v1.8.4, and support for
>> using the OS fnmatch() was removed entirely in v2.0.0.
>>
>> This performance test shows the regression:
>>
>>     $ GIT_PERF_REPEAT_COUNT=1 GIT_PERF_MAKE_OPTS="[...]NO_WILDMATCH=YesPlease[...]" ./run v1.9.5 v2.0.0 v2.12.0 p0100-globbing.sh
>>     [...]
>>     Test                                                       v1.9.5            v2.0.0                    v2.12.0
>>     ------------------------------------------------------------------------------------------------------------------------------
>>     [...]
>>     0100.7: fileglob((a*)^nb) against file (a^100).t; n = 1    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
>>     0100.8: fileglob((a*)^nb) against file (a^100).t; n = 2    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
>>     0100.9: fileglob((a*)^nb) against file (a^100).t; n = 3    0.00(0.00+0.00)   0.00(0.00+0.00) =         0.01(0.00+0.00) +inf
>>     0100.10: fileglob((a*)^nb) against file (a^100).t; n = 4   0.00(0.00+0.00)   0.01(0.01+0.00) +inf      0.02(0.02+0.00) +inf
>>     0100.11: fileglob((a*)^nb) against file (a^100).t; n = 5   0.00(0.00+0.00)   0.20(0.19+0.00) +inf      0.24(0.23+0.00) +inf
>>     0100.12: fileglob((a*)^nb) against file (a^100).t; n = 6   0.00(0.00+0.00)   3.03(3.00+0.00) +inf      3.08(3.05+0.00) +inf
>>
>> And here's a one-liner to do the same:
>>
>>     $ time (rm -rf test; git init -q test && (cd test && touch $(perl -e 'print "a" x 100').t && git add a* && git commit -q -m"test" && git ls-files 'a*a*a*a*a*a*a*b'))
>>
>> Add or remove "a*"'s to adjust the runtime. With 6 that executes in 3
>> seconds on my system, 40 seconds with 7 etc.
>>
>> I don't think this is something we need to worry much about, if you
>> have a file like this and feed Git insane patterns you probably
>> deserve what you get.
>>
>> The real concern is if we have behavior like this and ever e.g. expose
>> globbing over the network, e.g. in some future upload-pack protocol.
>>
>> There are probably some web-based programs built on top of git that
>> are vulnerable to DoS attacks as a result of this, e.g. if they take
>> user-supplied globs and feed them to ls-files.
>
> I was taking a look at wildmatch a few months ago and have an unfinished
> patch to do some cleanup there.  I noticed this was inefficient but
> didn't expect those kinds of numbers.  I wonder how difficult it would
> be to rewrite it so that we don't have this issue.

Something I've started hacking on as part of a post-PCRE v2 series I'm
working on, which I'll submit after PCRE v2 support is merged, is to
replace various parts of regex / matching machinery with PCRE under
the hood if it's available, even when the user doesn't ask or care
that we're using PCRE.

The thing I'm working on now is to replace fixed-string grep/log
searches that we currently do via kwset with PCRE v2. It's quite a bit
faster than kwset, and allows us to fix a bunch of outstanding grep
TODO items.

It would similarly be interesting to replace wildmatch() with a shimmy
wrapper for PCRE that translates glob patterns to regexes. Not that we
need PCRE for that, we could probably just use regcomp(), but PCRE
would almost certainly be faster than wildmatch().

Of course the use-cases are different. For the kwset replacement we'd
just use PCRE under the hood because it's faster, whereas the reason
we integrated wildmatch() in the first place was to get consistent
behavior across all platforms, and PCRE is currently an optional
dependency.

I think it might make sense to solve this general class of problem by
eventually making PCRE a mandatory dependency of Git. Right now we
have copy/pasted versions of whatever the last GPLv2 version was of
kwset/wildmatch, which we then have to maintain.

Between kwset/wildmatch/various code in grep.c & diffcore-pickaxe.c we
probably have 2-3k lines of very tricky string matching code which
could be replaced by relatively trivial calls into PCRE.