Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jun 13 2019, Johannes Schindelin via GitGitGadget wrote:

> The kwset functionality makes use of the obstack code, which expects to
> be handed a function that can allocate large chunks of data. It expects
> that function to accept a `size` parameter of type `long`.
>
> This upsets GCC 8 on Windows, because `long` does not have the same
> bit size as `size_t` there.
>
> Now, the proper thing to do would be to switch to `size_t`. But this
> would make us deviate from the "upstream" code even further, making it
> hard to synchronize with newer versions, and also it would be quite
> involved because that `long` type is so invasive in that code.
>
> Let's punt, and instead provide a super small wrapper around
> `xmalloc()`.

I have a WIP patches from 2017 that do $subject that I can dig up, but
thought I'd gauge interest first.

Right now the grep code & pickaxe machinery will detect fixed strings
and use kwset() as an optimization.

Back when kwset was added in 9eceddeec6 ("Use kwset in grep",
2011-08-21) this helped, but now doing this for grep with a PCRE pattern
is actually counterproductive for performance. On top of current
`master`:

    @@ -368 +368 @@ static int is_fixed(const char *s, size_t len)
    -       return 1;
    +       return 0;

And running p7821-grep-engines-fixed.sh[1] (which is in git.git, and is
as far as I got with this) we get:

    Test                             HEAD~             HEAD
    -------------------------------------------------------------------------
    7821.1: fixed grep int           0.48(1.59+0.63)   0.48(1.53+0.68) +0.0%
    7821.2: basic grep int           0.55(1.64+0.51)   0.72(2.97+0.54) +30.9%
    7821.3: extended grep int        0.65(1.63+0.54)   0.77(2.92+0.60) +18.5%
    7821.4: perl grep int            1.01(1.62+0.55)   0.36(0.97+0.58) -64.4%
    7821.6: fixed grep uncommon      0.18(0.51+0.45)   0.18(0.51+0.46) +0.0%
    7821.7: basic grep uncommon      0.18(0.50+0.46)   0.30(1.36+0.33) +66.7%
    7821.8: extended grep uncommon   0.18(0.45+0.52)   0.28(1.37+0.37) +55.6%
    7821.9: perl grep uncommon       0.18(0.52+0.45)   0.16(0.28+0.54) -11.1%
    7821.11: fixed grep æ            0.31(1.28+0.39)   0.31(1.24+0.43) +0.0%
    7821.12: basic grep æ            0.30(1.29+0.38)   0.22(0.85+0.36) -26.7%
    7821.13: extended grep æ         0.30(1.26+0.40)   0.22(0.78+0.45) -26.7%
    7821.14: perl grep æ             0.30(1.33+0.34)   0.16(0.25+0.56) -46.7%

So what this means on my Debian box is that when we use PCRE with JIT
and just get out of its way and let it do its own fixed string matching
it's up to ~65% faster than the kwset() path.

The usual case of just feeding the fixed pattern to glibc's regex
function is slower, although as seen there when you grep for a
rarely-occurring non-ASCII string glibc does better now (the perils of
using ancient last-version-to-use-GPLv2 snapshots...).

So my plan for this (which I partially implemented) was to have a series
where if we have a fixed string and have PCRE v2 we'd use it instead of
kwset() for fixed strings.

It seems most packagers build with PCRE v2 now (CC:
git-packagers@). I.e. we'd keep something like compile_fixed_regexp()
(and as an aside just use PCRE's \Q...\E instead of our own escaping).

We'd have performance regression for platforms that use kwset() now but
don't build pcre2, or where pcre2 jit doesn't work. Does anyone care?

This would allow us to just "git rm" kwset.[ch] compat/obstack.[ch],
which is ~2k lines of tricky code, 1/2 of which we're currently doomed
to maintain a bitrotting version of due to license incompatibilities
with upstream[2].

As an aside there's other code in grep.c that we could similarly remove
in favor of optimistic PCRE v2 use, e.g. the -w case can be replaced by
\b<word>\b, but I found that less promising[3]. We can also get a huge
performance win for BRE and ERE patterns by using PCRE v2 with a
translation layer for those under the hood[4], but various solvable
backwards compatible headaches[5] related to that are why I got lost in
the weeds back in 2017 and didn't finish this.

But just the s/kwset/pcre2/ case is easy enough.

1. Via:

    GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_7821_GREP_OPTS='' GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE2=YesPlease' ./run HEAD~ HEAD -- p7821-grep-engines-fixed.sh
2. https://public-inbox.org/git/87wohn95vb.fsf@xxxxxxxxxxxxxxxxxxx/
3. https://github.com/avar/git/commit/49ca92e799
4. https://github.com/avar/git/commit/a3cc090344
5. https://github.com/avar/git/commit/7dd367eb37




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux