Re: Regex optimization

Yuri Voinov <yvoinov@xxxxxxxxx> · Wed, 27 Apr 2016 20:22:24 +0600

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

27.04.16 20:01, Amos Jeffries wrote:
> On 27/04/2016 11:32 p.m., Alfredo Rezinovsky wrote:
>> I saw in the debug log that when an ACL has many regexes, each one is
>> compared sequentially.
>>
>> If I have
>>
>> www.facebook.com
>> facebook.com
>> www.google.com
>> google.com
>>
>> Will it be faster to check just ONE optimized regex like
>> (www\.)?(facebook|google)\.com than the previous four?
>>
>> I'm really talking about optimizing about 3000 URL regexes into one huge
>> regex, because comparing each and every URL to 3000 regexes is too slow.
>
> As Yuri was trying to point out (I think), simply using one bigger regex
> pattern does not always mean faster.
Absolutely yes.

For example: in my experience, a capturing group written as (.*) takes
many more steps than (.*?) or (.+?). Yes, the latter expressions often
have a different meaning, but as a partial solution this substitution
is a useful optimization.
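
To illustrate the "different meaning" caveat, here is a minimal sketch
using Python's re module (an assumption for illustration only; it is not
the engine Squid uses, and the URL is a made-up example). It shows that
the greedy and lazy groups capture different text from the same input,
so one cannot always replace the other:

```python
import re

# Hypothetical sample URL, used only for this illustration.
url = "http://www.example.com/a/b/c.html"

# Greedy: the group first takes everything, then backs off to the LAST "/".
greedy = re.match(r"(.*)/", url)

# Lazy: the group starts empty and grows only until the FIRST "/".
lazy = re.match(r"(.*?)/", url)

print(greedy.group(1))  # http://www.example.com/a/b
print(lazy.group(1))    # http:
```

Step counts are not exposed by Python's re, so this only demonstrates
the semantic difference, not the performance difference.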

Also, the site I pointed to has an "explanation" section, which is a good
starting point for performance tuning of regexps.

In short: you can think of regex "steps" as the equivalent of "CPU
cycles", just to simplify. And yes, the dependency is direct: more
steps mean more cycles and slower execution.
>>
>> I know using
>> (www\.facebook\.com)|(facebook\.com)|(www\.google\.com)|(google\.com)
>> with PCRE will produce the same optimized result as
>> (www\.)?(facebook|google)\.com. Squid uses GnuRegex. Does the GNURegex lib
>> optimize this as well?
>
> If you actually pass GNURegex that *single* pattern, yes, it will do
> some optimization. Though I'm not sure how much exactly in comparison to
> PCRE.
>
>  * Also, while GNURegex is the built-in regex engine bundled with
> Squid, it really is only a backup engine for systems like Windows which
> don't provide a regex engine. The stdlib regex library is always used if
> available. On some OSes that stdlib engine is GNU, on others PCRE or
> something even better.
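
Before worrying about speed, it is worth checking that the factored
pattern really accepts the same hosts as the separate ones. A rough
sketch in Python's re (again an assumption for illustration, not
GNURegex or PCRE; the host list is made up) compares the four patterns
from the question against the single factored one. It also shows why
the dot before "com" should be escaped:

```python
import re

# The four separate patterns from the question, compiled individually.
separate = [re.compile(p) for p in (
    r"www\.facebook\.com", r"facebook\.com",
    r"www\.google\.com", r"google\.com",
)]

# The single factored pattern, with every dot escaped.
combined = re.compile(r"(www\.)?(facebook|google)\.com")

# Same pattern with an unescaped dot before "com": "." matches any character.
loose = re.compile(r"(www\.)?(facebook|google).com")

def hits_separate(host):
    """True if any of the separate patterns is found in host."""
    return any(rx.search(host) for rx in separate)

def hits_combined(host):
    """True if the single factored pattern is found in host."""
    return combined.search(host) is not None

# Hypothetical test hosts.
hosts = ["www.facebook.com", "facebook.com", "mail.google.com",
         "example.org", "googleXcom"]
for host in hosts:
    print(host, hits_separate(host), hits_combined(host))
```

Both approaches agree on every host in this list, while the "loose"
variant also accepts "googleXcom". Whether the single pattern is
actually faster depends on the regex engine in use and has to be
measured there.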
>
>
> What you see in the log is the fact that Squid is actually *not*
> configured with a single compound "optimized" pattern. You are actually
> using a file with ~3000 patterns in it ... so 3000 regex patterns to be
> checked against the URL.
>
> Whether Squid checks 3000 tests or some smaller number depends on what
> Squid version you are using. The recent versions do some trivial pattern
> aggregation and stripping away prefix/suffix ".*" garbage to help the
> library optimize better. But as Yuri showed, a bigger pattern does not
> necessarily take fewer *steps* per test. The gains are mostly in
> reduced Squid code CPU time and RAM overheads.
> Regex is still the slowest of the ACLs in terms of raw CPU consumed.
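
The kind of aggregation described above could be sketched roughly like
this. This is NOT Squid's actual code: the helper names strip_dot_star
and aggregate are hypothetical, the sample patterns are made up, and
the sketch uses Python's re syntax (the "(?:...)" grouping is not POSIX
regex). It drops a redundant leading or trailing ".*" and joins the
remaining patterns into one alternation:

```python
import re

def strip_dot_star(pattern):
    """Drop a leading/trailing '.*', which is redundant for an unanchored search."""
    # Leave '.*?' and '.*+' alone: cutting two characters would break them.
    while pattern.startswith(".*") and not pattern.startswith((".*?", ".*+")):
        pattern = pattern[2:]
    # Only strip a trailing '.*' when the dot is not escaped (i.e. not '\.*').
    while pattern.endswith(".*") and not pattern.endswith("\\.*"):
        pattern = pattern[:-2]
    return pattern

def aggregate(patterns):
    """Join the cleaned patterns into a single compiled alternation."""
    parts = ["(?:%s)" % strip_dot_star(p) for p in patterns]
    return re.compile("|".join(parts))

# Hypothetical pattern list, as it might appear in an ACL file.
patterns = [r".*facebook\.com.*", r"google\.com", r".*\.example\.net"]
rx = aggregate(patterns)
print(rx.pattern)
```

The result is one compiled pattern instead of three, which saves
per-pattern overhead in the caller. It says nothing about how many
steps the regex library then needs per test.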
>
>
> The biggest problem with using regex for domain name lists is that regex
> is optimized for left-to-right comparisons. Domain name labels are built
> right-to-left. dstdomain is optimized for right-to-left comparison with
> an early-abort on mismatch and sub-domain wildcards - which gives it a
> huge advantage in CPU cycles over regex.
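
The right-to-left idea can be sketched as follows. This is only an
illustration of the principle, not Squid's dstdomain implementation;
the function name domain_matches is hypothetical. It follows the
dstdomain convention that a rule with a leading dot, such as
".facebook.com", covers the domain itself and its sub-domains:

```python
def domain_matches(host, rule):
    """Compare labels right-to-left; a rule starting with '.' also covers sub-domains."""
    wildcard = rule.startswith(".")
    rule_labels = rule.strip(".").lower().split(".")
    host_labels = host.rstrip(".").lower().split(".")
    if len(host_labels) < len(rule_labels):
        return False
    # Walk from the TLD towards the left, aborting on the first mismatch.
    for h, r in zip(reversed(host_labels), reversed(rule_labels)):
        if h != r:
            return False
    # Extra labels on the left are only allowed for wildcard rules.
    return wildcard or len(host_labels) == len(rule_labels)

print(domain_matches("www.facebook.com", ".facebook.com"))  # True
print(domain_matches("facebook.org", ".facebook.com"))      # False
```

For "facebook.org" the comparison stops at the very first label
("org" vs "com"), whereas a left-to-right regex would have to scan
most of the string before failing.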
>
> Amos
>
> _______________________________________________
> squid-users mailing list
> squid-users@xxxxxxxxxxxxxxxxxxxxx
> http://lists.squid-cache.org/listinfo/squid-users

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCAAGBQJXIMsfAAoJENNXIZxhPexGuZ8H/2DNMNKp3u/3kmOsUczWH4KG
mP09zPzbPu7veniLOR30RGFZEbAFr0UxPGnaASyzzRMbJZ2ChAqUEtwsJvT2+lCL
g0lNZ5GPdnBh8DECrR0Cu5cV67Y8fXeQRdxYJlnjQdD4UH5thg6iZbOYNqOZLkOr
FiCpK6m6J32QH9EgI5x8GwhZBxpEJLyilqeAaku3kxTY4yqeguiSh6L4srfYhc+U
EPCR7q+dYrQ1UuroenHlCYnXLX/KmDD5AUA5AdxML1bNpTo1z7tVrdDVXbbBofIb
CZ+Y9duuBtJ5zaYi2qVbROolx7GDDwT2zdhniA+UNaMhx6k2RMnKZHTcFScfsE8=
=2fLk
-----END PGP SIGNATURE-----
