Re: squid 3.2.0.5 smp scaling issues

david@xxxxxxx · Sat, 9 Apr 2011 08:56:12 -0700 (PDT)

On Sat, 9 Apr 2011, Amos Jeffries wrote:

On 09/04/11 14:27, david@xxxxxxx wrote:
A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precidence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source -> URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames to have IP
addresses in them, and this also made no significant difference

I then changed all subnet matches to single IP address (just nuked /##
throughout the config file) and this also made no significant difference.

Squid has always worked this way. It will *test* every rule from the top down 
to the one that matches. Also testing each line left-to-right until one fails 
or the whole line matches.

so why are the address matches so expensive

3.0 and older IP address is a 32-bit comparison.
3.1 and newer IP address is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.

I wonder if there should be a different version that's used when IPv6 is 
disabled. this is a pretty large hit.

if the data is aligned properly, on a 64 bit system this should still only 
be 2 compares. do you do any alignment on the data now?

and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one 3.2
process is about 1/3 the speed of a 3.0 process in checking the acls
wouldn't matter nearly as much when it's so easy to get an 8+ core system.

There you have the unknown.

I think this is a fairly critical thing to figure out.

it seems to me that all accept/deny rules in a set should be able to be
combined into a tree to make searching them very fast.

so for example if you have

accept 1
accept 2
deny 3
deny 4
accept 5

you need to create three trees (one with accept 1 and accept 2, one with
deny3 and deny4, and one with accept 5) and then check each tree to see
if you have a match.

the types of match could be done in order of increasing cost, so if you

The config file is specific structure configured by admin under guaranteed 
rules of operation for access lines (top-down, left-to-right, 
first-match-wins) to perform boolean-logic calculations using ACL sets.
Sorting access line rules is not an option.
Sorting ACL values and tree-forming them is already done (regex being the 
one exception AFAIK).
Sorting position-wise on a single access line is also ruled out by 
interactions with deny_info, auth and external ACL types.

It would seem that as long as you don't cross boundries between the 
different types, you should be able to optimize within a group.

using my example above, you couldn't combine the 'accept 5' with any of 
the other accepts, but you could combine accept 1 and 2 and combine deny 3 
and 4 togeather.

now, I know that I don't fully understand all the possible ACL types, so 
this may not work for some of them, but I believe that a fairly common use 
case is to have either a lot of allow rules, or a lot of deny rules as a 
block (either a list of sites you are allowed to access, or a list of 
sites that are blocked), so an ability to optimize these use cases may be 
well worth it.

have acl entries of type port, src, dst, and url regex, organize the
tree so that you check ports first, then src, then dst, then only if all
that matches do you need to do the regex. This would be very similar to
the shortcut logic that you use today with a single rule where you bail
out when you don't find a match.

you could go with a complex tree structure, but since this only needs to
be changed at boot time,

Um, "boot"/startup time and arbitrary "-k reconfigure" times.
With a reverse-configuration display dump on any cache manager request.

still a pretty rare case, and one where you can build a completely new 
ruleset and swap it out. My point was that this isn't something that you 
have to be able to update dynamically.

it seems to me that a simple array that you can
do a binary search on will work for the port, src, and dst trees. The
url regex is probably easiest to initially create by just doing a list
of regex strings to match and working down that list, but eventually it

This is already how we do these. But with a splay tree instead of binary.

I wondered about that. I've gotten splay tree related warning messages.

may be best to create a parse tree so that you only have to walk down
the string once to see if you have a match.

That would be nice. Care to implement?
You just have to get the regex library to adjust its pre-compiled patterns 
with OR into (existing|new) whenever a new pattern string is added to an ACL.

I'm actually watching the libnorm project closely, with an intention of 
leveraging it for this sort of thing. It's a project being developed by 
the maintainer of rsyslog trying to efficiently match log entries. it 
doesn't support regex notation for defining it's rules at this point, but 
I've poked Rainer in that direction, so we'll see how things go.

you wouldn't quite be able to get this fast as you would have to
actually do two checks, one if you have a match on that level and one
for the rules that don't specify something in the current tree (one
check for if the http_access line specifies a port number and one for if
it doesn't for example)

We get around this problem by using C++ types. ACLChecklist walks the tree 
holding the current location, expected result, and all details available 
about the transaction. Each node in the tree has a match() function which 
gets called at most once per walk. Each ACL data type provides its own 
match() algorithm.

That is why the following config is invalid:
acl foo src 1.2.3.4
acl foo port 80

Ok, makes sense. I'll have to dig into that, thanks for the pointer of the 
function to look for.

this sort of acl structure would reduce a complex ruleset down to ~O(log
n) instead of the current O(n) (a really complex ruleset would be log n
of each tree added togeather)

there are cases where this sort of thing would be more expensive than
the current, simple approach, but those would be on simple rulesets
which aren't affected much by a few extra checks.

Um, You have pretty much described our existing code. Even with the details 
of *how* hidden away the small bit exposed to admin is fairly complex.

good ;-) I'm glad when I suggest an approach and find that the project is 
already doing things the way I think is best.

I am thinking you did not understand ACLs very well earlier when designing 
your config rules. With large rulsets that can easily lead to an inefficient 
config and the worst-case results you seem to have achieved.
If you care to share your actual configuration file contents I'm happy to 
read through and point out any optimizations that can be made.
Though you may want to use the above info and see if you can find any 
optimizations first.

I'll have to do some sanitizeing of the rules before I can send it out. 
I'll try and figure out how to do this without destroying the ability to 
check things.

the basic problem is that this is a whitelist of what is allowed to be 
accessed. I know that there are some problems with rules that overlap, but 
that's not a large part of the issue (usually if rules overlap, the more 
general rule is wrong, but the application developers have done something 
stupid and it needs to be in there until they fix the application)

The ACL storage types are all defined in the src/acl/*Data.* source files. If 
you wish to work on finding us some faster types or even faster matching 
algorithm for an existing type that would be welcome.
We do ask for some unit/micro-benchmarks of the old vs new match() function 
so we know the change is an actual improvement.

I don't know how much I will get a chance to do any coding on this (as 
opposed to being available to test anything that you folks try, testing I 
will make time for), but I definantly agree on the need to do benchmarks.

again, thanks for the info.

David Lang