Re: Res: Res: squid 3.2.0.5 smp scaling issues

david@xxxxxxx · Mon, 25 Apr 2011 16:28:03 -0700 (PDT)

On Mon, 25 Apr 2011, Marcos wrote:

thanks for your answer David.

I'm seeing a lot of features being added to squid 3.x, but it's getting 
slower as new features are added.

that's unfortunately fairly normal.

I think squid 3.2 with 1 worker should be as fast as 2.7, but it's getting 
slower and hungrier.

that's one major problem, but the fact that the ACL matching isn't scaling 
with more workers I think is what's killing us.

one 3.2 worker is ~1/3 the speed of 2.7, but with the easy availability of 8+ 
real cores (not hyperthreaded 'fake' cores), you should still be able to 
get ~3x the performance of 2.7 by using 3.2.

unfortunately that's not what's happening, and we end up topping out 
around 1/2-2/3 the performance of 2.7

David Lang

Marcos

----- Original message ----
From: "david@xxxxxxx" <david@xxxxxxx>
To: Marcos <mczueira@xxxxxxxxxxxx>
Cc: Amos Jeffries <squid3@xxxxxxxxxxxxx>; squid-users@xxxxxxxxxxxxxxx; 
squid-dev@xxxxxxxxxxxxxxx
Sent: Friday, 22 April 2011 15:10:44
Subject: Re: Res:  squid 3.2.0.5 smp scaling issues

ping, I haven't seen a response to this additional information that I sent out 
last week.

squid 3.1 and 3.2 are a significant regression in performance from squid 2.7 or 
3.0

David Lang

On Thu, 14 Apr 2011, david@xxxxxxx wrote:

Subject: Re: Res:  squid 3.2.0.5 smp scaling issues

Ok, I finally got a chance to test 2.7STABLE9

it performs about the same as squid 3.0, possibly a little better.

with my somewhat stripped down config (smaller regex patterns, replacing CIDR 
blocks and names that would need to be looked up in /etc/hosts with individual 
IP addresses)

2.7 gives ~4800 requests/sec
3.0 gives ~4600 requests/sec
3.2.0.6 with 1 worker gives ~1300 requests/sec
3.2.0.6 with 5 workers gives ~2800 requests/sec

the numbers for 3.0 are slightly better than what I was getting with the full 
ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I got from the 
last round of tests (with either the full or simplified ruleset)

so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the 
ability to use multiple worker processes in 3.2 doesn't make up for this.
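To put the figures above side by side, here is a small sketch in C that computes each result relative to 2.7 and the scaling efficiency of the 5-worker run. The requests/sec values are copied from the measurements above; the function and macro names are made up for this illustration. With perfect linear scaling, 5 workers at ~1300 requests/sec each would reach ~6500 requests/sec, so the measured ~2800 is well under half of that.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput figures (requests/sec) copied from the measurements above. */
#define RPS_27     4800.0  /* squid 2.7STABLE9 */
#define RPS_30     4600.0  /* squid 3.0 */
#define RPS_32_1W  1300.0  /* squid 3.2.0.6, 1 worker */
#define RPS_32_5W  2800.0  /* squid 3.2.0.6, 5 workers */

/* Fraction of a baseline throughput that a measurement reaches. */
double relative_throughput(double measured, double baseline)
{
    assert(baseline > 0.0);
    return measured / baseline;
}

/* Speedup of n workers over 1 worker, divided by n.
 * 1.0 means perfect linear scaling; lower means workers are wasted. */
double scaling_efficiency(double rps_n, double rps_1, int workers)
{
    assert(rps_1 > 0.0 && workers > 0);
    return (rps_n / rps_1) / workers;
}

/* Print the comparison as percentages. */
void print_summary(void)
{
    printf("3.0 vs 2.7:            %.1f%%\n",
           100.0 * relative_throughput(RPS_30, RPS_27));
    printf("3.2 (1 worker) vs 2.7: %.1f%%\n",
           100.0 * relative_throughput(RPS_32_1W, RPS_27));
    printf("3.2 (5 workers) vs 2.7: %.1f%%\n",
           100.0 * relative_throughput(RPS_32_5W, RPS_27));
    printf("5-worker scaling efficiency: %.1f%%\n",
           100.0 * scaling_efficiency(RPS_32_5W, RPS_32_1W, 5));
}
```

Calling print_summary() from a main() prints the four percentages. The same scaling_efficiency() call can be repeated for each worker count tested to see where adding workers stops paying off.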

the time taken seems to almost all be in the ACL evaluation, as eliminating all 
the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.

one theory is that even though I have IPv6 disabled on this build, the added 
space and more expensive checks needed to compare IPv6 addresses instead of IPv4 
addresses account for the single worker drop of ~66%. that seems rather 
expensive, even though there are 293 http_access lines (one of them uses 
external file contents in its acls, so it's a total of ~2400 source/destination 
pairs; however, due to the ability to shortcut the comparison, the number of tests 
that need to be done should be <400)

In addition, there seems to be some sort of locking between the multiple worker 
processes in 3.2 when checking the ACLs, as the test with almost no ACLs scales 
close to 100% per worker while with the ACLs it scales much more slowly, and 
above 4-5 workers actually drops off dramatically (to the point where with 8 
workers the throughput is down to about what you get with 1-2 workers). I don't 
see any conceptual reason why the ACL checks of the different worker processes 
should impact each other in any way, let alone in a way that limits scalability 
to ~4 workers before adding more workers is a net loss.

David Lang

On Wed, 13 Apr 2011, Marcos wrote:

Hi David,

could you run and publish your benchmark with squid 2.7?
I'd like to know if there is any regression between 2.7 and the 3.x series.

thanks.

Marcos

----- Original message ----
From: "david@xxxxxxx" <david@xxxxxxx>
To: Amos Jeffries <squid3@xxxxxxxxxxxxx>
Cc: squid-users@xxxxxxxxxxxxxxx; squid-dev@xxxxxxxxxxxxxxx
Sent: Saturday, 9 April 2011 12:56:12
Subject: Re:  squid 3.2.0.5 smp scaling issues

On Sat, 9 Apr 2011, Amos Jeffries wrote:

On 09/04/11 14:27, david@xxxxxxx wrote:
A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has
zero overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source -> URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to
the top and the resulting performance numbers were up as if the other
ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
rule.

I changed one of the url_regex rules to just match one line rather than
a file containing 307 lines to see if that made a difference, and it
made no significant difference. So this indicates to me that it's not
having to fully evaluate every rule (it's able to skip doing the regex
if the IP match doesn't work)

I then changed all the acl lines that used hostnames to have IP
addresses in them, and this also made no significant difference

I then changed all subnet matches to single IP address (just nuked /##
throughout the config file) and this also made no significant difference.

Squid has always worked this way. It will *test* every rule from the top down 
to the one that matches. It also tests each line left-to-right until one ACL 
fails or the whole line matches.
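The evaluation order described here (rules top down, ACLs on a line left to right, stopping at the first ACL that fails) can be sketched in C as below. This is only an illustration of the described behaviour, not Squid's code: struct request, struct acl, struct rule, check_access() and the two match functions are all hypothetical names, and a substring test stands in for url_regex. The tests_run counter is there only to make the short-circuiting visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request: only the fields this sketch needs. */
struct request {
    uint32_t src_ip;   /* IPv4 source address, host byte order */
    const char *url;
};

/* One ACL test; returns true when the request matches. */
typedef bool (*acl_test)(const struct request *req, const void *arg);

struct acl {
    acl_test test;
    const void *arg;
};

enum action { ACTION_DENY, ACTION_ALLOW };

/* One http_access-style line: an action plus ACLs that must ALL match.
 * A line with zero ACLs matches everything (like "deny all"). */
struct rule {
    enum action action;
    const struct acl *acls;
    size_t n_acls;
};

unsigned long tests_run;   /* number of individual ACL tests executed */

/* Cheap test: exact source address match. */
bool match_src(const struct request *req, const void *arg)
{
    tests_run++;
    return req->src_ip == *(const uint32_t *)arg;
}

/* Stand-in for an expensive test such as url_regex (substring only). */
bool match_url_substr(const struct request *req, const void *arg)
{
    tests_run++;
    return strstr(req->url, (const char *)arg) != NULL;
}

/* Walk the rules top down. Within a rule, test ACLs left to right and
 * give up on that rule at the first failure. The first rule whose ACLs
 * all match decides the result; if none matches, deny. */
enum action check_access(const struct rule *rules, size_t n_rules,
                         const struct request *req)
{
    assert(rules != NULL && req != NULL);
    for (size_t i = 0; i < n_rules; i++) {
        bool all_matched = true;
        for (size_t j = 0; j < rules[i].n_acls; j++) {
            const struct acl *a = &rules[i].acls[j];
            if (!a->test(req, a->arg)) {
                all_matched = false;
                break;   /* skip the remaining ACLs on this line */
            }
        }
        if (all_matched)
            return rules[i].action;
    }
    return ACTION_DENY;
}
```

With the source test placed first on each line, a request from a non-matching address costs one cheap comparison per rule and never reaches the URL test. That matches the observation that shrinking the url_regex file made no difference, while moving the matching rule to the top did.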

so why are the address matches so expensive?

In 3.0 and older, an IP address match is a 32-bit comparison.
In 3.1 and newer, an IP address match is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than 
memcmp() we would welcome it.

I wonder if there should be a different version that's used when IPv6 is 
disabled. this is a pretty large hit.

if the data is aligned properly, on a 64-bit system this should still only be 2 
compares. do you do any alignment on the data now?
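A rough sketch of the "2 compares" idea, in C, next to a memcmp() baseline. The struct addr128 type and both function names are assumptions made for this example and are not Squid's actual address class. The sketch only tests equality; an ordering comparison (for sorted lookups) would also have to account for byte order within each 64-bit word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit address container (not Squid's real type). */
struct addr128 {
    uint8_t bytes[16];
};

/* Baseline: byte-oriented equality through memcmp(). */
bool addr_equal_memcmp(const struct addr128 *a, const struct addr128 *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}

/* Word-wise: view each address as two 64-bit words and compare those.
 * memcpy() into local words keeps the loads valid whatever the
 * alignment of the source bytes; optimizing compilers usually reduce
 * each copy to plain 64-bit loads. */
bool addr_equal_words(const struct addr128 *a, const struct addr128 *b)
{
    uint64_t a_words[2], b_words[2];

    assert(a != NULL && b != NULL);
    memcpy(a_words, a->bytes, sizeof a_words);
    memcpy(b_words, b->bytes, sizeof b_words);
    return a_words[0] == b_words[0] && a_words[1] == b_words[1];
}
```

Whether this beats memcmp() in practice would have to be measured: compilers can already expand a fixed-size memcmp() inline, so the gain, if any, depends on the compiler, the flags, and how the comparison is called.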

and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one 3.2
process is about 1/3 the speed of a 3.0 process in checking the acls
wouldn't matter nearly as much when it's so easy to get an 8+ core system.

There you have the unknown.

I think this is a fairly critical thing to figure out.