Res: Res: squid 3.2.0.5 smp scaling issues

Marcos <mczueira@xxxxxxxxxxxx> · Mon, 25 Apr 2011 12:15:29 -0700 (PDT)

thanks for your answer, David.

i'm seeing too many features being added to squid 3.x, and it's getting 
slower as new features are added.
i think squid 3.2 with 1 worker should be as fast as 2.7, but it's getting 
slower and hungrier.

Marcos

----- Original message ----
From: "david@xxxxxxx" <david@xxxxxxx>
To: Marcos <mczueira@xxxxxxxxxxxx>
Cc: Amos Jeffries <squid3@xxxxxxxxxxxxx>; squid-users@xxxxxxxxxxxxxxx; 
squid-dev@xxxxxxxxxxxxxxx
Sent: Friday, 22 April 2011 15:10:44
Subject: Re: Res:  squid 3.2.0.5 smp scaling issues

ping, I haven't seen a response to this additional information that I sent out 
last week.

squid 3.1 and 3.2 are a significant regression in performance from squid 2.7 or 
3.0

David Lang

On Thu, 14 Apr 2011, david@xxxxxxx wrote:

> Subject: Re: Res:  squid 3.2.0.5 smp scaling issues
> 
> Ok, I finally got a chance to test 2.7STABLE9
> 
> it performs about the same as squid 3.0, possibly a little better.
> 
> with my somewhat stripped down config (smaller regex patterns, replacing CIDR 
>blocks and names that would need to be looked up in /etc/hosts with individual 
>IP addresses)
> 
> 2.7 gives ~4800 requests/sec
> 3.0 gives ~4600 requests/sec
> 3.2.0.6 with 1 worker gives ~1300 requests/sec
> 3.2.0.6 with 5 workers gives ~2800 requests/sec
> 
> the numbers for 3.0 are slightly better than what I was getting with the full 
>ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I got from the 
>last round of tests (with either the full or simplified ruleset)
> 
> so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the 
>ability to use multiple worker processes in 3.2 doesn't make up for this.
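
The figures above can be put in relative terms with a little arithmetic. The following is only a sketch; the helper names are hypothetical and the inputs are the approximate requests/sec values quoted in this message.

```cpp
// Relative-throughput helpers for the benchmark figures quoted above.
// Function names are illustrative only; they are not part of Squid.

// Percentage drop going from `baseline` to `measured` (both in requests/sec).
double percentDrop(double baseline, double measured) {
    return 100.0 * (baseline - measured) / baseline;
}

// Fraction of ideal linear scaling achieved by `workers` worker processes,
// given the single-worker throughput and the measured multi-worker throughput.
double scalingEfficiency(double oneWorker, double nWorkers, int workers) {
    return nWorkers / (oneWorker * workers);
}
```

With the numbers above, percentDrop(4600, 1300) comes to roughly 72 (3.2.0.6 with 1 worker versus 3.0), and scalingEfficiency(1300, 2800, 5) comes to roughly 0.43, i.e. 5 workers deliver well under half of ideal linear scaling.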
> 
> the time taken seems to almost all be in the ACL evaluation, as eliminating all 
>the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.
> 
> one theory is that even though I have IPv6 disabled on this build, the added 
>space and more expensive checks needed to compare IPv6 addresses instead of IPv4 
>addresses account for the single worker drop of ~66%. that seems rather 
>expensive, even though there are 293 http_access lines (one of them uses 
>external file contents in its acls, so it's a total of ~2400 source/destination 
>pairs; however, due to the ability to shortcut the comparison, the number of tests 
>that need to be done should be <400).
> 
> 
> 
> In addition, there seems to be some sort of locking between the multiple worker 
>processes in 3.2 when checking the ACLs: the test with almost no ACLs scales 
>close to 100% per worker, while with the ACLs it scales much more slowly, and 
>above 4-5 workers it actually drops off dramatically (to the point where with 8 
>workers the throughput is down to about what you get with 1-2 workers). I don't 
>see any conceptual reason why the ACL checks of the different worker processes 
>should impact each other in any way, let alone in a way that limits scalability 
>to ~4 workers before adding more workers is a net loss.
> 
> David Lang
> 
> 
>> On Wed, 13 Apr 2011, Marcos wrote:
>> 
>>> Hi David,
>>> 
>>> could you run and publish your benchmark with squid 2.7?
>>> i'd like to know if there is any regression between the 2.7 and 3.x series.
>>> 
>>> thanks.
>>> 
>>> Marcos
>>> 
>>> 
>>> ----- Original message ----
>>> From: "david@xxxxxxx" <david@xxxxxxx>
>>> To: Amos Jeffries <squid3@xxxxxxxxxxxxx>
>>> Cc: squid-users@xxxxxxxxxxxxxxx; squid-dev@xxxxxxxxxxxxxxx
>>> Sent: Saturday, 9 April 2011 12:56:12
>>> Subject: Re:  squid 3.2.0.5 smp scaling issues
>>> 
>>> On Sat, 9 Apr 2011, Amos Jeffries wrote:
>>> 
>>>> On 09/04/11 14:27, david@xxxxxxx wrote:
>>>>> A couple more things about the ACLs used in my test
>>>>> 
>>>>> all of them are allow ACLs (no deny rules to worry about the precedence of)
>>>>> except for a deny-all at the bottom
>>>>> 
>>>>> the ACL line that permits the test source to the test destination has
>>>>> zero overlap with the rest of the rules
>>>>> 
>>>>> every rule has an IP based restriction (even the ones with url_regex are
>>>>> source -> URL regex)
>>>>> 
>>>>> I moved the ACL that allows my test from the bottom of the ruleset to
>>>>> the top and the resulting performance numbers were up as if the other
>>>>> ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
>>>>> rule.
>>>>> 
>>>>> I changed one of the url_regex rules to just match one line rather than
>>>>> a file containing 307 lines to see if that made a difference, and it
>>>>> made no significant difference. So this indicates to me that it's not
>>>>> having to fully evaluate every rule (it's able to skip doing the regex
>>>>> if the IP match doesn't work)
>>>>> 
>>>>> I then changed all the acl lines that used hostnames to have IP
>>>>> addresses in them, and this also made no significant difference
>>>>> 
>>>>> I then changed all subnet matches to single IP address (just nuked /##
>>>>> throughout the config file) and this also made no significant difference.
>>>>> 
>>>> 
>>>> Squid has always worked this way. It will *test* every rule from the top down 
>>>>to the one that matches. It also tests each line left-to-right until one ACL 
>>>>fails or the whole line matches.
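
The evaluation order described here (rules top-down, ACLs on each line left-to-right, stopping at the first failing ACL or the first fully matching line) can be sketched as follows. This is an illustrative model under stated assumptions, not Squid's implementation: the types Request, AclTest and AccessRule and the function checkAccess are hypothetical, and denying when no rule matches is an assumption of the sketch.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative model only; these are not Squid's internal classes.
struct Request {
    std::uint32_t srcIp;
    std::uint32_t dstIp;
};

using AclTest = std::function<bool(const Request&)>;

struct AccessRule {
    bool allow;                  // action taken when every ACL on the line matches
    std::vector<AclTest> tests;  // ACLs on the line, evaluated left to right
};

// Walks the rules top-down and returns the action of the first rule whose
// ACLs all match. `testsRun` is incremented once per ACL test performed.
// Assumption of this sketch: deny (false) when no rule matches.
bool checkAccess(const std::vector<AccessRule>& rules, const Request& req, int& testsRun) {
    for (const auto& rule : rules) {
        bool matched = true;
        for (const auto& test : rule.tests) {
            ++testsRun;
            if (!test(req)) {    // first failing ACL ends this line early
                matched = false;
                break;
            }
        }
        if (matched)
            return rule.allow;   // first fully matching line decides
    }
    return false;
}
```

In this model, a rule placed at the bottom of a long list costs at least one ACL test per preceding line, while the same rule at the top costs only its own tests, which is consistent with the effect of moving the matching rule described above.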
>>>> 
>>>>> 
>>>>> so why are the address matches so expensive?
>>>>> 
>>>> 
>>>> In 3.0 and older, an IP address match is a 32-bit comparison.
>>>> In 3.1 and newer, an IP address match is a 128-bit comparison with memcmp().
>>>> 
>>>> If something like a word-wise comparison can be implemented faster than 
>>>>memcmp() we would welcome it.
>>> 
>>> I wonder if there should be a different version that's used when IPv6 is 
>>>disabled. this is a pretty large hit.
>>> 
>>> if the data is aligned properly, on a 64 bit system this should still only be 2 
>>>compares. do you do any alignment on the data now?
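
The "2 compares" idea can be sketched as below, next to a memcmp() baseline. This is a minimal sketch under stated assumptions: Addr128, equalMemcmp and equalWords are hypothetical names, not Squid's address class or API, and the sketch only tests equality. Whether it actually beats memcmp() depends on the compiler and platform and would need to be measured.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative 128-bit address container; not Squid's actual address class.
// alignas(8) keeps the bytes on a 64-bit boundary.
struct Addr128 {
    alignas(8) unsigned char bytes[16];
};

// Baseline: equality via memcmp() over all 16 bytes.
bool equalMemcmp(const Addr128& a, const Addr128& b) {
    return std::memcmp(a.bytes, b.bytes, sizeof a.bytes) == 0;
}

// Word-wise equality: two 64-bit words per address.
// Copying into local words with memcpy avoids aliasing problems; compilers
// typically reduce each fixed-size copy to a single load.
bool equalWords(const Addr128& a, const Addr128& b) {
    std::uint64_t a0, a1, b0, b1;
    std::memcpy(&a0, a.bytes, 8);
    std::memcpy(&a1, a.bytes + 8, 8);
    std::memcpy(&b0, b.bytes, 8);
    std::memcpy(&b1, b.bytes + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}
```

Note that this replaces only an equality check. An ordering comparison with the same result as memcmp() would additionally have to account for byte order on little-endian machines.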
>>> 
>>>>> and as noted in the e-mail below, why do these checks not scale nicely
>>>>> with the number of worker processes? If they did, the fact that one 3.2
>>>>> process is about 1/3 the speed of a 3.0 process in checking the acls
>>>>> wouldn't matter nearly as much when it's so easy to get an 8+ core system.
>>>>> 
>>>> 
>>>> There you have the unknown.
>>> 
>>> I think this is a fairly critical thing to figure out.
>