Re: Res: squid 3.2.0.5 smp scaling issues

Alex Rousskov <rousskov@xxxxxxxxxxxxxxxxxxxxxxx> · Mon, 25 Apr 2011 16:35:00 -0600

On 04/14/2011 09:06 PM, david@xxxxxxx wrote:
> Ok, I finally got a chance to test 2.7STABLE9
> 
> it performs about the same as squid 3.0, possibly a little better.
> 
> with my somewhat stripped down config (smaller regex patterns, replacing
> CIDR blocks and names that would need to be looked up in /etc/hosts with
> individual IP addresses)
> 
> 2.7 gives ~4800 requests/sec
> 3.0 gives ~4600 requests/sec
> 3.2.0.6 with 1 worker gives ~1300 requests/sec
> 3.2.0.6 with 5 workers gives ~2800 requests/sec

Glad you did not see a significant regression between v2.7 and v3.0. We
have heard rather different stories. Every environment is different, and
many lab tests are misguided, of course, but it is still good to hear
positive reports.

The difference between v3.2 and v3.0 is known and has been discussed on
squid-dev. A few specific culprits are also known, but more need to be
identified. We are working on identifying these performance bugs and
reducing that difference.

As for the difference between 1 and 5 workers, it seems to be specific
to your environment (as discussed below).
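
To make the worker-scaling gap concrete, here is a back-of-the-envelope
calculation using only the figures quoted above. The function name
scaling_summary is made up for this illustration, and Python is used
purely for the arithmetic; it is not a measurement tool.

```python
# Throughput figures quoted earlier in this message (requests/sec).
RATE_1_WORKER = 1300   # 3.2.0.6, 1 worker
RATE_5_WORKERS = 2800  # 3.2.0.6, 5 workers
WORKERS = 5


def scaling_summary(rate_one, rate_many, workers):
    """Return (speedup, efficiency, ideal_rate) for a multi-worker run."""
    speedup = rate_many / rate_one      # observed gain over one worker
    efficiency = speedup / workers      # 1.0 would be perfectly linear
    ideal_rate = rate_one * workers     # what linear scaling would give
    return speedup, efficiency, ideal_rate


speedup, efficiency, ideal_rate = scaling_summary(
    RATE_1_WORKER, RATE_5_WORKERS, WORKERS
)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}, "
      f"ideal: {ideal_rate} req/s")
```

With the quoted figures this works out to a speedup of about 2.15x,
against the 5x (6500 requests/sec) that linear scaling would predict.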

> the numbers for 3.0 are slightly better than what I was getting with the
> full ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I
> got from the last round of tests (with either the full or simplified
> ruleset)
> 
> so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and
> the ability to use multiple worker processes in 3.2 doesn't make up for
> this.
> 
> the time taken seems to almost all be in the ACL evaluation as
> eliminating all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.

If ACLs are the major culprit in your environment, then this is most
likely not a problem in Squid source code. AFAIK, there are no locks or
other synchronization primitives/overheads when it comes to Squid ACLs.
The solution may lie in optimizing some 3rd-party libraries (used by
ACLs) or in optimizing how they are used by Squid, depending on what
ACLs you use. As far as Squid-specific code is concerned, you should see
nearly linear ACL scaling with the number of workers.
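
For reference, the evaluation order described further down in the quoted
thread (rules tested top-down, each line tested left-to-right until one
ACL fails or the whole line matches) can be modelled roughly as follows.
This is an illustrative Python sketch, not Squid code: check_access, src,
url_contains, the request layout, and the ruleset are all hypothetical,
and the fallback when no rule matches is a simplification.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Acl = Callable[[Request], bool]
Rule = Tuple[str, List[Acl]]  # (action, ACLs that must all match)


def check_access(rules: List[Rule], request: Request) -> Tuple[str, int]:
    """Return (action, number of ACL tests performed) for the first match."""
    tests = 0
    for action, acls in rules:        # rules are tried top-down
        matched = True
        for acl in acls:              # ACLs on a line are tried left-to-right
            tests += 1
            if not acl(request):      # first failing ACL ends this line
                matched = False
                break
        if matched:
            return action, tests
    return "deny", tests              # simplified fallback (assumption)


def src(ip: str) -> Acl:
    """Hypothetical source-address ACL: exact string match only."""
    return lambda req: req["src"] == ip


def url_contains(text: str) -> Acl:
    """Hypothetical stand-in for a url_regex ACL: plain substring match."""
    return lambda req: text in req["url"]


# Made-up ruleset: 292 unrelated allow rules, the rule for the test
# client, and a deny-all (an empty ACL list matches every request).
others = [("allow", [src(f"10.0.0.{i}"), url_contains("example")])
          for i in range(1, 293)]
test_rule = ("allow", [src("192.0.2.1"), url_contains("test")])
deny_all = ("deny", [])

request = {"src": "192.0.2.1", "url": "http://test.invalid/"}
print(check_access(others + [test_rule, deny_all], request))  # rule at bottom
print(check_access([test_rule] + others + [deny_all], request))  # rule at top
```

In this model the cheap source check fails first on every unrelated
line, so the url_contains test is skipped there, and moving the matching
rule to the top cuts the work from 294 ACL tests to 2. That is consistent
with the observations reported in the quoted thread.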

> one theory is that even though I have IPv6 disabled on this build, the
> added space and more expensive checks needed to compare IPv6 addresses
> instead of IPv4 addresses accounts for the single worker drop of ~66%.
> that seems rather expensive, even though there are 293 http_access lines
> (and one of them uses external file contents in its acls, so it's a
> total of ~2400 source/destination pairs, however due to the ability to
> shortcut the comparison the number of tests that need to be done should
> be <400)

Yes, IPv6 is one of the known major performance regression culprits, but
IPv6 ACLs should still scale linearly with the number of workers, AFAICT.
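
The word-wise comparison idea raised in the quoted exchange below can be
illustrated with a small model. The sketch assumes that an IPv4 address
is held in 128-bit form as an IPv4-mapped IPv6 address; the helper names
are hypothetical, and Python is used only to show that comparing two
64-bit words gives the same answer as a 16-byte comparison. It says
nothing about relative speed in C++.

```python
import ipaddress
import struct


def to_128bit(addr: str) -> bytes:
    """Return a 16-byte form; IPv4 becomes an IPv4-mapped IPv6 address."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        ip = ipaddress.IPv6Address(f"::ffff:{addr}")
    return ip.packed


def equal_bytewise(a: bytes, b: bytes) -> bool:
    """Analogue of memcmp(a, b, 16) == 0."""
    return a == b


def equal_wordwise(a: bytes, b: bytes) -> bool:
    """Compare the same 16 bytes as two 64-bit words."""
    a_hi, a_lo = struct.unpack(">QQ", a)
    b_hi, b_lo = struct.unpack(">QQ", b)
    return a_hi == b_hi and a_lo == b_lo


x = to_128bit("192.0.2.1")
y = to_128bit("::ffff:192.0.2.1")
z = to_128bit("192.0.2.2")
print(equal_bytewise(x, y), equal_wordwise(x, y))
print(equal_bytewise(x, z), equal_wordwise(x, z))
```

Whether two word compares would actually beat memcmp() depends on the
compiler, the C library, and data alignment, so it would need to be
measured in the real code.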

Please note that I am not an ACL expert. I am just talking from the
overall Squid SMP design point of view and from our testing/deployment
experience point of view.

> In addition, there seems to be some sort of locking between the multiple
> worker processes in 3.2 when checking the ACLs

There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.

HTH,

Alex.

>> On Wed, 13 Apr 2011, Marcos wrote:
>>
>>> Hi David,
>>>
>>> could you run and publish your benchmark with squid 2.7?
>>> I'd like to know if there is any regression between 2.7 and the 3.x series.
>>>
>>> thanks.
>>>
>>> Marcos
>>>
>>>
>>> ----- Original Message ----
>>> From: "david@xxxxxxx" <david@xxxxxxx>
>>> To: Amos Jeffries <squid3@xxxxxxxxxxxxx>
>>> Cc: squid-users@xxxxxxxxxxxxxxx; squid-dev@xxxxxxxxxxxxxxx
>>> Sent: Saturday, April 9, 2011 12:56:12
>>> Subject: Re:  squid 3.2.0.5 smp scaling issues
>>>
>>> On Sat, 9 Apr 2011, Amos Jeffries wrote:
>>>
>>>> On 09/04/11 14:27, david@xxxxxxx wrote:
>>>>> A couple more things about the ACLs used in my test
>>>>>
>>>>> all of them are allow ACLs (no deny rules to worry about precedence of)
>>>>> except for a deny-all at the bottom
>>>>>
>>>>> the ACL line that permits the test source to the test destination has
>>>>> zero overlap with the rest of the rules
>>>>>
>>>>> every rule has an IP based restriction (even the ones with
>>>>> url_regex are
>>>>> source -> URL regex)
>>>>>
>>>>> I moved the ACL that allows my test from the bottom of the ruleset to
>>>>> the top and the resulting performance numbers were up as if the other
>>>>> ACLs didn't exist. As such it is very clear that 3.2 is evaluating
>>>>> every
>>>>> rule.
>>>>>
>>>>> I changed one of the url_regex rules to just match one line rather
>>>>> than
>>>>> a file containing 307 lines to see if that made a difference, and it
>>>>> made no significant difference. So this indicates to me that it's not
>>>>> having to fully evaluate every rule (it's able to skip doing the regex
>>>>> if the IP match doesn't work)
>>>>>
>>>>> I then changed all the acl lines that used hostnames to have IP
>>>>> addresses in them, and this also made no significant difference
>>>>>
>>>>> I then changed all subnet matches to single IP address (just nuked /##
>>>>> throughout the config file) and this also made no significant
>>>>> difference.
>>>>>
>>>>
>>>> Squid has always worked this way. It will *test* every rule from the
>>>> top down to the one that matches. Also testing each line
>>>> left-to-right until one fails or the whole line matches.
>>>>
>>>>>
>>>>> so why are the address matches so expensive
>>>>>
>>>>
>>>> 3.0 and older IP address is a 32-bit comparison.
>>>> 3.1 and newer IP address is a 128-bit comparison with memcmp().
>>>>
>>>> If something like a word-wise comparison can be implemented faster
>>>> than memcmp() we would welcome it.
>>>
>>> I wonder if there should be a different version that's used when IPv6
>>> is disabled. this is a pretty large hit.
>>>
>>> if the data is aligned properly, on a 64 bit system this should still
>>> only be 2 compares. do you do any alignment on the data now?
>>>
>>>>> and as noted in the e-mail below, why do these checks not scale nicely
>>>>> with the number of worker processes? If they did, the fact that one
>>>>> 3.2
>>>>> process is about 1/3 the speed of a 3.0 process in checking the acls
>>>>> wouldn't matter nearly as much when it's so easy to get an 8+ core
>>>>> system.
>>>>>
>>>>
>>>> There you have the unknown.
>>>
>>> I think this is a fairly critical thing to figure out.