Re: Squid performance profiling

On 21/06/2013 10:34 p.m., Ahmed Talha Khan wrote:
On Fri, Jun 21, 2013 at 10:41 AM, Alex Rousskov
<rousskov@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
On 06/20/2013 10:47 PM, Ahmed Talha Khan wrote:
On Fri, Jun 21, 2013 at 6:17 AM, Alex Rousskov wrote:
On 06/20/2013 02:00 AM, Ahmed Talha Khan wrote:
My test methodology looks like this

generator(apache benchmark)<------->squid<------>server(lighttpd)
...
These results show that Squid is NOT CPU bound at this point. Neither
is it network I/O bound, because I can get much more throughput when I
only run the generator against the server. In this case Squid should be
able to do more. Where is the bottleneck coming from?

The "bottleneck" may be coming from your test methodology -- you are
allowing Squid to slow down the benchmark instead of benchmark driving
the Squid box to its limits. You appear to be using what we call a "best
effort" test, where the request rate is determined by Squid response
time. In most real-world environments concerned with performance, the
request rate does not decrease just because a proxy wants to slow down a
little.

Then the question becomes: why is Squid slowing down?
I think there are 2.5 primary reasons for that:

1) Higher concurrency level ("c" in your tables) means more
waiting/queuing time for each transaction: When [a part of] one
transaction has to wait for [a part of] another before being served,
transaction response time goes up. For example, the more network sockets
are "ready" at the same time, the higher the response time is going to
be for the transaction whose socket happens to be the last one handled
during that specific I/O loop iteration.

Are these queues maintained internally inside Squid? What can be done
to reduce this waiting?

The queue is created in a single step by the kernel: it responds with a
set of FDs that have I/O events to be handled. Squid is then expected to
iterate over them and do the I/O. Like Alex said, there is nothing that
can be done about that queue itself. Looping over it fast and scheduling
multiple internal Calls at once is tempting, but that just offloads the
delay from the select/poll/epoll/kqueue loop to the AsyncCall queue; the
visible/total delay remains constant (or possibly gets worse if events
are double-queued).
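To make that concrete, here is a minimal sketch of the pattern,
assuming Linux epoll; handle_io() is a hypothetical stand-in for the
per-socket read/write handler:

#include <sys/epoll.h>

/* Hypothetical per-socket handler; real code would read/write here. */
static void handle_io(int fd) { (void)fd; }

void io_loop(int epfd) {
    struct epoll_event events[1024];
    for (;;) {
        /* The kernel hands back every ready FD in one batch. */
        int n = epoll_wait(epfd, events, 1024, -1);
        if (n < 0)
            continue; /* interrupted; a real loop would check errno */
        /* With concurrency "c", n grows toward c, and the transaction
         * whose socket is handled last in this loop waits for all the
         * others first. Scheduling the work elsewhere only moves that
         * wait into another queue; it does not remove it. */
        for (int i = 0; i < n; ++i)
            handle_io(events[i].data.fd);
    }
}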


2a) Squid sometimes uses hard-coded limits for various internal caches
and tables. With higher concurrency level, Squid starts hitting those
limits and operating less efficiently (e.g., not keeping a connection
persistent because the persistent connection table is full -- I do not
remember whether this actually happens, so this is just an example of
what could happen to illustrate 2a).
Can you point me to some of the key ones and their impact, so that I
can test by changing these limits and seeing whether that enhances or
degrades performance? Also, are there any tweaks in the network stack
that might help with that? I am primarily interested in enhancing the
SSL performance.

Much of the lag in SSL is due to the handshake exchanges it requires: a
small number of bytes in each direction wastes entire packet round-trip
times just to set the session up, followed by the processing overhead of
actually encrypting the bits.
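One way to see how the round trips and the crypto split up is to time
the TCP connect and the TLS handshake separately. A rough sketch with
OpenSSL (TLS_client_method() needs OpenSSL 1.1+; host and port are
placeholders for your backend):

#include <openssl/ssl.h>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

int main(int argc, char **argv) {
    const char *host = argc > 1 ? argv[1] : "127.0.0.1";
    const char *port = argc > 2 ? argv[2] : "443";

    addrinfo hints{}, *res;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0) return 1;

    auto t0 = std::chrono::steady_clock::now();
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (connect(fd, res->ai_addr, res->ai_addrlen) != 0) return 1;
    auto t1 = std::chrono::steady_clock::now(); // TCP handshake done

    SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());
    SSL *ssl = SSL_new(ctx);
    SSL_set_fd(ssl, fd);
    if (SSL_connect(ssl) != 1) return 1;        // full TLS handshake
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("TCP connect: %.2f ms, TLS handshake: %.2f ms\n",
           ms(t1 - t0).count(), ms(t2 - t1).count());

    SSL_shutdown(ssl); SSL_free(ssl); SSL_CTX_free(ctx);
    close(fd); freeaddrinfo(res);
    return 0;
}

The gap between the two numbers is roughly what the extra handshake
round trips plus the asymmetric crypto cost you per connection.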

Certificate generation is a well-known slow process; there is little
that can be done there, as it relies heavily on the machine's random
number generator. SSL-bump with certificate generation uses caching to
avoid that to some extent - it would be worthwhile testing how often (if
at all) your benchmarks are held up waiting for new certs to be created.
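If you want a feel for how big that stall is, you can time the
expensive step in isolation. A sketch of a 2048-bit RSA key generation
using OpenSSL's EVP API (the generation step behind a fresh cert; run
it a few times, as the cost varies a lot between runs):

#include <openssl/evp.h>
#include <openssl/rsa.h>
#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();

    EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_RSA, nullptr);
    EVP_PKEY *pkey = nullptr;
    if (!ctx || EVP_PKEY_keygen_init(ctx) <= 0 ||
        EVP_PKEY_CTX_set_rsa_keygen_bits(ctx, 2048) <= 0 ||
        EVP_PKEY_keygen(ctx, &pkey) <= 0)
        return 1;

    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = t1 - t0;
    printf("RSA-2048 keygen: %.1f ms\n", elapsed.count());

    EVP_PKEY_free(pkey);
    EVP_PKEY_CTX_free(ctx);
    return 0;
}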


2b) Poor concurrency scale. Some Squid code becomes slower with more
concurrent transactions flying around because that code has to iterate
more structures while dealing with more collisions and such.

Well, all that can be done on this front is to wait for the changes to
go in.

There is nothing we can do about #1, but we can improve #2a and #2b
(they are kind of related).


Best-effort tests also give a good measure of what the proxy (server)
can do without breaking it.
Yes, but, in my experience, the vast majority of best-effort results are
misinterpreted: It is very difficult to use a best-effort test
correctly, and it is very easy to come to the wrong conclusions by
looking at its results. YMMV.

Do you see any wrong conclusions that I might have drawn in
interpreting these results?

BTW, a "persistent load" test does not have to break the proxy. You only
need to break the proxy if you want to find where its breaking point
(and, hence, the bottleneck) is with respect to load (or other traffic
parameters).


Sure

Do you see any reason from the perf results/benchmarks why Squid would
not be utilizing all CPU and serving more requests per second?
In our tests, Squid does utilize virtually all CPU cycles (if we push it
hard enough). It is just a matter of creating enough/appropriate load.

Why would it not in my test setup? It does use all CPU cores to the
fullest in the case of HTTPS, but not in the case of HTTP, as I pointed
out earlier.

You are not caching, for starters. So Squid will service all requests
with the I/O overhead of contacting the backend server.
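A minimal squid.conf fragment to turn caching on for comparison (the
directives are real; the path and sizes are illustrative, adjust for
your box):

cache_mem 256 MB                              # in-memory object cache
maximum_object_size_in_memory 512 KB
cache_dir aufs /var/spool/squid 1024 16 256   # 1 GB disk cache

With a cache-friendly benchmark URL set, repeat hits are then served
without the backend round trip, which changes the CPU/throughput
profile considerably.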

Amos



