Re: Squid performance profiling

On 20/06/2013 8:00 p.m., Ahmed Talha Khan wrote:
Hello All,

I have been trying to benchmark the performance of squid for some time
now for plain HTTP and HTTPS traffic.

The key performance indicators that I am looking at are Requests Per
Second (RPS), Throughput (mbps) and Latency (ms).

My test methodology looks like this

generator(apache benchmark)<------->squid<------>server(lighttpd)


All 3 are running on separate VMs on AWS.
The specs for all the machines are
8 VCPU @ 2.13 GHZ
16 GB RAM
Squid using 8 SMP workers to utilize all cores

Using 8 workers is probably not a good idea. The recommended practice is to use one core per worker and leave at least one spare core for the kernel's usage. Squid does pass a fair chunk of work to the kernel for I/O, while each worker will completely max out as many CPU cycles as it can grab from its own core. If there is no core retained for kernel usage, those two properties will result in CPU contention slowdown as Squid and the kernel fight for cycles.
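For an 8-core VM that advice would look something like the squid.conf sketch below (the worker count and core numbers are illustrative assumptions, adjust them to your topology). It runs 7 workers pinned to cores 2-8, leaving core 1 free for the kernel:

```
# squid.conf (sketch): 7 workers on an 8-core box, one core left for the kernel
workers 7
# cpu_affinity_map pins each worker process to its own core (1-based numbering)
cpu_affinity_map process_numbers=1,2,3,4,5,6,7 cores=2,3,4,5,6,7,8
```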


In all these tests I have made sure that the generator and server are
always more powerful than squid. For latency calculation, Time per
request is calculated with and without squid inline and the difference
between them is taken.
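As a sketch of that methodology with apache benchmark (the hostnames, port and object URL below are placeholders, not from the original post), the proxied run can be routed through Squid with ab's -X option and the two "Time per request" figures subtracted:

```shell
# Baseline: generator talks to the origin server directly
ab -n 100000 -c 50 -k http://origin.example.com/object-200b

# Same workload routed through Squid via ab's -X proxy option
ab -n 100000 -c 50 -k -X squid.example.com:3128 http://origin.example.com/object-200b
```

The per-request latency attributed to Squid is then the difference between the two "Time per request" values.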

I am using a release 3.HEAD just prior to the release of 3.3.

Then please upgrade to the 3.3 stable release or a current 3.HEAD. A few memory leaks and other issues have been resolved in the time since 3.3 was released, and those fixes are in the current stable. There are also additional performance improvements in the current 3.HEAD which will be in 3.4 when it branches.


I want to share the results with the community on the squid wikis. How
to do that?

We are collecting some ad-hoc benchmark details for Squid releases at http://wiki.squid-cache.org/KnowledgeBase/Benchmarks. So far this is not exactly rigorous testing, although following the methodology for stats collection (as outlined in the intro section) retains consistency and improves comparability between submissions.

Since you are using a different methodology, please feel free to write up a new article on it. The details you just posted look like a good start. We can offer a wiki or static web page, or a reference from our benchmarking page to a blog publication of your own.

If you are intending to publish the results, I highly recommend that you settle on a packaged and numbered version of Squid so others can replicate the tests or do additional comparative testing on the same code. 3.HEAD is a rolling release for which it is relatively difficult to locate the exact sources of any given revision; the numbered packages can be referenced from our permanent archives in your description.


Some results from the tests are:

Server response size = 200 Bytes
New = keep-alive turned off
Keep-Alive = keep-alive used with 100 HTTP req/conn
C = concurrent requests

                          HTTP                   HTTPS
                     New  | Keep-Alive      New  | Keep-Alive

RPS
  c = 50            6466  | 20227           1336 | 14461
  c = 100           6392  | 21583           1303 | 14683
  c = 200           5986  | 21462           1300 | 13967

Throughput (mbps)
  c = 50              26  | 82.4             5.4 | 59
  c = 100           25.8  | 88              5.25 | 60
  c = 200             24  | 88               5.4 | 58

Latency (ms)
  c = 50             7.5  | 2.7               36 | 3.75
  c = 100           15.8  | 5.27              80 | 8
  c = 200           26.5  | 11.3             168 | 18


With these results I profiled squid with the "perf" tool and got some
results that I could not understand, so my questions are related to
them.

Thank you. Some very nice numbers. I hope they give a clue to anyone still thinking persistent connections need to be disabled to improve performance.

For the HTTPS case, the CPU utilization peaks around 90% on all cores
and the perf profiler gives:

24.63%    squid  libc-2.15.so         [.] __memset_sse2
 6.13%    squid  libcrypto.so.1.0.0   [.] bn_sqr4x_mont
 4.98%    squid  [kernel.kallsyms]    [k] hypercall_page
           |
           --- hypercall_page
              |
              |--93.73%-- check_events


Why is so much time spent in one function by squid? And in a memset
call at that! Any pointers?

Squid was originally written in C and still has a lot of memset() calls around the place clearing memory before use. We have made a few attempts to track them down and remove unnecessary usages, but a lot still remain. Another attempt was made in the more recent code, so you may find a lower profile rating in the current 3.HEAD.

Also check whether you have memory_pools on or off. That can affect the number of calls to memset().
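For reference, the relevant directives look like this (a squid.conf sketch; the 64 MB limit is an arbitrary example value, not a recommendation from this thread):

```
# squid.conf (sketch): keep freed objects pooled for reuse rather than
# returning them to the OS, reducing repeated allocate/clear cycles
memory_pools on
memory_pools_limit 64 MB
```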

Since in this case all CPU power is being used, it is understandable
that the performance cannot be improved here. The problem arises with
the HTTP case.

On the contrary, code improvements can be done to reduce Squid's CPU cycle requirements, which in turn raises performance. If your profiling can highlight things like memset() or Squid functions in the current code that consume large amounts of CPU, effort can be targeted at reducing those occurrences for the best work/performance gains.

For the plain HTTP case, the CPU utilization is only around 50-60% on
all the cores and perf says:


 8.47%    squid  [kernel.kallsyms]    [k] hypercall_page
           --- hypercall_page
           |--94.78%-- check_events

 1.78%    squid  libc-2.15.so         [.] vfprintf
 1.62%    squid  [kernel.kallsyms]    [k] xen_spin_lock
 1.44%    squid  libc-2.15.so         [.] __memcpy_ssse3_back


These results show that squid is NOT CPU bound at this point. Neither
is it network IO bound, because I can get much more throughput when I
only run the generator with the server. In this case squid should be
able to do more. Where is the bottleneck coming from?

Your guesses would seem to be in the right direction. Your data should contain hints about where to look closer. memcpy() and memory paging being so high is a suspicious hint.
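One way to look closer (a sketch; the process-name pattern used with pgrep is an assumption about how your SMP worker processes are named) is to capture call graphs from a single worker and then expand the memcpy entry in the report to see which Squid callers dominate:

```shell
# Sample one running worker for 30 seconds with call-graph recording
# (SMP kid processes typically show up as "(squid-1)", "(squid-2)", ...)
perf record -g -p "$(pgrep -f 'squid-1' | head -n1)" -- sleep 30

# Browse the samples; expanding __memcpy_ssse3_back shows its callers
perf report
```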


If anyone is interested with very detailed benchmarks, then I can provide them.

Yes please :-)

PS. Could you CC the squid-dev mailing list as well with the details? The more developer eyes we can get on this data the better. Although please do test a current release first; we have significantly changed the ACL handling, which was one bottleneck in Squid, and have altered mempools' use of memset() in several locations in the latest 3.HEAD code.

Amos

