On 20/06/2013 8:00 p.m., Ahmed Talha Khan wrote:
Hello All,
I have been trying to benchmark the performance of squid for some time
now for plain HTTP and HTTPS traffic.
The key performance indicators that I am looking at are Requests Per
Second (RPS), Throughput (mbps) and Latency (ms).
My test methodology looks like this
generator(apache benchmark)<------->squid<------>server(lighttpd)
All three are running on separate VMs on AWS.
The specs for all the machines are
8 VCPU @ 2.13 GHZ
16 GB RAM
Squid using 8 SMP workers to utilize all cores
Using 8 workers is probably not a good idea. The recommended practice is
to use one core per worker and leave at least one spare core for the
kernel's usage. Squid passes a fair chunk of work to the kernel for
I/O, while each worker will completely max out as many CPU cycles as it
can grab from its own core. If no core is retained for kernel
usage, those two properties will result in CPU contention slowdown as
Squid and the kernel fight for cycles.
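On an 8-vCPU instance, that advice might translate into a squid.conf fragment like the following. The worker count and core numbers here are illustrative, not a tested configuration:

```
# 7 workers, leaving one core free for the kernel's I/O work
workers 7

# pin each worker to its own core (Squid numbers cores starting from 1);
# core 1 is left unpinned for the kernel
cpu_affinity_map process_numbers=1,2,3,4,5,6,7 cores=2,3,4,5,6,7,8
```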
In all these tests I have made sure that the generator and server are
always more powerful than squid. For latency calculation, Time per
request is calculated with and without squid inline and the difference
between them is taken.
I am using a release 3.HEAD just prior to the release of 3.3.
Then please upgrade to the 3.3 stable release or a current 3.HEAD. A few
memory leaks and other issues have been resolved in the time since 3.3
was released, and those fixes are in the current stable. There are also
additional performance improvements in the current 3.HEAD which will be
in 3.4 when it branches.
I want to share the results with the community on the squid wiki. How
do I do that?
We are collecting some ad-hoc benchmark details for Squid releases at
http://wiki.squid-cache.org/KnowledgeBase/Benchmarks. So far this is not
exactly rigorous testing, although following the methodology for
stats collection (as outlined in the intro section) retains consistency
and improves comparability between submissions.
Since you are using a different methodology, please feel free to write
up a new article on it. The details you just posted look like a good
start. We can offer a wiki or static web page, or a reference from our
benchmarking page to a blog publication of your own.
If you are intending to publish the results I do highly recommend that
you settle on a packaged and numbered version of Squid, so others can
replicate the tests or do additional comparative testing on the same
code. 3.HEAD is a rolling release for which it is relatively difficult
to locate the exact sources of any given revision; the numbered packages
can be referenced from our permanent archives in your description.
Some results from the tests are:
Server response size = 200 Bytes
New means keep-alive was turned off
Keep-Alive means persistent connections were used, with 100 HTTP requests per connection
c = concurrent requests
                     HTTP                      HTTPS
               New    | Keep-Alive       New    | Keep-Alive

RPS
c=50           6466   | 20227            1336   | 14461
c=100          6392   | 21583            1303   | 14683
c=200          5986   | 21462            1300   | 13967

Throughput (mbps)
c=50           26     | 82.4             5.4    | 59
c=100          25.8   | 88               5.25   | 60
c=200          24     | 88               5.4    | 58

Latency (ms)
c=50           7.5    | 2.7              36     | 3.75
c=100          15.8   | 5.27             80     | 8
c=200          26.5   | 11.3             168    | 18
With these results I profiled squid with the "perf" tool and got some
results that I could not understand. So my questions are related to
them.
Thank you. Some very nice numbers. I hope they give a clue to anyone
still thinking persistent connections need to be disabled to improve
performance.
For the HTTPS case, the CPU utilization peaks around 90% on all cores
and the perf profiler gives:
24.63% squid libc-2.15.so [.] __memset_sse2
6.13% squid libcrypto.so.1.0.0 [.] bn_sqr4x_mont
4.98% squid [kernel.kallsyms] [k] hypercall_page
|
--- hypercall_page
|
|--93.73%-- check_events
Why is so much time being spent in one function by squid? And in a
memset at that! Any pointers?
Squid was originally written in C and still has a lot of memset() calls
around the place, clearing memory before use. We have made a few attempts
to track them down and remove unnecessary usages, but a lot still remain.
Another attempt was made in the more recent code, so you may find a
lower profile rating in the current 3.HEAD.
Also check whether you have memory_pools on or off. That can affect the
number of calls to memset().
Since in this case all CPU power is being used, it is understandable
that the performance cannot be improved here. The problem arises with
the HTTP case.
On the contrary, code improvements can be made to reduce the CPU cycle
requirements of Squid, which in turn raises performance. If your
profiling can highlight things like memset() or Squid functions
currently consuming large amounts of CPU, effort can be targeted at
reducing those occurrences for the best work/performance gains.
For the plain HTTP case, the CPU utilization is only around 50-60% on
all the cores and perf says:
8.47% squid [kernel.kallsyms] [k] hypercall_page
--- hypercall_page
|--94.78%-- check_events
1.78% squid libc-2.15.so [.] vfprintf
1.62% squid [kernel.kallsyms] [k] xen_spin_lock
1.44% squid libc-2.15.so [.] __memcpy_ssse3_back
These results show that squid is NOT CPU bound at this point. Neither
is it network I/O bound, because I can get much more throughput when I
run the generator against the server directly. In this case squid should
be able to do more. Where is the bottleneck coming from?
Your guesses would seem to be in the right direction. Your data should
contain hints about where to look closer. memcpy() and memory paging
being so high are a suspicious hint.
If anyone is interested with very detailed benchmarks, then I can provide them.
Yes please :-)
PS. could you CC the squid-dev mailing list as well with the details?
The more developer eyes we can get on this data the better. Although
please do test a current release first: we have significantly changed
the ACL handling, which was one bottleneck in Squid, and have altered the
mempools use of memset() in several locations in the latest 3.HEAD code.
Amos