Squid performance issues

Felipe W Damasio <felipewd@xxxxxxxxx> · Tue, 26 Jan 2010 00:37:16 -0200

 Hi all,

 Sorry for the long email.

 I'm using squid on a 300Mbps ISP with about 10,000 users.

 I have an 8-core Intel i7 machine with 8GB of RAM and 500 of HD for
the cache (a dedicated SATA disk with xfs), using aufs as storeio.

 I'm caching mostly multimedia files (youtube and such).

 Squid usually eats around 50-70% of one core.

 But always around midnight (when a lot of users browse the internet),
my squid becomes very slow. I mean, a page that usually takes 0.04s
to load takes 23 seconds to load.

 My best guess is that the volume of traffic is making squid slow.

 I'm using a 2.6.29.6 vanilla kernel with tproxy enabled for squid.
And I'm using these /proc configurations:

echo 0 > /proc/sys/net/ipv4/tcp_ecn
echo 1 > /proc/sys/net/ipv4/tcp_low_latency
echo 100000 > /proc/sys/net/core/netdev_max_backlog
echo 409600  > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo 7 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 15 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes
echo 65536 > /proc/sys/vm/min_free_kbytes
echo "262144 1024000 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "262144 1024000 4194304" > /proc/sys/net/ipv4/tcp_wmem
echo "1024000" > /proc/sys/net/core/rmem_max
echo "1024000" > /proc/sys/net/core/wmem_max
echo "512000" > /proc/sys/net/core/rmem_default
echo "512000" > /proc/sys/net/core/wmem_default
echo "524288" > /proc/sys/net/ipv4/netfilter/ip_conntrack_max
echo "3" > /proc/sys/net/ipv4/tcp_synack_retries

 The machine is in bridge-mode.

 I wrote a little script that prints:

 - The date;
 - The "/usr/bin/time squidclient http://www.amazon.com";
 - The number of ESTABLISHED connections (through netstat -an);
 - The number of TIME_WAIT connections;
 - The total number of netstat connections;
 - The route cache (ip route list cache);
 - The number of clients currently connected in squid (through mgr:info);
 - The amount of free memory in MB (free -m);
 - The % used of the squid-running core;
 - The average time to respond to a request, in seconds (mgr:info
also) - 5 minutes avg;
 - The average number of http requests / sec (5 minutes avg) - mgr:info as well.
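
 For illustration, a minimal sketch of how part of such a script could
be put together. This is not the exact script: the helper names
(count_state, fetch_seconds, join_fields, sample) are made up for the
example, it assumes GNU time (for -f %e), and it only covers the date,
the fetch time, the ESTABLISHED and TIME_WAIT counts and the free
memory; the remaining fields are left out.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not the original script.

# Count TCP sockets in a given state; expects `netstat -an` output on stdin.
count_state() {
    awk -v state="$1" '$1 ~ /^tcp/ && $6 == state { n++ } END { print n + 0 }'
}

# Elapsed wall-clock seconds for one fetch through the proxy.
# Assumes GNU time, whose -f %e prints the elapsed seconds on stderr.
fetch_seconds() {
    /usr/bin/time -f %e squidclient "$1" 2>&1 >/dev/null | tail -n 1
}

# Join all arguments with " ; ", the separator used in the records.
join_fields() {
    out=$1
    shift
    for field in "$@"; do
        out="$out ; $field"
    done
    printf '%s\n' "$out"
}

# Take one sample and print it as a single record (subset of the fields).
sample() {
    conns=$(netstat -an)
    join_fields \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(fetch_seconds http://www.amazon.com)" \
        "$(printf '%s\n' "$conns" | count_state ESTABLISHED)" \
        "$(printf '%s\n' "$conns" | count_state TIME_WAIT)" \
        "$(free -m | awk '/^Mem:/ { print $4 }')"
}
```

 Calling sample from cron or from a loop with a sleep would give one
record per interval.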

 On any other hour, I have something like:

2010-01-25 18:48:19 ; 0.04 ; 19383 ; 9902 ; 29865 ; 96972 ; 4677 ; 131 ; 59 ; 0.24524 ; 476.871718
2010-01-25 18:53:29 ; 0.04 ; 18865 ; 8593 ; 30123 ; 179570 ; 4679 ; 148 ; 62 ; 0.22004 ; 504.424207
2010-01-25 18:58:38 ; 0.04 ; 18377 ; 9056 ; 29283 ; 99038 ; 4680 ; 174 ; 61 ; 0.22004 ; 466.659336
2010-01-25 19:03:49 ; 0.04 ; 18877 ; 9133 ; 28327 ; 181196 ; 4673 ; 171 ; 57 ; 0.24524 ; 483.558436

 So, it takes around 0.04s to get http://www.amazon.com.

2010-01-24 23:46:50 ; 2.53 ; 22723 ; 9861 ; 35012 ; 64752 ; 4306 ; 166 ; 70 ; 0.22004 ; 566.364274
2010-01-24 23:52:04 ; 3.74 ; 21173 ; 10256 ; 33242 ; 167594 ; 4309 ; 169 ; 68 ; 0.20843 ; 537.758601
2010-01-24 23:57:20 ; 0.08 ; 18691 ; 9050 ; 29590 ; 65496 ; 4312 ; 138 ; 71 ; 0.20843 ; 525.119006
2010-01-25 00:02:29 ; 15.54 ; 18016 ; 8209 ; 29035 ; 149248 ; 4318 ; 160 ; 82 ; 0.25890 ; 491.615241

 As I said, it goes from 0.04 to 15.54s(!) to get a single html file.
Horrible. After 00:30, everything goes back to normal.

 From those variables, I can't find any indication of what could be
causing this appalling slowdown. The number of squid users doesn't go
up that much; I just see that the avg time squid reports for answering
a request goes from 0.20s to 0.25s, and the number of http
requests/sec actually goes down from 566 to 491, which is kind of odd
to me. And the number of users using squid stays at around 4300.
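
 To line the slow windows up against the other columns, a small filter
over the collected records can help. A sketch, assuming every sample
is stored as one " ; "-separated line with the fetch time in the second
field; the function name slow_samples and the file name samples.log
are made up for the example.

```shell
# Print the timestamp and fetch time of every sample slower than a
# threshold in seconds (first argument, default 1).
# Reads " ; "-separated records on stdin, one record per line.
slow_samples() {
    awk -F ' *; *' -v limit="${1:-1}" '$2 + 0 > limit + 0 { print $1, $2 }'
}
```

 For example, "slow_samples 1 < samples.log" would list only the
samples where the fetch took longer than one second.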

 I talked to Mr. Dave Dykstra, and he thought it could be I/O delay
issues. So I tried:

cache_dir null /tmp
cache_access_log none
cache_store_log none

  But no luck: at midnight tonight things went wild again:

2010-01-25 23:57:03 ; 0.04 ; 24112 ; 11330 ; 37240 ; 74456 ; 3516 ; 160 ; 58 ; 0.25890 ; 581.047037
2010-01-26 00:02:15 ; 10.82 ; 25638 ; 11695 ; 38537 ; 177198 ; 3533 ; 149 ; 78 ; 0.27332 ; 570.312936
2010-01-26 00:07:38 ; 42.64 ; 23818 ; 11563 ; 38097 ; 88902 ; 3556 ; 171 ; 70 ; 0.30459 ; 585.880418

  From 0.04 to 42 seconds to load the main html page of amazon.com. (!)

  Do you have any idea what could cause this, or any other data I can
collect to try and track this down?

  I'm using squid-2.7.stable7, but I'm willing to try squid-3.0 or
squid-3.1 if you guys think it could help.

  I'm using 2 gigabit Marvell Ethernet boards with the sky2 driver. I
don't know if that's relevant, though.

  If you guys need any more info to try and help me figure this out, please ask.

  I'm willing to test, code or do pretty much anything to make squid
perform better in my environment. Please let me know how I can help you
help me. :-)

  Thanks!

Felipe Damasio