Re: TIMEOUT_ROUNDROBIN_PARENT and poor SIBLING_HIT performance

"M. Leong Lists" <leongmzlist@xxxxxxxxx> · Thu, 24 Feb 2011 11:35:22 -0800

The LB does periodic health checks of the backend and marks out any 
backend not responding in time.  Would you recommend using squid to 
connect directly to the backend and use the monitorurl parameter 
instead?  The origin servers are on the same subnet as the squid cluster.
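
For illustration, a direct peer with health monitoring might look like the sketch below. It assumes the monitor options of the Squid 2.7 cache_peer directive; the hostname backend01.example.com, the /health path, and the interval values are hypothetical placeholders, not taken from this setup.

```conf
# Hypothetical backend host and health-check URL; point these at a real tomcat node.
# monitorinterval and monitortimeout are in seconds.
cache_peer backend01.example.com parent 80 0 round-robin no-query originserver monitorurl=http://backend01.example.com/health monitorinterval=10 monitortimeout=3 name=backend_01
```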

I turned off connection persistence so that squid and clients reconnect 
faster if a backend dies.  From past experience, with persistence 
enabled it took a lot longer for the connections to break and move 
to another server.  Is there a config option that sets how long to wait 
before timing out? (the read_timeout parameter?)
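
The timeouts that decide how quickly an unresponsive server is abandoned can be set explicitly. The following is a sketch only, assuming the Squid 2.7 directive names; the values are illustrative and not tuned recommendations.

```conf
# How long to wait for a TCP connection to an origin server to complete.
connect_timeout 10 seconds
# The same limit for connections to cache_peer entries.
peer_connect_timeout 5 seconds
# How long an established server-side connection may stay silent
# between reads before the request is aborted.
read_timeout 2 minutes
```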

thx
mike

On 02/23/2011 04:34 PM, Amos Jeffries wrote:
On Wed, 23 Feb 2011 14:55:25 -0800, M. Leong Lists wrote:
Hi,

I have two problems where squid takes an excessive amount of time to 
service a request.

My setup:
-Accelerator setup
-backends are on load balancer, squid is configured to connect to the
load balancer IP multiple times
-squid's configured to store the cache as long as possible.
-the icp timeout is set really high, otherwise some siblings don't
respond in time.  Should this be lowered?

Think about that a bit:
 * the sibling is taking a very long time to respond to a single ICP 
packet.
   What do you think the speed to it will be like when you send a 
whole bunch of request and reply packets? Better/same/worse?

So in the end do you think it is a better idea to ICP-timeout and mark 
the peers as down/unusable fast and move on to the alternatives? or to 
keep waiting?
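
A sketch of what shorter ICP timing could look like in squid.conf. The numbers are illustrative assumptions, not measured values for this cluster.

```conf
# Wait at most 500 ms for ICP replies before moving on (value is in milliseconds).
icp_query_timeout 500
# Treat a peer as dead after this long without any ICP reply from it.
dead_peer_timeout 10 seconds
```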

Version:
Squid Cache: Version 2.7.STABLE9
configure options:  '--prefix=/apps/squid'
'--enable-x-accelerator-vary' '--enable-linux-netfilter'
'--enable-cache-digests' '--enable-htcp' '--enable-snmp'
'--enable-referer-log' '--enable-useragent-log' '--enable-delay-pools'
'--enable-icmp' '--enable-async-io=500' '--with-maxfd=10240'
'--enable-removal-policies=lru,heap' '--enable-follow-x-forwarded-for'
'--enable-epoll' '--with-large-files'

Relevant config:

http_port 80 vhost defaultsite=cache.example.com
cache_mem 512 MB

cache_peer lb.example.com parent 80 0 round-robin no-query
originserver no-netdb-exchange no-digest name=lb_01

... <snip>

cache_peer lb.example.com parent 80 0 round-robin no-query
originserver no-netdb-exchange no-digest name=lb_10

So you are manually load-balancing the way connections are made to a 
load balancer. Why? What happens if you remove these duplicate peer 
links?

NP: squid defaults to 10 connection attempts to each peer before it 
gives up. So you have potentially a grand total of 100 TCP connections 
made through the LB before the request fails.

Update the LB to only make connection attempts to working sources, and 
reference it only once from Squid. If it is already applying that smart 
logic, this configuration setup is not of much use.

Or if the LB is not smart enough to do that kind of control it is of 
less use than the built-in load-balancing which Squid does. Discard it 
and just use the round-robin selection directly to the peers behind 
the LB. All the problems you have with end-to-end path discovery, 
connection up/down status and persistence will disappear.
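
As a rough sketch of that second option, the duplicate LB entries could be replaced by one cache_peer line per real backend. The tomcat hostnames below are hypothetical placeholders; the options simply mirror the ones already used for the LB peers.

```conf
# Hypothetical tomcat hosts that currently sit behind the LB.
cache_peer tomcat01.example.com parent 80 0 round-robin no-query originserver no-netdb-exchange no-digest name=tomcat_01
cache_peer tomcat02.example.com parent 80 0 round-robin no-query originserver no-netdb-exchange no-digest name=tomcat_02
cache_peer tomcat03.example.com parent 80 0 round-robin no-query originserver no-netdb-exchange no-digest name=tomcat_03
```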

cache_peer cache01.example.com sibling 80 3130 proxy-only no-delay
allow-miss weight=1 no-netdb-exchange no-digest name=cache01

..<snip>

cache_peer cache08.example.com sibling 80 3130 proxy-only no-delay
allow-miss weight=1 no-netdb-exchange no-digest name=cache08

client_persistent_connections off
server_persistent_connections off

The above will be part of your lag problem. I know why you do it: 
persistent connections and load balancing do not work together very 
easily. I am just saying that it will be a factor in the problem.
Your Squid is reduced to a pure HTTP/1.0 level of efficiency with TCP 
handshakes (possibly multiple) being done with every single client 
request. All the HTTP/1.1 efficiency features to maintain long-term 
persistent connections become a net loss of performance when 
connections are forced closed all the time.

You should be able to re-enable persistent connections to clients 
without problem. Given a reasonable timeout, this lets clients 
pipeline requests through their connection to Squid without idle 
connections lingering for long periods. It has no effect on the 
server-facing connections or the LB.
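
A minimal sketch of that change, assuming the Squid 2.7 directive names; the 30 second idle timeout is an illustrative value only.

```conf
# Re-enable keep-alive towards clients only.
client_persistent_connections on
# Close an idle client connection if no new request arrives within this time.
persistent_request_timeout 30 seconds
# Server-side connections through the LB stay non-persistent, as in the posted config.
server_persistent_connections off
```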

digest_generation off

icp_access allow all
icp_hit_stale on
icp_query_timeout 7000
maximum_icp_query_timeout 10000
nonhierarchical_direct off
url_rewrite_host_header off

offline_mode on
--------------------------------------

TIMEOUT_ROUNDROBIN_PARENT

All the TIMEOUT requests took at least 7000 ms, which is the value of
icp_query_timeout.  Some requests took over 30 sec to complete.  I
cross-referenced those long requests against the backends and noticed
a big mismatch in the times.  The backends are tomcat apps w/ Java
1.6.  I extracted the times from the tomcat access log.

Squid time (ms)    Backend time (ms)
7922               924
8422               1421
7488               487
12835              5833
25098              18096
34793              611
21806              14804

The time difference will be inflated by the time Squid spends waiting 
for a TCP handshake to occur on every connection. This is the full RTT 
of three packets to cycle Squid->LB->tomcat and back again.  Multiply 
that by the 10-100 new connections your Squid is configured to make to 
the LB before aborting with failure.

------------------------------------
High SIBLING_HIT response time:

The same problem occurs with sibling hits.  The processing time logged
on the sibling and on the squid requesting from it differ vastly:

Squid time    Time in sibling's log
4534          30
23994         12959
6661          40

---------------
Does anyone know of a reason why it would take so long for squid to
complete a request??

mike

That's all I can think of off the top of my head, maybe more later.
Good luck.

Amos