Re: TCP_HIT/504 problem with small Squid cluster

Amos Jeffries <squid3@xxxxxxxxxxxxx> · Thu, 22 Oct 2009 00:27:56 +1300

Robert Knepp wrote:
Hi - first time poster so be gentle.

Some general info regarding my setup:

0) Running Squid 2.7 in reverse proxy mode
1) Each Squid is configured to use its local webserver on 127.0.0.1
as the origin server and the other servers in the farm as siblings
2) This Squid cache is transparent to the end-user (although I do pass
along a select few cache controls such as if-none-match).

Bad word! Bad word! Squid in this context is a "reverse proxy". To all
intents and purposes, as far as the client can see, it is the web server.

None of the various common meanings of "transparent" have anything to do
with it.

3) It is protected behind local AUTH applications which perform
complex access checks before passing the request onto Squid

You might be able to reduce your server box overheads by merging that
into a Squid auth helper. That may or may not be a big job.

4) All documents will be requested and cached as
[http://127.0.0.1/URL] so Squid is really only serving a single domain

************************************************************************************************
Transparent Proxy Cluster

                              [user agent]

                                    |
                                    v

                             [Load Balancer]

                                    |
      -------------------------------------------------------------
      |                   |                   |                   |
      v                   v                   v                   v

 [WEB1-AUTH]         [WEB2-AUTH]         [WEB3-AUTH]         [WEB4-AUTH]

      |                   |                   |                   |
      v                   v                   v                   v

  [SQUID1]    (icp)   [SQUID2]    (icp)   [SQUID3]    (icp)   [SQUID4]

      |                   |                   |                   |
      v                   v                   v                   v

 [WEB1-ORIG]         [WEB2-ORIG]         [WEB3-ORIG]         [WEB4-ORIG]

************************************************************************************************

Here is a simplified squid.conf from the first server (all others have
the same settings except the sibling list is shifted).

#------
http_port 3128 act-as-origin accel vhost http11

The use of vhost here forces Squid to process the Host: header and cache
URLs with its content as the domain name.

To meet criterion (4), "All documents ... cached as [http://127.0.0.1/URL]",
you need to be using:

  http_port 80 act-as-origin accel http11 defaultsite=127.0.0.1

icp_port 3130
cache_dir ufs /cache/data 2048 16 256

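aufs please. Plain ufs blocks the whole Squid process on every disk read
and write; aufs hands that I/O to a pool of threads.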
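
For reference, a minimal sketch of the posted cache_dir line with only the store type swapped. The path and sizes are the ones from the posted config; it assumes this Squid build has the aufs store module compiled in:

```
# same path, size (MB) and L1/L2 directory counts as posted;
# only the storage type changes from ufs to aufs
cache_dir aufs /cache/data 2048 16 256
```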

cache_mem 8 GB
request_timeout 5 seconds
persistent_request_timeout 5 seconds
refresh_pattern .       0       20%     4320
negative_ttl 0

acl all src all
acl localhost src 127.0.0.1/xx

WTF? why fudge a mask value that is only relevant to a sealed 
machine-local IP address?
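
As a sketch of what that ACL would usually look like, assuming the intent is to match only the loopback address itself (a /32 mask covers exactly one address):

```
# matches 127.0.0.1 and nothing else
acl localhost src 127.0.0.1/32
```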

acl localnet src 127.0.0.1/xx
acl localnet src xxxxxxxxxxxxx
acl Safe_ports port 3128
acl Safe_ports port 80
http_access allow localhost
http_access deny !Safe_ports
http_access allow localnet
http_access deny all
icp_access allow localnet
icp_access deny all

## Origin server
cache_peer 127.0.0.1 parent 80 0 name=localweb max-conn=250 no-query no-netdb-exchange originserver http11
cache_peer_access localweb allow localnet
cache_peer_access localweb deny all
## Sibling Caches
#   cache_peer [IP_OF_SIBLING_1] sibling 3128 3130 proxy-only
cache_peer [IP_OF_SIBLING_2] sibling 3128 3130 proxy-only
cache_peer [IP_OF_SIBLING_3] sibling 3128 3130 proxy-only
cache_peer [IP_OF_SIBLING_4] sibling 3128 3130 proxy-only

Rule #1 of reverse proxies:
   If the reverse-proxy rules are not above the generic forward-proxy
rules, requests risk getting false error pages.
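
A sketch of that ordering, reusing the ACL and peer names from the posted config. The `oursite` ACL is a made-up name for illustration, and letting every client through to it assumes Squid is only reachable from the AUTH layer and the siblings:

```
# hypothetical ACL: the single domain this accelerator serves
acl oursite dstdomain 127.0.0.1

# reverse-proxy rules first
http_access allow oursite
cache_peer_access localweb allow oursite
cache_peer_access localweb deny all

# generic forward-proxy rules afterwards
http_access allow localhost
http_access deny !Safe_ports
http_access allow localnet
http_access deny all
```

http_access is evaluated top to bottom and stops at the first match, so a request for the accelerated site is accepted before any of the generic deny lines can reject it.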

************************************************************************************************
<snip duplicate paste>

************************************************************************************************

So......  I have a 'few' questions regarding my setup and how I might
be able to improve on it.

- Does the ICP sibling setup make sense or will it limit the number
of servers in the cluster? Or should this be redesigned to work with
multiple parent caches instead of siblings? Or perhaps multicast ICP?
Or I could try digests?

You want it to be scalable AND fast? multicast or digests.

You want to maximize bandwidth capacity? digests or CARP.
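
As a rough sketch of what the multicast and digest options could look like in squid.conf. The group address 239.128.16.128 and ttl=16 are placeholder values, the [IP_OF_SIBLING_n] placeholders are the ones from the posted config, and the digest variant assumes a build with --enable-cache-digests. The two options are alternatives, not something to paste in together:

```
## Option A: multicast ICP
# every Squid joins the group so it can hear queries
mcast_groups 239.128.16.128
# queries are sent once, to the group address
cache_peer 239.128.16.128 multicast 3128 3130 ttl=16
# each real sibling is marked as a group member whose replies are accepted
cache_peer [IP_OF_SIBLING_2] sibling 3128 3130 proxy-only multicast-responder

## Option B: cache digests instead of per-request ICP
digest_generation on
# no-query stops ICP queries to this peer; its digest is consulted instead
cache_peer [IP_OF_SIBLING_2] sibling 3128 3130 proxy-only no-query
```

CARP is left out of the sketch because it applies to parent peers, which would mean a different layout from the sibling mesh shown in the diagram.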

- Would using 'icp_hit_stale' and 'allow-miss' improve hit-ratios
between the shards? Is there a way to force a given Squid server to be
the ONLY server storing a cached document (stale, fresh, or
otherwise)?

icp_hit_stale allows peers to say "I have it!" when what they really
have is an old stale copy. Useful if the peer is close and the object
can be served stale while a better one is fetched. Bad if it causes
spreading of non-cacheable objects.

allow-miss allows peers to send the "I have it" message on stale
objects and then fetch a new copy from their fast source when they are
asked for the full thing. That refreshes the object in two caches
instead of just one, which mitigates the cost of that one fetch being
extra slow.
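
A minimal sketch of how the two settings pair up, using one of the sibling placeholders from the posted config. Whether this helps the hit ratio here is untested:

```
# on the Squid answering ICP: report a HIT even if the local copy is stale
icp_hit_stale on

# on the Squid asking: do not send only-if-cached to this sibling,
# so it may go and fetch a fresh copy when the "hit" turns out to be stale
cache_peer [IP_OF_SIBLING_2] sibling 3128 3130 proxy-only allow-miss
```

The answering sibling may also need a miss_access rule that permits the asking Squid to trigger fetches through it.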

- Using this basic setup for about a month now and I am getting
strange squid access.log entries when the load goes up:

2009-04-04 11:13:47 504 GET "http://127.0.0.1:3128/[URL]" TCP_HIT NONE 3018 0 "127.0.0.1" "127.0.0.1:3128" "-" "-"

This is due to your website being hosted on 127.0.0.1 port 3128.

The Host: header contains domain:port unless the port is the http 
default port 80.

The new http_port line I gave you above should fix this as a by-product.

  The gateway timeout lines appear during high load and are usually
(but not always) close to UDP_HIT entries on the same URI. In most
cases like this the document gets returned to the user with a status
of 200. It confused me since I thought TCP_HIT meant a cached
object was found locally and is being served.
  Or maybe it is related to a false UDP_HIT? Could this be network
related? Or can a slow response from the origin server cause this?

UDP_HIT - a sibling requested the object via ICP and was sent a positive 
answer that the object is stored in cache.

TCP_HIT - a client requested an object and was provided an object from 
cache.

http://wiki.squid-cache.org/SquidFaq/SquidLogs#Hierarchy_Codes

- Is 8 GB cache memory (out of 32GB in each box) going to cause
problems for Squid? And what happens if it fills up quickly?

Things get shoved out of the memory cache if too old or moved to disk 
cache if still possibly usable.

Anyway, I just wanted to throw out these questions for the experts.
I'll likely be trying some of these changes just to see the effects.

Feel free to chime in on any of this stuff.
Thanks in advance,
Rob

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE7 or 3.0.STABLE19
  Current Beta Squid 3.1.0.14