Re: detecting dead parent problem

Amos Jeffries <squid3@xxxxxxxxxxxxx> · Tue, 07 May 2013 14:56:02 +1200

On 7/05/2013 3:16 a.m., Rietzler, Markus (RZF, SG 324 / 
<RIETZLER_SOFTWARE>) wrote:
we have a setup with one squid (user-proxy) that connects to 4 parent proxies.

cache_peer proxy-inter1 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off
cache_peer proxy-inter2 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off
cache_peer proxy-inter3 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off
cache_peer proxy-inter4 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off

recently two of those 4 parents were gone. in cache log we saw messages like:

2013/05/06 16:27:33 TCP connection to proxy-inter4/8083 failed

and then after 10s or so (which should be the dead_parent_timeout)

2013/05/06 16:27:34 Detected DEAD Parent: proxy-inter4

that seems to be normal.

BUT
1) those messages reappear in cache.log again and again. normally we would expect them not to come at all unless the parent is detected as live again. many "TCP connection failed" and some times "DEAD parents"
2) browsing the web was extremely SLOW

we use squid 3.2.4 as user-proxy and the 4 parent proxies.

configure options:  '--enable-auth-basic=MSNT,SMB' '--enable-external-acl-helpers=ldap_group' '--enable-auth-basic' '--enable-auth-ntlm' '--enable-auth-negotiate=kerberos' '--enable-delay-pools' '--enable-follow-x-forwarded-for' '--enable-removal-policies=lru,heap' '--with-filedescriptors=4096' '--with-winbind' '--with-async-io' '--enable-storeio=ufs,aufs,diskd,rock' '--disable-ident-lookups' '--prefix=/www/squid' '--enable-underscores' '--with-large-files' 'PKG_CONFIG_PATH=/opt/gnome/lib64/pkgconfig:/opt/gnome/share/pkgconfig' --enable-ltdl-convenience

top on the two living parent proxies was ok.

we also have two development systems. one running squid 2.7.3 and one 3.2.9. the one with 3.2.9 showed some problems. many log entries in cache log and SLOW browsing. on the old squid browsing was no problem at all. all requests were fast enough. the old squid showed no messages in cache log after "DEAD parent". on both development systems only few (2-3) users were active.

any idea where to look?

Start with removing "no-query" from the cache_peer lines. One of the 
main purposes of peer queries is to determine UP/DEAD status. You can 
also tune the connect-fail-limit= option on cache_peer to reduce the 
number of failed connection attempts before the peer is declared DEAD.
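
As a rough sketch of what that could look like for one of the peers 
(the ICP port 3130 and the limit of 3 are illustrative assumptions, not 
values taken from your setup; the parents would also need to answer ICP, 
i.e. have icp_port set and icp_access allowing the user-proxy):

```
# assumption: the parent listens for ICP on 3130 (the usual default port)
# assumption: 3 failed connects is an acceptable threshold; the default is 10
cache_peer proxy-inter1 parent 8083 3130 sourcehash no-digest no-netdb-exchange connection-auth=off connect-fail-limit=3
```

The same change would apply to the other three cache_peer lines.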

FYI: the 3.2 forwarding path algorithm has been altered a fair bit, in a 
way which might account for the behaviour change. Namely, DNS is only 
looked up once per available path, and re-tries are done sequentially 
down the resulting set of IPs. 3.1 and older would do DNS lookups on 
every re-try, so you would easily get the 10 failed connects in a few ms 
while retrying a single request which never gets through. In 3.2 you 
will get 10 *different* requests trying the peer over a slightly longer 
time (better chance of detecting recovery from a short outage), each 
then getting serviced by a later path (hopefully more successful, and 
definitely less lag on errors than before).

You are using sourcehash, which is an algorithm that only produces *1* 
cache_peer as available for servicing a request. The behaviour change 
above will result in only that peer's IPs being tested on a request 
before other paths like DIRECT/DNS are tried. The hash will *not* be 
re-calculated for the request which failed to reach the peer.

Amos