On 7/05/2013 3:16 a.m., Rietzler, Markus (RZF, SG 324 /
<RIETZLER_SOFTWARE>) wrote:
we have a setup with one squid (user-proxy) that connects to 4 parent proxies.
cache_peer proxy-inter1 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off
cache_peer proxy-inter2 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off
cache_peer proxy-inter3 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off
cache_peer proxy-inter4 parent 8083 0 sourcehash no-query no-digest no-netdb-exchange connection-auth=off
recently two of those 4 parents were gone. in cache log we saw messages like:
2013/05/06 16:27:33 TCP connection to proxy-inter4/8083 failed
and then after 10s or so (which should be the dead_peer_timeout)
2013/05/06 16:27:34 Detected DEAD Parent: proxy-inter4
that seems to be normal.
BUT
1) those messages reappear in cache.log again and again. normally we would expect no further messages until the parent is detected as live again. instead we see many "TCP connection failed" and sometimes "DEAD Parent" messages.
2) browsing the web was extremely SLOW
we use squid 3.2.4 as user-proxy and the 4 parent proxies.
configure options: '--enable-auth-basic=MSNT,SMB' '--enable-external-acl-helpers=ldap_group' '--enable-auth-basic' '--enable-auth-ntlm' '--enable-auth-negotiate=kerberos' '--enable-delay-pools' '--enable-follow-x-forwarded-for' '--enable-removal-policies=lru,heap' '--with-filedescriptors=4096' '--with-winbind' '--with-async-io' '--enable-storeio=ufs,aufs,diskd,rock' '--disable-ident-lookups' '--prefix=/www/squid' '--enable-underscores' '--with-large-files' 'PKG_CONFIG_PATH=/opt/gnome/lib64/pkgconfig:/opt/gnome/share/pkgconfig' --enable-ltdl-convenience
top on the two living parent proxies was ok.
we also have two development systems, one running squid 2.7.3 and one 3.2.9. the one with 3.2.9 showed problems as well: many log entries in cache.log and SLOW browsing. on the old squid browsing was no problem at all. all requests were fast enough. the old squid showed no messages in cache.log after "DEAD Parent". on both development systems only a few (2-3) users were active.
any idea where to look?
Start with removing "no-query" from the cache_peer lines. One of the
main purposes of proxy queries is to determine UP/DEAD status. You can
also tune the connect-fail-limit= option on cache_peer to reduce the
number of failed requests before the peer is declared DEAD.
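
As a rough sketch of both suggestions applied to your peer lines (not a
tested config): the ICP port 3130 and the limit of 3 below are assumed
values for illustration only. Queries need a non-zero ICP port on the
cache_peer line, and the parents must actually be listening for and
answering ICP on that port.

```
# Sketch only. Assumptions: parents answer ICP on 3130; a limit of 3 is an example value.
# "no-query" is removed so peer queries can help determine UP/DEAD status.
# connect-fail-limit=N lowers the number of failed connects before a peer is declared DEAD.
cache_peer proxy-inter1 parent 8083 3130 sourcehash no-digest no-netdb-exchange connection-auth=off connect-fail-limit=3
cache_peer proxy-inter2 parent 8083 3130 sourcehash no-digest no-netdb-exchange connection-auth=off connect-fail-limit=3
cache_peer proxy-inter3 parent 8083 3130 sourcehash no-digest no-netdb-exchange connection-auth=off connect-fail-limit=3
cache_peer proxy-inter4 parent 8083 3130 sourcehash no-digest no-netdb-exchange connection-auth=off connect-fail-limit=3
```

If the parents cannot answer ICP, leave the port at 0 and tune only
connect-fail-limit.
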
FYI: the forwarding path algorithm in 3.2 has been altered a fair bit, in
a way which might account for the behaviour change. Namely, DNS is only
looked up once per available path, and re-tries are done sequentially
down the resulting set of IPs. 3.1 and older would do DNS lookups on
every re-try, so you would easily get the 10 failed connects in a few ms
while retrying a single request which never gets through. In 3.2 you
will get 10 *different* requests trying the peer over a slightly longer
time (better chance of detecting recovery from a short outage) and
getting serviced by a later path (hopefully more successful, and
definitely less lag on errors than before).
You are using sourcehash, which is an algorithm that only produces *1*
cache_peer as available for servicing a request. The behaviour change
above will result in only that peer's IPs being tested on a request
before other paths like DIRECT/DNS are tried. The hash will *not* be
re-calculated for the request which failed to reach the peer.
Amos