> Hi,
>
> I've noticed a behavior in CARP failover (on 2.7) that I was wondering
> if someone could explain.
>
> In my test environment, I have a non-caching squid configured with
> multiple CARP parent caches - two servers, three per box (listening on
> ports 1080/1081/1082, respectively) for a total of six servers.
>
> When I fail a squid instance and immediately afterwards run GETs to
> URLs that were previously directed to that instance, I notice that the
> request goes to a different squid, as expected, and I see the
> following in the log for each request:
>
> May 6 11:43:28 cdce-den002-001 squid[1557]: TCP connection to
> http-cache-1c.den002 (http-cache-1c.den002:1082) failed
>
> And I notice that the request is being forwarded to a different, but
> consistent, parent.
>
> After ten of the above requests, I see this:
>
> May 6 11:43:41 cdce-den002-001.den002 squid[1557]: Detected DEAD
> Parent: http-cache-1c.den002
>
> So, I'm presuming that after ten failed requests, the peer is
> considered DEAD. So far, so good.
>
> The problem is this: During my test GETs, I noticed that immediately
> after the "Detected DEAD Parent" message was generated, the parent
> server that the request was being forwarded to changed - as if there's
> an "interim" decision made until the peer is officially declared DEAD,
> and then another hash decision made afterwards. So while consistent
> afterwards, it's apparent that during the failover, the parent server
> for the test URL changed twice, not once.
>
> Can someone explain this behavior?

Do you have 'default' set on any of the parents? It is entirely possible
that multiple paths are selected as usable and only the first is taken.

During the period between death and detection the dead peer will still
be attempted first, but failover then sends the request to another
location. When death is detected the hashes are actually re-calculated.

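For anyone following along, the two-step behaviour described above can be sketched roughly as below. This is only a toy model under stated assumptions, not Squid's CARP code: MD5 stands in for the real CARP hash, all peers have equal weight, the five peer names other than http-cache-1c.den002 are made up, and the interim failover is assumed to pick the first reachable peer in configuration order.

```python
import hashlib

# Hypothetical peer names modelled on the six parents in the report;
# only http-cache-1c.den002 actually appears in the quoted logs.
PEERS = [f"http-cache-{box}{inst}.den002" for box in "12" for inst in "abc"]


def score(url, peer):
    """Combined URL+peer score; MD5 is a stand-in for CARP's real hash."""
    digest = hashlib.md5(f"{peer}|{url}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def hash_pick(url, peers):
    """Highest-scoring peer among those currently in the hash."""
    return max(peers, key=lambda p: score(url, p))


def interim_pick(url, peers, down):
    """Between death and detection: the hash still names the dead peer,
    the connection fails, and the request falls through to another usable
    path (ASSUMED: the first reachable peer in configuration order)."""
    choice = hash_pick(url, peers)
    if choice not in down:
        return choice
    return next(p for p in peers if p not in down)


def detected_pick(url, peers, down):
    """After 'Detected DEAD Parent': hash recomputed over live peers only."""
    return hash_pick(url, [p for p in peers if p not in down])


def count_double_changes(urls, peers, dead):
    """Count URLs owned by `dead`, and how many of those get a different
    parent before detection than after it."""
    owned = [u for u in urls if hash_pick(u, peers) == dead]
    twice = [u for u in owned
             if interim_pick(u, peers, {dead}) != detected_pick(u, peers, {dead})]
    return len(owned), len(twice)


if __name__ == "__main__":
    urls = [f"http://example.com/object/{i}" for i in range(600)]
    owned, twice = count_double_changes(urls, PEERS, "http-cache-1c.den002")
    print(f"{owned} URLs hashed to the dead peer; {twice} changed parent twice")
```

In this model a URL changes parent twice whenever the interim fallback peer is not the one the recomputed hash picks, which would match the reported symptom if the real failover path is chosen independently of the CARP ranking.
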
If anyone wants a task, it may be useful to see whether leaving dead
peers in the existing hash and skipping them at selection time, instead
of at connection time, is more responsive while reducing the
double-change.

Amos