Re: Sibling Problem

Ding Deng <ding.deng@xxxxxxxxx> · Thu, 23 Aug 2007 16:40:15 +0800

Hi Andre, hi all:

Let's have another try ;-)

Suppose we have three Squid boxes in a cluster (let's call them A, B and
C, respectively), all configured to talk to each other through ICP. Here
is the problem we met:

A client sent an HTTP request to A. A did not have the corresponding
object in its local cache, so it queried B and C through ICP. Sibling B
replied with a UDP_MISS, which is normal behavior. What confused us
was what machine C logged:

    1186606560.930 0 IP_OF_MACHINE_A UDP_HIT/000 115 ICP_QUERY
    http://www.example.com/dynamic.js - NONE/- -

    1186606560.932 0 IP_OF_MACHINE_A TCP_MISS/504 1949 GET
    http://www.example.com/dynamic.js - NONE/- text/html

What followed the UDP_HIT was a TCP_MISS/504, which means that machine C
had that object in its local cache, but A failed to fetch it due to some
weird timeout error.
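
For anyone who wants to look for the same pattern in their own logs,
here is a minimal Python sketch. It assumes Squid's native access.log
format (time, elapsed, client, code/status, bytes, method, URL, ...);
the names Entry, parse_access_line and hits_followed_by_504 are made up
for illustration, and the 1-second window is an arbitrary choice.

```python
from collections import namedtuple

# First seven columns of Squid's native access.log format (assumed layout).
Entry = namedtuple("Entry", "ts elapsed client result status size method url")

def parse_access_line(line):
    """Parse one native-format access.log line into an Entry."""
    f = line.split()
    result, status = f[3].split("/")
    return Entry(float(f[0]), int(f[1]), f[2], result, status,
                 int(f[4]), f[5], f[6])

def hits_followed_by_504(lines, window=1.0):
    """Return (hit, miss) pairs where a UDP_HIT is followed within
    `window` seconds by a TCP_MISS/504 from the same peer for the same URL."""
    pending = {}
    pairs = []
    for e in map(parse_access_line, lines):
        key = (e.client, e.url)
        if e.result == "UDP_HIT":
            pending[key] = e
        elif e.result == "TCP_MISS" and e.status == "504" and key in pending:
            hit = pending.pop(key)
            if e.ts - hit.ts <= window:
                pairs.append((hit, e))
    return pairs

# The two entries from machine C quoted above, each joined onto one line.
LOG_C = [
    "1186606560.930 0 IP_OF_MACHINE_A UDP_HIT/000 115 ICP_QUERY "
    "http://www.example.com/dynamic.js - NONE/- -",
    "1186606560.932 0 IP_OF_MACHINE_A TCP_MISS/504 1949 GET "
    "http://www.example.com/dynamic.js - NONE/- text/html",
]

for hit, miss in hits_followed_by_504(LOG_C):
    print(f"{miss.url}: TCP_MISS/504 right after UDP_HIT to {miss.client}")
```

Run against a full access.log, this would show whether the 504 follows
every UDP_HIT or only some of them.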

I'm not sure where this 504 came from, and I don't think it's a
configuration problem, because it was logged just 2 ms after the
corresponding UDP_HIT message, and I have never set any timeout-related
value anywhere near that low.

Then machine C released the object (the 504 error message instead of the
expected content?) from memory:

      1186606560.932 RELEASE -1 FFFFFFFF
      381F892DF3928A903A3DF921D2FF27A9 504 1186606560 0 1186606560
      text/html 1650/1896 GET http://www.example.com/dynamic.js

Below are the corresponding logs from machine A:

access.log:

        1186606561.024 93 IP_OF_CLIENT_MACHINE TCP_MISS/200 10939 GET
        http://www.example.com/dynamic.js - DIRECT/IP_OF_BACKEND_SERVER
        application/x-javascript

store.log:

        1186606561.024 RELEASE -1 FFFFFFFF
        9771AFBBB9036CA86486A7DE01F33538 200 1186606560 -1 1186649760
        application/x-javascript -1/10675 GET
        http://www.example.com/dynamic.js

This means machine A fetched the object from the backend server, served
it to the requesting client, and then released it from memory
*immediately*.
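
To make the two RELEASE entries easier to compare, here is a rough
Python sketch that splits a store.log line into named fields. The field
order (time, action, dir number, file number, hash, status, Date,
Last-Modified, Expires, type, sizes, method, URL) is an assumption
inferred from the sample lines above; StoreEntry, parse_store_line and
freshness_window are hypothetical names.

```python
from collections import namedtuple

# Assumed store.log field order, inferred from the sample entries.
StoreEntry = namedtuple(
    "StoreEntry",
    "ts action dir_no file_no key_hash status date lastmod expires "
    "ctype sizes method url",
)

def parse_store_line(line):
    """Parse one store.log line (already joined onto a single line)."""
    f = line.split()
    return StoreEntry(float(f[0]), f[1], int(f[2]), f[3], f[4], int(f[5]),
                      int(f[6]), int(f[7]), int(f[8]), f[9], f[10],
                      f[11], f[12])

def freshness_window(entry):
    """Seconds between the logged Date and Expires values,
    or None if either is missing (logged as -1)."""
    if entry.date < 0 or entry.expires < 0:
        return None
    return entry.expires - entry.date

# RELEASE entry from machine C (the 504 reply).
STORE_C = ("1186606560.932 RELEASE -1 FFFFFFFF "
           "381F892DF3928A903A3DF921D2FF27A9 504 1186606560 0 1186606560 "
           "text/html 1650/1896 GET http://www.example.com/dynamic.js")

# RELEASE entry from machine A (the 200 reply fetched from the backend).
STORE_A = ("1186606561.024 RELEASE -1 FFFFFFFF "
           "9771AFBBB9036CA86486A7DE01F33538 200 1186606560 -1 1186649760 "
           "application/x-javascript -1/10675 GET "
           "http://www.example.com/dynamic.js")

for name, line in (("C", STORE_C), ("A", STORE_A)):
    e = parse_store_line(line)
    print(name, e.status, e.ctype, freshness_window(e))
```

If the assumed field order is right, the 504 object on C has Expires
equal to Date (a window of 0 seconds), while the 200 object on A carries
a window of 43200 seconds (12 hours), yet both were released.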

Squid-2.5.STABLE14[1] on Linux 2.6.18-4-amd64; A, B and C are all
connected to the same switch, so there is little chance for that to be a
network problem.

Timeout related settings:

        icp_query_timeout 50
        maximum_icp_query_timeout 50

        forward_timeout 4 minutes
        connect_timeout 1 minute
        peer_connect_timeout 30 seconds
        read_timeout 15 minutes
        request_timeout 5 minutes
        persistent_request_timeout 1 minute
        pconn_timeout 120 seconds

Does anyone have a clue? Thanks very much!

- Ding Deng

[1] Yes, we know that we should try v2.6 first and see if the problem
still occurs, but it's difficult to do that in a production environment
(you know that, right? ;-), and our boss is way harder to persuade than
you may imagine ;-(

"andre wang" <andre.ease@xxxxxxxxx> writes:

> HI ALL:
>
>   We are running Squid 2.5.STABLE14 on Linux machines, trying to run a
> cluster of caches in a sibling peering arrangement using multicast
> for ICP queries. The caches seem to be talking to each other fine.
>
> When a client sends an HTTP request for an object that isn't cached on
> the configured cache, the cache sends out an ICP multicast query. All
> other caches receive this fine and respond with either UDP_MISS or
> UDP_HIT. The problem is, if the other caches respond with a UDP_HIT,
> the original cache still fetches the object directly, rather than
> fetching the object from the sibling. Why?
>
> I have checked the access.log on both caches and found these entries:
>
> On the first cache (172.19.0.229) 1187773057.113 3 222.220.132.48
> TCP_MISS/200 315 GET http://XXXXX - DIRECT/XXXX
>
> On the sibling cache (172.19.0.228) 1187773057.002 0 172.19.0.229
> UDP_HIT/000 108 ICP_QUERY http://XXXXXX - NONE/- -
>
> Any idea?
> Thanks