On 6/11/2011 9:15 p.m., Justin Lawler wrote:
Hi,
We're running squid 3.1.16 on solaris on a sparc box.
We're running it against an ICAP server, and were testing some scenarios when ICAP server went down, how squid would handle it. After freezing the ICAP server, squid seemed to have big problems.
For reference, the expected behaviour is this:
** if squid is configured to allow bypass of the ICAP service
--> No noticeable problems. Possibly faster response time for clients.
** if squid is configured not to bypass on failures (i.e. a critical ICAP
service)
--> New connections continue to be accepted.
--> All traffic needing ICAP halts waiting for recovery. RAM and FD
consumption rise until available resources are full.
--> On ICAP recovery the traffic being held gets sent to it and
service resumes as the results come back.
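As a rough illustration of the two cases above, a squid.conf fragment might look like the following. The service name, address and port are placeholders, and the option syntax is written from memory of the 3.1 format, so check it against the squid.conf.documented shipped with your build.

```
icap_enable on

# Case 1: optional service. If the ICAP server fails, Squid skips it.
icap_service svc_req reqmod_precache bypass=on icap://127.0.0.1:1344/reqmod

# Case 2: critical service. Use this line instead of the one above.
# Traffic needing ICAP then waits while the server is down.
# icap_service svc_req reqmod_precache bypass=off icap://127.0.0.1:1344/reqmod

adaptation_access svc_req allow all
```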
Once it was back up again, it kept on sending OPTION requests to the server - but squid itself became completely unresponsive. It wouldn't accept any further requests, you couldn't use squidclient against it or do a squid reconfigure, and it was not responding to 'squid -k shutdown', so it had to be manually killed with a 'kill -9'.
This description is not very clear. You seem to use "it" to refer to
several different things in the first sentence of paragraph 2.
Apparently:
* "it" comes back up again ... apparently referring to ICAP?
* "it" sends OPTION requests ... apparently referring to Squid now? Or
to some unmentioned backend part of the ICAP service?
* squid itself is unresponsive ... waiting for queued requests to get
through ICAP and the network fetch stages, perhaps? Note that ICAP may
be slowed as it faces the spike of waiting traffic from Squid.
We then restarted the squid instance, and it started to go crazy, file descriptors reaching the limit (4096 - previously it never went above 1k during long
"kill -9" causes Squid to terminate before saving the cache index or
closing the journal properly. Thus on restart the journal is discovered
corrupt and a "DIRTY" rebuild is begun, scanning the entire disk cache
object by object to rebuild the index and journal contents. This can
consume a lot of FD, for a period of time proportional to the size of
your disk cache(s).
Also, clients can hit Squid with a lot of connections that accumulated
during the outage. Each of these has to be processed in full, including
all lookups and tests, immediately. This startup spike is normal
right after a start/restart or reconfigure, when all the active
running state is erased and requires rebuilding.
The lag problems and resource/queue overloads can be expected to drop
away relatively quickly as the normal running state gets rebuilt from
the new traffic. The FD consumption from cache scan will disappear
abruptly when that process completes.
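To watch whether FD consumption really does fall away after the rebuild, one option is to poll the cache manager info report and pull out the descriptor counters. Below is a minimal Python sketch; the function names are made up for illustration, and the label text it matches is an assumption about what 'squidclient mgr:info' prints on your build, so adjust the patterns if your output differs.

```python
import re

# Assumed label text from the "File descriptor usage" part of mgr:info.
FD_PATTERNS = {
    "max": r"Maximum number of file descriptors:\s*(\d+)",
    "in_use": r"Number of file desc currently in use:\s*(\d+)",
}


def parse_fd_usage(info_text):
    """Return the FD counters found in mgr:info style text as a dict."""
    usage = {}
    for key, pattern in FD_PATTERNS.items():
        match = re.search(pattern, info_text)
        if match:
            usage[key] = int(match.group(1))
    return usage


def fd_pressure(usage):
    """Fraction of the FD limit in use, or None if a counter is missing."""
    if "max" not in usage or "in_use" not in usage or usage["max"] == 0:
        return None
    return usage["in_use"] / usage["max"]
```

Feed it the saved text of a 'squidclient mgr:info' run every minute or so. A ratio that stays high long after the cache rebuild has finished would point at something other than the DIRTY scan.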
stability test runs), and a load of 'Queue Congestion' errors in the logs. Tried to restart it again, and it seemed to behave better then, but still the number of file descriptors is very big (above 3k).
Any particular queue mentioned?
Amos