On 2014-02-21 06:10, Simon Beale wrote:
> I've got a problem at the moment with our general Squid proxies where
> requests that should take milliseconds occasionally take far longer
> (i.e. 5+ seconds, or a timeout).
>
> This is most common on our proxies doing 100 reqs/sec, but it happens
> overnight too when they're running at 10 reqs/sec. I'm seeing this with
> both v3.4.2 and with a box I've downgraded back to v3.1.10. For v3.4.2,
> it happens in both multiple-worker and single-worker modes.
What sort of CPU loading do you have at ~100 req/sec?
Is that at or near your local installation's req/sec capacity?
NP:
* Slow-down at peak capacity is normal, as the proxy is busy servicing
other traffic.
* Slow-down at only a few req/sec is normal, as Squid spends a lot of
its time in artificial I/O wait delays to avoid reading/writing
individual bytes off the network. Nothing is worse for the network than
~71 bytes of packet overhead for every 2 bytes of data transferred.
* Slow-down randomly all the time could be network congestion, window
scaling, ECN or MTU related. Even ICMP related (ICMP is *not* optional,
though many admins block it).
* Then there are bugs.
 - 3.1 had a few IPv6 bugs (some major) which caused TCP retry delays in
certain circumstances. Since you are seeing it only randomly, I would
suspect remote network(s) somewhere with those issues being a transit
hop occasionally. Though this is unlikely given 3.4 still shows it.
 - There is a fix in the 3.4.3 release regarding connection IP failover
that may help if that is part of the issue (or it may not).
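When hunting for which requests are actually affected, Squid's own
access.log can be mined. A minimal sketch, assuming the default native
log format (where the second field is the request duration in
milliseconds and the seventh is the URL); the log path in the usage
comment is an assumption:

```python
# Sketch: scan a Squid access.log in the default native format, where
# field 2 is the request duration in milliseconds and field 7 is the
# URL, and report requests at or above a threshold. The 5000 ms default
# matches the "5+ seconds" symptom described above.

def slow_requests(lines, threshold_ms=5000):
    """Yield (duration_ms, url) for access.log lines at/above threshold_ms."""
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        try:
            duration_ms = int(fields[1])
        except ValueError:
            continue  # second field wasn't a duration; not native format
        if duration_ms >= threshold_ms:
            yield duration_ms, fields[6]

# Usage (the path is an assumption):
#   with open("/var/log/squid/access.log") as f:
#       for ms, url in slow_requests(f):
#           print(ms, "ms", url)
```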
> The test is not reproducible, sadly, but I've got a cronjob running on
> localhost on these boxes testing access times to various URLs covering:
> HTTPS, non-HTTPS static content, using IP not hostname over both HTTP
> and HTTPS, and a URL on the same VLAN as the proxies. All of these test
> cases have it happen occasionally, but not repeatedly/reliably.
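A probe of the sort described could be sketched like this (a minimal
sketch only; the proxy address, test URL and 5-second threshold are
assumptions, not details from Simon's setup):

```python
# Sketch of a cronjob-style probe: fetch a URL through the proxy and
# flag responses slower than a threshold. PROXY and TEST_URL below are
# placeholders, not real infrastructure.
import time
import urllib.request

PROXY = "http://127.0.0.1:3128"   # assumed local Squid listener
TEST_URL = "http://example.com/"  # placeholder test URL

def classify(elapsed_s, threshold_s=5.0):
    """Label one timing result, matching the '5+ seconds' symptom."""
    return "SLOW" if elapsed_s >= threshold_s else "ok"

def probe(url, proxy, timeout=30.0):
    """Return wall-clock seconds for one request made via the proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    start = time.monotonic()
    with opener.open(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start

# Usage:
#   elapsed = probe(TEST_URL, PROXY)
#   print(classify(elapsed), f"{elapsed:.3f}s", TEST_URL)
```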
Some ideas:
* DNS lookup delays?
* Random TCP connection setup delays?
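To tell those two ideas apart, the lookup and the connect can be timed
separately. A minimal sketch using only the standard socket module; the
target host in the usage comment is a placeholder:

```python
# Sketch: time the DNS lookup and the TCP connect separately, so a slow
# probe can be blamed on one or the other.
import socket
import time

def dns_and_connect_times(host, port=80, timeout=10.0):
    """Return (dns_seconds, connect_seconds) for one probe of host:port."""
    t0 = time.monotonic()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    dns_s = time.monotonic() - t0

    # Connect to the first resolved address only, to keep the two
    # measurements cleanly separated.
    family, socktype, proto, _, addr = infos[0]
    t1 = time.monotonic()
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        s.connect(addr)
    connect_s = time.monotonic() - t1
    return dns_s, connect_s

# Usage (the host is an assumption):
#   dns_s, conn_s = dns_and_connect_times("www.example.com")
```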
> Different boxes are either running Trend's IWSVA as a cache_peer for
> its antivirus, or C-ICAP/clamd as an ICAP service. Both setups have it
> happen (as does the case where I disabled the antivirus).
* Object size related? i.e. scanning time in the AV.
> The servers are all running CentOS 6.4 on HP Gen8 blades with 48G RAM.
>
> Has anyone seen anything like this, or got any suggestions as to what
> might be causing it that I can investigate further?
>
> Simon
Lots of people see it for all sorts of reasons.
Amos