RE: Squid Performance (with Polygraph)

"Dave Raven" <dave@xxxxxxxxxxxx> · Fri, 9 Nov 2007 15:26:00 +0200

Hi all,
	Okay I managed to do a lot more testing at the office today. Firstly
some of the questions asked --

CPU Usage: CPU usage is around 30% during the test; when the unit begins
to fail it actually goes down a bit.

Mbufs/Clusters: All fine. They do rise quickly after the problem happens,
but that is because established network connections are still coming in at
600 a second while only being satisfied at a rate of, say, 200 a second. The
send queues then get big and mbuf usage goes up. This is not the cause of
the failure though; it's a side effect. For the first x minutes usage is
between 250 and 3000 mbufs (and clusters), and my max is 65k/32k.
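
As a rough illustration of the side effect described above: with connections
arriving at about 600 a second and only about 200 a second completing, the
backlog grows by roughly 400 connections a second. The Python sketch below
just does that arithmetic. The function names and the 32768 limit are
hypothetical (no socket limit is stated in this message), the starting value
of 2000 is the steady-state ESTABLISHED count reported in step 2 below, and
both rates are assumed constant, which they will not be in practice.

```python
def backlog_after(seconds, arrival_rate=600, service_rate=200, initial=0):
    """Outstanding connections after `seconds`, assuming constant rates."""
    growth = max(arrival_rate - service_rate, 0)  # net new connections per second
    return initial + growth * seconds


def seconds_until(limit, arrival_rate=600, service_rate=200, initial=0):
    """Seconds until the backlog reaches `limit`, or None if it never grows."""
    growth = arrival_rate - service_rate
    if growth <= 0:
        return None
    return max(limit - initial, 0) / growth


if __name__ == "__main__":
    # 2000 is the steady-state ESTABLISHED count from the test description.
    for t in (1, 10, 60):
        print(t, "s ->", backlog_after(t, initial=2000), "connections")
    # 32768 is a made-up limit, used only to show the shape of the calculation.
    print(seconds_until(32768, initial=2000), "s to reach the hypothetical limit")
```

Under these assumptions any fixed socket or mbuf limit is reached within a
minute or two of the slowdown starting, which fits the observation that the
resource exhaustion follows the slowdown rather than causing it.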

As for system logs there are none - there is nothing suspicious anywhere
until the side effects kick in, e.g. mbufs running out etc. Squid also logs
nothing at all. I've also checked if I'm using too much memory and that's
not the case - swap is not used at all during the entire test. 

This is the process of what happens --

1. PolyClt + PolySrv begin, 800 RPS. 

2. ESTABLISHED netstat connections are around 2000 once 800RPS is reached
(about 20 seconds). CPU load is 30%, mbufs are available etc.

3. Once memory becomes full (quickly) disk drive usage begins. squid -z
puts the TPS per drive at well over 1000/s when I run it, but when the cache
is doing 800 RPS the TPS is only about 30 per drive (low..).

4. After a period of time (almost always the same, +/- 60 seconds, for a
given RPS) the ESTABLISHED connections start rising; at the exact same time
PolyClt starts showing less RPS. This is the "slow down" as such.

5. Because of this, polyclt continues to send requests which the unit
continues to accept - quickly all available sockets are used, and the unit
will then crash.
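
The rise in ESTABLISHED connections in steps 2 and 4 can be tracked by
counting TCP states in netstat output at regular intervals. The Python sketch
below is one way to do that, not what was used in these tests. It assumes the
usual `netstat -an` layout in which TCP rows start with "tcp" and end with the
state; the sample text and its addresses are made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connection states in text shaped like `netstat -an` output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: proto recv-q send-q local foreign state
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states


# Made-up sample; real output would come from running netstat on the proxy.
SAMPLE = """\
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  10.0.0.1.3128          10.0.0.2.40001         ESTABLISHED
tcp4       0   8192  10.0.0.1.3128          10.0.0.2.40002         ESTABLISHED
tcp4       0      0  10.0.0.1.3128          10.0.0.2.40003         TIME_WAIT
tcp4       0      0  *.3128                 *.*                    LISTEN
udp4       0      0  *.514                  *.*
"""

if __name__ == "__main__":
    for state, count in sorted(count_tcp_states(SAMPLE).items()):
        print(state, count)
```

Run once a second against live output (for example the stdout of
`subprocess.run(["netstat", "-an"], capture_output=True, text=True)`) and
logged with a timestamp, the ESTABLISHED count gives a precise time for the
start of the slowdown that can be lined up against the other graphs.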

Interestingly enough though - if I stop the polyclt when this happens and
restart it - in under 10 seconds - it continues on for another x minutes
without problem. If I leave it running the unit never recovers.

I have used "systat -vmstat 1", "systat -tcp 1", "systat -iostat 1" and all
the stats from Munin, and an MRTG graphing config for squid and they all show
nothing of interest. The only result that changes between working time and
slow down is that the connections go through the roof as explained above...

I have also seen it fail at 300RPS, but only after 82 minutes - which seems
like a very long time if it was going to fail because of disk load. The
entire time the disks are very underloaded. That said, if I use a null cache
directory this doesn't happen....

I know that sounds like it's clearly the drives - but 82 minutes ??
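
One way to look at the timing question is to convert each run into the total
number of requests handled before the slowdown. The Python sketch below does
this for the four figures reported in this thread (300 RPS here, the other
three in the quoted message further down). It assumes each run held its
nominal rate until the slowdown, which ignores ramp-up, and the "~80 minutes"
figure is taken as exactly 80.

```python
# (requests per second, minutes until slowdown) as reported in the thread
RUNS = [(300, 82), (400, 80), (600, 15), (800, 11)]


def total_requests(rps, minutes):
    """Requests handled before the slowdown, assuming a constant rate."""
    return rps * minutes * 60


if __name__ == "__main__":
    for rps, minutes in RUNS:
        total = total_requests(rps, minutes)
        print(f"{rps} RPS x {minutes} min = {total:,} requests")
```

The 600 and 800 RPS runs come out close together (540,000 and 528,000
requests), while the 300 and 400 RPS runs handle roughly three times as many
(1,476,000 and 1,920,000) before slowing down. That does not identify a cause,
but it suggests a fixed request count alone does not explain the timing.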

Thanks for all the help
Dave

-----Original Message-----
From: Adrian Chadd [mailto:adrian@xxxxxxxxxxxxxxx] 
Sent: Friday, November 09, 2007 11:55 AM
To: Dave Raven
Cc: 'Adrian Chadd'; squid-users@xxxxxxxxxxxxxxx
Subject: Re:  Squid Performance (with Polygraph)

Check netstat -mb and see if you're running out of mbufs?
You haven't mentioned whether the CPU is being pegged at this point?

Adrian

On Fri, Nov 09, 2007, Dave Raven wrote:
> Hi all,
> 	Okay I've done some of what you requested, and unfortunately failed
> to find anything specific. I can pretty much guarantee the times at which
> the requests will slow down now. 600 RPS = 15 minutes, 800 RPS = 11 minutes,
> 400 RPS = ~80 minutes.
> 
> During that time (before and during the problem) systat -vmstat 1 shows the
> same interrupts - about 4000 on em1 (ifac) and 250 on hptmv0 - my controller
> for the SATA drives.
> 
> If I use systat -iostat 1 I can see that none of the drives are 100%
> utilized at any time during the test. systat -tcp 1 also doesn't show me
> anything out of the ordinary. I have set up munin to monitor the host but
> unfortunately it's not showing much.
> 
> The other difficulty is that once the slowdown begins, it starts filling up
> network connections - once it fills all the available ports nothing can
> monitor it :/
> 
> I'm going to try using a different network card, then a different motherboard
> etc - try some different setups today. Thanks again for all the help and
> please let me know if anyone has any ideas...
> 
> Thanks
> Dave
> 
> -----Original Message-----
> From: Adrian Chadd [mailto:adrian@xxxxxxxxxxxxxxx] 
> Sent: Friday, November 09, 2007 4:08 AM
> To: Dave Raven
> Cc: squid-users@xxxxxxxxxxxxxxx
> Subject: Re:  Squid Performance (with Polygraph)
> 
> On Thu, Nov 08, 2007, Dave Raven wrote:
> > Hi Adrian,
> >  What would cause it to fail after a specific time though - if the cache_mem
> > is already full and it's using the drives? I would have thought it would fail
> > immediately?
> > 
> > Also there are no log messages about failures or anything...
> 
> Who knows :) it's hard without having remote access, or lots of logging/
> statistics to correlate the trouble times with.
> 
> Try installing munin and graph all the system-specific stuff. See what
> correlates against the failure time. You might notice something, like
> out of memory/paging, or an increase in interrupts, or something. ;)
> 
> That's all I can offer at the present time, sorry.
> 
> 
> 
> Adrian
> 
> -- 
> - Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -
> - $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -