Re: Squid Performance (with Polygraph)

Marcello Romani <mromani@xxxxxxxxxxxxxxx> · Fri, 09 Nov 2007 17:45:27 +0100

Dave Raven wrote:
Hi Adrian,

	It works for the full 4 hours with a null cache directory. How would
I see any kind of stats/information on disk IO? From the stats I can see so
far, the disk stats don't change at all when it fails ...

I'm currently using COSS, but I've also tried this with ufs and diskd (with
the same results, just different times that it fails after).

Thanks
Dave

-----Original Message-----
From: Adrian Chadd [mailto:adrian@xxxxxxxxxxxxxxx] 
Sent: Friday, November 09, 2007 3:35 PM
To: Dave Raven
Cc: squid-users@xxxxxxxxxxxxxxx
Subject: Re:  Squid Performance (with Polygraph)

Rightio; this reads like you're running out of disk IO.
Try running the test with a null cache dir and make sure the box can handle
that load.

Squid unfortunately has crap disk IO code for what's available these days.

Adrian

On Fri, Nov 09, 2007, Dave Raven wrote:
Hi all,
	Okay I managed to do a lot more testing at the office today. Firstly
some of the questions asked --

CPU Usage: The CPU usage is around 30% during the test; when the unit
begins to fail it actually goes down a bit.

Mbufs/Clusters: All fine - they do rise quickly after the problem happens,
but this is because the established network connections are still coming in
at 600 a second, but only being satisfied at a rate of, say, 200 a second.
The send queues then get big, and mbuf usage goes up - this is not the cause
of the failure though, it's a side effect. For the first x minutes it's
between 250 and 3000 mbufs (and clusters) used, and my max is 65k/32k.
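
A back-of-envelope sketch of that side effect, assuming the rates quoted
above (600 connections a second arriving, about 200 a second being
satisfied). The function names and the limit in the usage note are
hypothetical and only for illustration:

```python
def backlog_after(seconds, arrival_rate=600, service_rate=200):
    """Connections left waiting after `seconds` of sustained overload."""
    growth = max(0, arrival_rate - service_rate)  # net new backlog per second
    return growth * seconds


def seconds_until(limit, arrival_rate=600, service_rate=200):
    """Seconds until the backlog reaches `limit`, or None if it never grows."""
    growth = arrival_rate - service_rate
    if growth <= 0:
        return None
    return limit / growth


if __name__ == "__main__":
    print(backlog_after(10))     # backlog after ten seconds of overload
    print(seconds_until(20000))  # 20000 is an illustrative limit, not measured
```

With these rates the backlog grows by 400 connections a second, so any
fixed pool (sockets, mbufs) would be used up within a minute or two of the
slow down starting, whatever the exact limit is.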

As for system logs, there are none - there is nothing suspicious anywhere
until the side effects kick in, e.g. mbufs running out etc. Squid also logs
nothing at all. I've also checked if I'm using too much memory and that's
not the case - swap is not used at all during the entire test.

This is the process of what happens --

1. PolyClt + PolySrv begin, 800 RPS. 

2. ESTABLISHED netstat connections are around 2000 once 800RPS is reached
(about 20 seconds). CPU load is 30%, mbufs are available etc.

3. Once memory becomes full (quickly) disk drive usage begins - squid -z
puts the TPS per drive at well over 1000/s when I run it, but when the cache
is doing 800 RPS the TPS is only about 30 per drive (low..).

4. After a period of time (almost always the same, +/- 60 seconds,
depending on RPS) the ESTABLISHED connections start rising; at the exact
same time the PolyClt starts showing less RPS. This is the "slow down" as
such.

5. Because of this, polyclt continues to send requests which the unit
continues to accept - quickly all available sockets are used, and the unit
will then crash.
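
To pin down the exact moment in step 4, the ESTABLISHED count could be
logged once a second to a file, so the record survives after the unit stops
responding. This is only a minimal sketch: it assumes `netstat -an` prints
one TCP connection per line with the state in the last column, and the
function names and log path are made up for illustration.

```python
import subprocess
import time


def count_established(netstat_output):
    """Count TCP lines whose last column is ESTABLISHED."""
    return sum(
        1
        for line in netstat_output.splitlines()
        if line.startswith("tcp") and line.split()[-1] == "ESTABLISHED"
    )


def sample(logfile, interval=1.0):
    """Append '<unix time> <established count>' to logfile until interrupted."""
    with open(logfile, "a") as log:
        while True:
            out = subprocess.run(
                ["netstat", "-an"], capture_output=True, text=True
            ).stdout
            log.write(f"{time.time():.0f} {count_established(out)}\n")
            log.flush()  # keep the log usable even if the box falls over
            time.sleep(interval)
```

Running `sample("/var/tmp/established.log")` (a hypothetical path) from a
console session for the whole test would give a timeline to line up
against the PolyClt output.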

Interestingly enough though - if I stop the polyclt when this happens and
restart it - in under 10 seconds - it continues on for another x minutes
without problem. If I leave it running the unit never comes right.

I have used "systat -vmstat 1", "systat -tcp 1", "systat -iostat 1" and all
the stats from Munin, and an MRTG graphing config for squid, and they all
show nothing of interest. The only result that changes between working time
and slow down is that the connections go through the roof as explained
above... I have also seen it fail at 300RPS, but only after 82 minutes -
which seems like a very long time if it was going to fail because of disk
load. The entire time the disks are very underloaded. That said, if I use a
null cache directory this doesn't happen....

I know that sounds like it's clearly the drives - but 82 minutes ??

Thanks for all the help
Dave

-----Original Message-----
From: Adrian Chadd [mailto:adrian@xxxxxxxxxxxxxxx] 
Sent: Friday, November 09, 2007 11:55 AM
To: Dave Raven
Cc: 'Adrian Chadd'; squid-users@xxxxxxxxxxxxxxx
Subject: Re:  Squid Performance (with Polygraph)

Check netstat -mb and see if you're running out of mbufs?
You haven't mentioned whether the CPU is being pegged at this point?

Adrian

On Fri, Nov 09, 2007, Dave Raven wrote:
Hi all,
	Okay I've done some of what you requested, and unfortunately failed
to find anything specific. I can pretty much guarantee the times at which
the requests will slow down now. 600 RPS = 15 minutes, 800 RPS = 11 minutes,
400 RPS = ~80 minutes.
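
One quick way to look at these figures is to turn each run into a total
number of requests handled before the slow down. A small sketch, using only
the rates and times quoted above (the 80 minutes is approximate, and the
function name is just for illustration):

```python
# (requests per second, minutes until the slow down) as reported above
RUNS = [(800, 11), (600, 15), (400, 80)]


def requests_before_slowdown(rps, minutes):
    """Total requests issued before the slow down at a constant rate."""
    return rps * minutes * 60


if __name__ == "__main__":
    for rps, minutes in RUNS:
        total = requests_before_slowdown(rps, minutes)
        print(f"{rps} RPS x {minutes} min = {total:,} requests")
```

The 800 and 600 RPS runs come out at 528,000 and 540,000 requests, within a
few percent of each other, which might point at something filling up after a
fixed amount of work. The 400 RPS run (about 1,920,000) does not fit that
pattern, so this is a hint at best.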

During that time (before and during the problem) systat -vmstat 1 shows the
same interrupts - about 4000 on em1 (ifac) and 250 on hptmv0 - my controller
for the SATA drives.

If I use systat -iostat 1 I can see that none of the drives are 100%
utilized at any time during the test. systat -tcp 1 also doesn't show me
anything out of the ordinary. I have set up munin to monitor the host but
unfortunately it's not showing much.

Another problem is that when the slow down begins, it starts filling up
network connections - once it fills all the available ports nothing can
monitor it :/

I'm going to try using a different network card, then a different
motherboard etc - try some different setups today. Thanks again for all the
help and please let me know if anyone has any ideas...

Thanks
Dave

-----Original Message-----
From: Adrian Chadd [mailto:adrian@xxxxxxxxxxxxxxx] 
Sent: Friday, November 09, 2007 4:08 AM
To: Dave Raven
Cc: squid-users@xxxxxxxxxxxxxxx
Subject: Re:  Squid Performance (with Polygraph)

On Thu, Nov 08, 2007, Dave Raven wrote:
Hi Adrian,
 What would cause it to fail after a specific time though - if the
cache_mem is already full and it's using the drives? I would have thought it
would fail immediately?

Also there are no log messages about failures or anything...

Who knows :) It's hard without having remote access, or lots of logging/
statistics to correlate the trouble times with.

Try installing munin and graph all the system-specific stuff. See what
correlates against the failure time. You might notice something, like
out of memory/paging, or an increase in interrupts, or something. ;)

That's all I can offer at the present time, sorry.

Adrian

--
- Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -
- $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -

I'll spend my remaining 1 cent suggesting a full dump of SMART
parameters before and after each test. Maybe a clue will come out of
looking at how the SMART counters vary... :-)
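
Something along these lines could do the comparison. It is only a sketch:
it assumes the dumps are saved from `smartctl -A <device>` before and after
a run, and that the attribute table has the usual ten columns with the raw
value last. The function names are mine, not part of any tool.

```python
def parse_smart(text):
    """Return {attribute_name: raw_value} from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attrs


def diff_smart(before, after):
    """Return {name: (old, new)} for attributes whose raw value changed."""
    old, new = parse_smart(before), parse_smart(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }
```

Feeding it the two saved dumps for each drive would show, for example,
whether CRC error or reallocated sector counts move during a failing run.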

HTH

--
Marcello Romani
IT Manager
Ottotecnica s.r.l.
http://www.ottotecnica.com