Hi all,

Okay, I managed to do a lot more testing at the office today. First,
the questions that were asked:

CPU usage: CPU usage is around 30% during the test; when the unit
begins to fail it actually goes down a bit.

Mbufs/clusters: All fine. They do rise quickly after the problem
happens, but that is because the established network connections are
still coming in at 600 a second while only being satisfied at a rate of,
say, 200 a second. The send queues then get big and mbuf usage goes up.
This is a side effect, not the cause of the failure. For the first x
minutes it's between 250 and 3000 mbufs (and clusters) used, and my max
is 65k/32k.

System logs: there are none. There is nothing suspicious anywhere until
the side effects kick in, e.g. mbufs running out etc. Squid also logs
nothing at all.

Memory: I've also checked whether I'm using too much memory and that's
not the case; swap is not used at all during the entire test.

This is the process of what happens:

1. PolyClt + PolySrv begin, 800 RPS.
2. ESTABLISHED netstat connections are around 2000 once 800 RPS is
   reached (about 20 seconds). CPU load is 30%, mbufs are available etc.
3. Once memory becomes full (quickly), disk drive usage begins. For
   comparison, squid -z puts the TPS per drive at well over 1000/s when
   I run it, whereas when the cache is doing 800 RPS the TPS is about 30
   per drive (low..).
4. After a period of time (almost always the same, +/- 60 seconds, for a
   given RPS) the ESTABLISHED connections start rising; at exactly the
   same time PolyClt starts showing a lower RPS. This is the "slow down"
   as such.
5. Because of this, PolyClt continues to send requests which the unit
   continues to accept. Quickly all available sockets are used, and the
   unit will then crash.

Interestingly enough though, if I stop PolyClt when this happens and
restart it in under 10 seconds, it continues on for another x minutes
without problem. If I leave it running the unit never comes right.
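For anyone trying to pin down the exact moment described in step 4, here is a minimal sketch of logging the ESTABLISHED count once a second with a timestamp. It assumes `netstat -an` output in which TCP lines start with `tcp` and end with the connection state; the function names `count_tcp_states` and `sample` are hypothetical and not part of Squid or Polygraph.

```python
import subprocess
import time
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state from `netstat -an` style text.

    Assumption: each TCP line starts with "tcp" (tcp, tcp4, tcp6) and its
    last whitespace-separated field is the state (ESTABLISHED, LISTEN, ...).
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Header lines and UDP lines are skipped by the "tcp" prefix check.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states


def sample(interval: float = 1.0, samples: int = 5) -> None:
    """Print "<unix time> <ESTABLISHED count>" once per interval."""
    for _ in range(samples):
        out = subprocess.run(
            ["netstat", "-an"], capture_output=True, text=True, check=True
        ).stdout
        print(int(time.time()), count_tcp_states(out)["ESTABLISHED"])
        time.sleep(interval)
```

Running `sample()` into a file from the start of a test would give a timestamped series that can be lined up against the PolyClt RPS output, as long as the box still has a free socket and process slot to run netstat.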
I have used "systat -vmstat 1", "systat -tcp 1", "systat -iostat 1" and
all the stats from Munin, and an MRTG graphing config for squid, and
they all show nothing of interest. The only result that changes between
working time and slow down is that the connections go through the roof
as explained above...

I have also seen it fail at 300 RPS, but only after 82 minutes, which
seems like a very long time if it was going to fail because of disk
load. The entire time the disks are very underloaded. That said, if I
use a null cache directory this doesn't happen.... I know that sounds
like it's clearly the drives - but 82 minutes??

Thanks for all the help
Dave

-----Original Message-----
From: Adrian Chadd [mailto:adrian@xxxxxxxxxxxxxxx]
Sent: Friday, November 09, 2007 11:55 AM
To: Dave Raven
Cc: 'Adrian Chadd'; squid-users@xxxxxxxxxxxxxxx
Subject: Re: Squid Performance (with Polygraph)

Check netstat -mb and see if you're running out of mbufs?
You haven't mentioned whether the CPU is being pegged at this point?

Adrian

On Fri, Nov 09, 2007, Dave Raven wrote:
> Hi all,
> Okay I've done some of what you requested, and unfortunately failed
> to find anything specific. I can pretty much guarantee the times at which
> the requests will slow down now. 600RPS = 15 minutes, 800 RPS = 11 minutes,
> 400 RPS = ~80 minutes.
>
> During that time (before and during the problem) systat -vmstat 1 shows the
> same interrupts - about 4000 on em1 (ifac) and 250 on hptmv0 - my controller
> for the SATA drives.
>
> If I use a systat -iostat 1 I can see that none of the drives are 100%
> utilized at any time during the test. Systat -tcp 1 also doesn't show me
> anything out of the ordinary. I have setup munin to monitor the host but
> unfortunately its not showing much.
>
> Also the problem is that when the problem begins, it starts filling up
> network connections - once it fills all the available ports nothing can
> monitor it :/
>
> I'm going to try use a different network card, then a different motherboard
> etc - try some different setups today. Thanks again for all the help and
> please let me know if anyone has any ideas...
>
> Thanks
> Dave
>
> -----Original Message-----
> From: Adrian Chadd [mailto:adrian@xxxxxxxxxxxxxxx]
> Sent: Friday, November 09, 2007 4:08 AM
> To: Dave Raven
> Cc: squid-users@xxxxxxxxxxxxxxx
> Subject: Re: Squid Performance (with Polygraph)
>
> On Thu, Nov 08, 2007, Dave Raven wrote:
> > Hi Adrian,
> > What would cause it to fail after a specific time though - if the
> > cache_mem is already full and its using the drives? I would have
> > thought it would fail immediately ?
> >
> > Also there are no log messages about failures or anything...
>
> Who knows :) its hard without having remote access, or lots of logging/
> statistics to correlate the trouble times with.
>
> Try installing munin and graph all the system-specific stuff. See what
> correlates against the failure time. You might notice something, like
> out of memory/paging, or an increase in interrupts, or something. ;)
>
> Thats all I can offer at the present time, sorry.
>
>
>
> Adrian
>
> --
> - Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -
> - $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -

--
- Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -
- $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -
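The failure times quoted in this thread can be turned into rough request totals, which makes it easier to see whether the slowdown tracks a fixed amount of work. The sketch below is back-of-the-envelope arithmetic only: the (RPS, minutes) pairs are the figures reported above (the 400 RPS time is approximate), the function names are hypothetical, and the in-flight estimate assumes steady state (Little's law), which only holds before the slowdown starts.

```python
# (requests per second, minutes until slowdown) as reported in the thread.
# The 400 RPS figure was given as "~80 minutes", so treat it as approximate.
observations = [(800, 11), (600, 15), (400, 80), (300, 82)]


def requests_before_slowdown(rps: int, minutes: int) -> int:
    """Total requests offered before the slowdown at a constant rate."""
    return rps * minutes * 60


def in_flight_seconds(established: int, rps: int) -> float:
    """Little's law estimate: average time a connection stays open,
    assuming the established count and the request rate are both steady."""
    return established / rps


for rps, minutes in observations:
    total = requests_before_slowdown(rps, minutes)
    print(f"{rps} RPS for {minutes} min -> {total} requests")

# About 2000 ESTABLISHED connections at 800 RPS, as in step 2 of the report.
print(in_flight_seconds(2000, 800), "seconds per connection on average")
```

On these numbers the 800 and 600 RPS runs both slow down after roughly 530,000 to 540,000 requests, while the 400 and 300 RPS runs get through about 1.9 million and 1.5 million, so a simple fixed-request-count threshold does not fit all four data points.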