I notice several things in the data below:
Throughput:
* you have persistent connections disabled
=> thus each request requires a new pair of TCP sockets
(one client-side and one server-side)
* TIME_WAIT lasts between 5 and 15 minutes
(after the 250-user mark)
* you are receiving 57 req/sec
=> thus ~114 sockets/sec are being consumed and left in TIME_WAIT state
* at 114 sockets/sec the 64K socket range will be exhausted in around
10 minutes (worked through below)
... 10-12 minutes after the start of ramp-up you have a sharp drop in
the graph, which recovers after a minute or two to the previous level.
(during the degradation)
* you are receiving 91 req/sec
=> thus ~180 sockets/sec are being consumed and left in TIME_WAIT state
* at 180 sockets/sec the 64K socket range will be exhausted in around
6 minutes
... strangely, 10 minutes after the early trough you have a sharp peak
of traffic, followed by the long slide in performance.
Theory:
pt1) Ten minutes in, the available TCP sockets are all either in use or
in TIME_WAIT. This causes a drop in accepted connections and thus in
throughput, until TIME_WAIT sockets become available again.
pt2) when the TIME_WAIT sockets are released after the trough, they are
picked up by the higher backlog of waiting client connections, reducing
the number of sockets available for server connections - which is where
the traffic is sourced.
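An easy way to test pt1 is to watch the TIME_WAIT count on the Squid
box while the load test runs, with whichever standard tool is handy,
e.g.:

  watch -n 5 'netstat -ant | grep -c TIME_WAIT'
  # or the TCP summary (including timewait) from:
  ss -s

If that count climbs toward the size of the available port range just
before the trough, the theory holds.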
Also, your reports consistently show socket accept() counts much higher
than socket connect() counts. That is only reasonable if the difference
is made up of HIT traffic, and your HIT rate is far too low to account
for all of it.
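If it helps to cross-check those numbers, the raw socket syscall
counters can be sampled from the cache manager (assuming squidclient is
installed and the manager interface is reachable) and compared between
two points in the test, e.g.:

  squidclient mgr:counters | grep syscalls.sock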
Experiment: try enabling both server and client persistent connections,
with a short idle timeout (~15 sec) if you need fast turnover. A rough
squid.conf sketch follows.
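Something along these lines (directive names quoted from memory, so
check them against your Squid version's documentation before use):

  # client side: reuse client connections, close them after 15s idle
  client_persistent_connections on
  persistent_request_timeout 15 seconds

  # server side: reuse server connections, close them after 15s idle
  server_persistent_connections on
  pconn_timeout 15 seconds

That lets sockets be reused instead of burning a new pair per request,
while the short timeouts stop idle connections from piling up.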
CPU usage:
* sits between 4% and 12% while pumping traffic, even in the
past-capacity slowdown period
=> not nearly enough to be a processing bottleneck.
Select loops:
* 1K/sec during the fast-traffic period
* relaying 3.5MB/sec
* 7K/sec to 9K/sec in the periods you indicate as slow
* relaying 4.7MB/sec
=> hints that Squid is looping once per packet or so.
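The rough per-loop numbers behind that hint (taking ~8K loops/sec as a
midpoint of the slow periods):

  fast period: 3.5 MB/sec / 1K loops/sec ~= 3500 bytes per loop
  slow period: 4.7 MB/sec / 8K loops/sec ~= 600 bytes per loop

i.e. during the slowdown each select() cycle is moving well under one
MTU-sized packet's worth of data.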
Amos