On 01/08/2013 01:04 PM, Scott Marlowe wrote:
> Assembly language on the brain. Of course I meant NOOP.
Ok, in that case, these are completely separate things. For IO
scheduling, there's Completely Fair Queuing (CFQ), NOOP, Deadline, and
so on.
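As an aside, the active IO scheduler is visible and switchable per block
device through sysfs; "sda" below is just a placeholder device name:

  cat /sys/block/sda/queue/scheduler      # prints something like: noop deadline [cfq]
  echo deadline > /sys/block/sda/queue/scheduler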
For process scheduling, at least recently, there's the Completely Fair
Scheduler (CFS) or nothing. So far as I can tell, there is no
alternative process scheduler, just as there's no alternative memory
manager I can tell to stop flushing my freaking active file cache due to
phantom memory pressure. ;)
The tweaks I was discussing in this thread effectively do two things
(the rough sysctl incantations are after the list):
1. Stop process grouping by TTY.
On servers, that grouping really is a net performance loss, especially
with heavily forked apps like PG. With grouping enabled, system % is
about 5% lower since the scheduler is doing less work, but at the cost
of less spreading across available CPUs. Our systems see a 30%
performance hit with grouping enabled; others may see more or less.
2. Less aggressive process scheduling.
The O(log N) scheduler heuristics collapse at high process counts for
some reason, causing the scheduler to spend more and more time planning
CPU assignments until it spirals completely out of control. I've seen
this behavior on kernels from 3.0 straight through 3.5, so it looks like
an inherent weakness of CFS. By increasing the migration cost, we make
the scheduler do less work, less often, and that weird 70+% system CPU
spike vanishes.
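For the record, here's roughly what those two tweaks look like as
sysctls on a 3.x kernel. The migration cost value is illustrative rather
than a recommendation, and later kernels rename that knob to
sched_migration_cost_ns:

  # 1. Disable per-TTY process grouping (autogroup)
  sysctl -w kernel.sched_autogroup_enabled=0

  # 2. Raise the migration cost (nanoseconds) so the scheduler rebalances
  #    less aggressively; the stock default is 500000 (0.5ms)
  sysctl -w kernel.sched_migration_cost=5000000

The same lines go in /etc/sysctl.conf if you want them to survive a
reboot.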
My guess is that the increased migration cost basically pushes out the
point at which the scheduler would freak out. I've tested up to 2000
connections and it responds fine, whereas before we were seeing flaky
results at as few as 700 connections.
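If anyone wants to poke at this themselves, a plain read-only pgbench
run at increasing client counts reproduces the same kind of load. The
commands below are a sketch with a made-up database name and scale, not
our actual benchmark:

  pgbench -i -s 100 mydb                   # build a scale-100 test database
  pgbench -S -c 700 -j 8 -T 300 mydb       # 700 clients, select-only, 5 minutes
  pgbench -S -c 2000 -j 16 -T 300 mydb     # needs max_connections >= 2000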
My guess as to why? I think it's due to VSZ as perceived by the
scheduler. To switch processes, it also has to preload L2 and L3 cache
for the assigned process. As the number of PG connections increases,
each with its own VSZ/RSS allocation, the scheduler has more thinking to
do. At the point where the sum of VSZ/RSS eclipses the amount of
available RAM, the scheduler loses nearly all decision-making ability
and craps its pants.
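A crude way to sanity-check that theory, assuming the backends show up
as "postgres" in ps, is to sum VSZ and RSS across them and compare the
result against physical RAM:

  # sum VSZ/RSS (reported by ps in KiB) over all postgres backends
  ps -C postgres -o vsz=,rss= | \
    awk '{v+=$1; r+=$2} END {printf "VSZ %.1f GiB, RSS %.1f GiB\n", v/1048576, r/1048576}'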
This would also explain why I'm seeing something similar with memory.
At high connection counts, %used looks fine and we have over 40GB free
for caching, but VSZ/RSS are both way bigger than the available cache,
so perceived memory pressure causes kswapd to continuously purge the
active cache pool into inactive, and inactive into free, all while reads
from the device try to refill the active pool. It's an IO feedback loop,
and it kicks in at around the same number of connections that used to
make the process scheduler die. Too much of a coincidence, in my
opinion.
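You can watch the churn happen with nothing fancier than procfs and
vmstat while the load is running:

  # active/inactive file cache pools shrinking and refilling
  watch -n1 'grep -E "^(Active|Inactive)\(file\)" /proc/meminfo'

  # si/so stay at zero (no real swapping), but bi and system time climb
  vmstat 1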
But unlike the process scheduler, the memory manager has no good knobs I
can turn to fix this behavior. At least, not in 3.0, 3.2, or 3.4
kernels.
But I freely admit I'm just speculating based on observed behavior. I
know neither jack nor squat about internal kernel mechanics. Anyone who
actually *isn't* talking out of his ass is free to interject. :)
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@xxxxxxxxxxxxxxxx