When I checked, both of these settings exist on my CentOS 6.x host
(2.6.32-279.5.1.el6.x86_64).
However, autogroup_enabled was already set to 0. (The
migration_cost was set to the 0.5ms default noted in the OP.) So I
don't know if this is strictly limited to kernel 3.0.
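In case it helps anyone else check, the values are just the
/proc/sys/kernel entries; a quick Python sketch along these lines is
all it took here (sched_migration_cost is the name on 2.6/3.x kernels;
newer ones may call it sched_migration_cost_ns):

    # Read the two scheduler tunables discussed in this thread from
    # /proc/sys/kernel; prints a note if the kernel lacks one.
    def read_tunable(name):
        try:
            with open('/proc/sys/kernel/' + name) as f:
                return f.read().strip()
        except IOError:
            return 'not present on this kernel'

    for name in ('sched_autogroup_enabled', 'sched_migration_cost'):
        print('%s = %s' % (name, read_tunable(name)))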
Is there an "easy" way to tell what scheduler my OS is using?
-AJ
On 1/8/2013 2:32 PM, Shaun Thomas wrote:
On 01/08/2013 01:04 PM, Scott Marlowe wrote:
Assembly language on the brain. Of course I meant NOOP.
Ok, in that case, these are completely separate things. For IO
scheduling, there's Completely Fair Queuing (CFQ), NOOP, Deadline,
and so on.
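(Re: AJ's question about telling which one is in play: the block layer
reports the available elevators per device in sysfs, with the active
one shown in brackets. A rough sketch, nothing more:)

    # List each block device's IO scheduler; the active elevator is the
    # one in [brackets], e.g. "noop deadline [cfq]".
    import glob

    for path in glob.glob('/sys/block/*/queue/scheduler'):
        dev = path.split('/')[3]
        with open(path) as f:
            print('%s: %s' % (dev, f.read().strip()))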
For process scheduling, at least recently, there's the Completely Fair
Scheduler (CFS) or nothing. So far as I can tell, there is no
alternative process scheduler. Just as I can't find an alternative
memory manager that I can tell to stop flushing my freaking active
file cache due to phantom memory pressure. ;)
The tweaks I was discussing in this thread effectively do two things
(there's a sketch of applying both after item 2):
1. Stop process grouping by TTY.
On servers, this really is a net performance loss, especially on
heavily forked apps like PG. With grouping enabled, system % is about
5% lower since the scheduler is doing less work, but at the cost of
less spreading across available CPUs. Our systems see a 30%
performance hit with grouping enabled; others may see more or less.
2. Less aggressive process scheduling.
The O(log N) scheduler heuristics collapse at high process counts for
some reason, causing the scheduler to spend more and more time
planning CPU assignments until it spirals completely out of control.
I've seen this behavior on kernels from 3.0 straight through 3.5, so
it looks like an inherent weakness of CFS. By increasing the migration
cost, we make the scheduler do less work, less often, so that weird
70+% system CPU spike vanishes.
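If anyone wants to try both tweaks, they're just sysctl writes; a
minimal sketch is below (run as root, and note the migration cost
value here is only an example, not a recommendation):

    # Apply the two tweaks by writing /proc/sys/kernel directly
    # (equivalent to sysctl -w). Must run as root.
    SETTINGS = {
        'sched_autogroup_enabled': '0',     # 1. stop process grouping by TTY
        'sched_migration_cost': '5000000',  # 2. raise migration cost (ns); example value only
    }

    for name, value in SETTINGS.items():
        with open('/proc/sys/kernel/' + name, 'w') as f:
            f.write(value)
        print('set %s = %s' % (name, value))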
My guess is that the increased migration cost basically pushes back
the point at which the scheduler would freak out. I've tested up to 2000
connections, and it responds fine, whereas before we were seeing flaky
results as early as 700 connections.
My guess as to why this is? I think it's due to VSZ as perceived by
the scheduler. To swap processes, it also has to preload L2 and L3
cache for the assigned process. As the number of PG connections
increases, all with their own VSZ/RSS allocations, the scheduler has
more thinking to do. At the point where the sum of VSZ/RSS eclipses the
amount of available RAM, the scheduler loses nearly all
decision-making ability and craps its pants.
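A rough way to eyeball that sum is to walk /proc and total up
VmSize/VmRSS for the backends; a quick sketch (process names assumed
to be postgres or postmaster, values in /proc/<pid>/status are in kB):

    # Rough total of VSZ/RSS across postgres backends, from /proc/<pid>/status.
    import glob

    totals = {'VmSize': 0, 'VmRSS': 0}
    for status in glob.glob('/proc/[0-9]*/status'):
        try:
            with open(status) as f:
                text = f.read()
        except IOError:
            continue  # process exited while we were walking /proc
        if 'Name:\tpostgres' not in text and 'Name:\tpostmaster' not in text:
            continue
        for line in text.splitlines():
            for key in totals:
                if line.startswith(key + ':'):
                    totals[key] += int(line.split()[1])  # kB

    for key, kb in sorted(totals.items()):
        print('%s total: %.1f GB' % (key, kb / 1048576.0))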
This would also explain why I'm seeing something similar with memory.
At high connection counts, even though %used is fine and we have over
40GB free for caching, VSZ/RSS are both way bigger than the available
cache, so memory pressure causes kswapd to continuously purge the
active cache pool into inactive, and inactive into free, all while the
device attempts to fill the active pool. It's an IO feedback loop, and
it kicks in around the same number of connections that used to make
the process scheduler die. Too much of a coincidence, in my opinion.
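For anyone who wants to watch this happen, polling a few /proc/meminfo
counters is enough to see Active(file) draining while plenty of memory
sits unused; a quick sketch:

    # Poll the page cache pools so you can watch kswapd push Active(file)
    # into Inactive(file) and free while free memory stays high.
    import time

    FIELDS = ('MemFree', 'Active(file)', 'Inactive(file)')

    def meminfo():
        vals = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, rest = line.split(':', 1)
                if key in FIELDS:
                    vals[key] = int(rest.split()[0])  # kB
        return vals

    while True:
        snap = meminfo()
        print('  '.join('%s=%d MB' % (k, snap[k] // 1024) for k in FIELDS))
        time.sleep(5)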
But unlike the process scheduler, there are no good knobs to turn that
will fix the memory manager's behavior. At least, not in 3.0, 3.2, or
3.4 kernels.
But I freely admit I'm just speculating based on observed behavior. I
know neither jack, nor squat about internal kernel mechanics. Anyone
who actually *isn't* talking out of his ass is free to interject. :)
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance