When I checked, both of these settings exist on my CentOS 6.x host
(2.6.32-279.5.1.el6.x86_64).
However, autogroup_enabled was already set to 0. (The
migration_cost was set to the 0.5ms default noted in the OP.) So I
don't know if this is strictly limited to kernel 3.0.
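In case it helps anyone else check, the values are just the
/proc/sys/kernel entries; a quick Python sketch along these lines is
all it took here (sched_migration_cost is the name on 2.6/3.x kernels;
newer ones may call it sched_migration_cost_ns):

    # Read the two scheduler tunables discussed in this thread from
    # /proc/sys/kernel; prints a note if the kernel lacks one.
    def read_tunable(name):
        try:
            with open('/proc/sys/kernel/' + name) as f:
                return f.read().strip()
        except IOError:
            return 'not present on this kernel'

    for name in ('sched_autogroup_enabled', 'sched_migration_cost'):
        print('%s = %s' % (name, read_tunable(name)))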
Is there an "easy" way to tell what scheduler my OS is using?
-AJ
On 1/8/2013 2:32 PM, Shaun Thomas wrote:
On 01/08/2013 01:04 PM, Scott Marlowe wrote:
Assembly language on the brain. Of course I meant NOOP.
Ok, in that case, these are completely separate things. For IO
scheduling, there's Completely Fair Queuing (CFQ), NOOP, Deadline,
and so on.
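(Re: AJ's question about telling which one is in play: the block layer
reports the available elevators per device in sysfs, with the active
one shown in brackets. A rough sketch, nothing more:)

    # List each block device's IO scheduler; the active elevator is the
    # one in [brackets], e.g. "noop deadline [cfq]".
    import glob

    for path in glob.glob('/sys/block/*/queue/scheduler'):
        dev = path.split('/')[3]
        with open(path) as f:
            print('%s: %s' % (dev, f.read().strip()))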
For process scheduling, at least recently, there's the Completely Fair
Scheduler (CFS) or nothing. So far as I can tell, there is no
alternative process scheduler. Just as I can't find an alternative
memory manager that I can tell to stop flushing my freaking active
file cache due to phantom memory pressure. ;)
The tweaks I was discussing in this thread effectively do two things
(there's a sketch of applying both after item 2):
1. Stop process grouping by TTY.
On servers, this really is a net performance loss, especially on
heavily forked apps like PG. With grouping enabled, system % is about
5% lower since the scheduler is doing less work, but at the cost of
less spreading across available CPUs. Our systems see a 30%
performance hit with grouping enabled; others may see more or less.
2. Less aggressive process scheduling.
The O(log N) scheduler heuristics collapse at high process counts for
some reason, causing the scheduler to spend more and more time
planning CPU assignments until it spirals completely out of control.
I've seen this behavior on kernels from 3.0 straight through 3.5, so
it looks like an inherent weakness of CFS. By increasing the migration
cost, we make the scheduler do less work, less often, so that weird
70+% system CPU spike vanishes.
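If anyone wants to try both tweaks, they're just sysctl writes; a
minimal sketch is below (run as root, and note the migration cost
value here is only an example, not a recommendation):

    # Apply the two tweaks by writing /proc/sys/kernel directly
    # (equivalent to sysctl -w). Must run as root.
    SETTINGS = {
        'sched_autogroup_enabled': '0',     # 1. stop process grouping by TTY
        'sched_migration_cost': '5000000',  # 2. raise migration cost (ns); example value only
    }

    for name, value in SETTINGS.items():
        with open('/proc/sys/kernel/' + name, 'w') as f:
            f.write(value)
        print('set %s = %s' % (name, value))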
My guess is that the increased migration cost basically pushes back
the point at which the scheduler would freak out. I've tested up to 2000
connections, and it responds fine, whereas before we were seeing flaky
results as early as 700 connections.
My guess as to why this is? I think it's due to VSZ as perceived by
the scheduler. To swap processes, it also has to preload L2 and L3
cache for the assigned process. As the number of PG connections
increases, all with their own VSZ/RSS allocations, the scheduler has
more thinking to do. At the point where the sum of VSZ/RSS eclipses the
amount of available RAM, the scheduler loses nearly all
decision-making ability and craps its pants.
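A rough way to eyeball that sum is to walk /proc and total up
VmSize/VmRSS for the backends; a quick sketch (process names assumed
to be postgres or postmaster, values in /proc/<pid>/status are in kB):

    # Rough total of VSZ/RSS across postgres backends, from /proc/<pid>/status.
    import glob

    totals = {'VmSize': 0, 'VmRSS': 0}
    for status in glob.glob('/proc/[0-9]*/status'):
        try:
            with open(status) as f:
                text = f.read()
        except IOError:
            continue  # process exited while we were walking /proc
        if 'Name:\tpostgres' not in text and 'Name:\tpostmaster' not in text:
            continue
        for line in text.splitlines():
            for key in totals:
                if line.startswith(key + ':'):
                    totals[key] += int(line.split()[1])  # kB

    for key, kb in sorted(totals.items()):
        print('%s total: %.1f GB' % (key, kb / 1048576.0))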
This would also explain why I'm seeing something similar with memory.
At high connection counts, even though %used is fine and we have over
40GB free for caching, VSZ/RSS are both way bigger than the available
cache, so memory pressure causes kswapd to continuously purge the
active cache pool into inactive, and inactive into free, all while the
device attempts to fill the active pool. It's an IO feedback loop, and
it kicks in around the same number of connections that used to make
the process scheduler die. Too much of a coincidence, in my opinion.
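For anyone who wants to watch this happen, polling a few /proc/meminfo
counters is enough to see Active(file) draining while plenty of memory
sits unused; a quick sketch:

    # Poll the page cache pools so you can watch kswapd push Active(file)
    # into Inactive(file) and free while free memory stays high.
    import time

    FIELDS = ('MemFree', 'Active(file)', 'Inactive(file)')

    def meminfo():
        vals = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, rest = line.split(':', 1)
                if key in FIELDS:
                    vals[key] = int(rest.split()[0])  # kB
        return vals

    while True:
        snap = meminfo()
        print('  '.join('%s=%d MB' % (k, snap[k] // 1024) for k in FIELDS))
        time.sleep(5)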
But unlike the process scheduler, there are no good knobs to turn that
will fix the memory manager's behavior. At least, not in 3.0, 3.2, or
3.4 kernels.
But I freely admit I'm just speculating based on observed behavior. I
know neither jack, nor squat about internal kernel mechanics. Anyone
who actually *isn't* talking out of his ass is free to interject. :)
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance