Re: Two Necessary Kernel Tweaks for Linux Systems

When I checked, both of these settings exist on my CentOS 6.x host (kernel 2.6.32-279.5.1.el6.x86_64).

However, the autogroup_enabled was already set to 0. (The migration_cost was set to the 0.5ms default noted in the OP.) So I don't know that this is strictly limited to kernel 3.0.
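
For anyone else wanting to check, both knobs can be read straight from /proc/sys. A minimal Python sketch, assuming the usual paths (newer kernels may expose the second knob as sched_migration_cost_ns instead):

    import os

    def read_knob(name):
        # Illustrative helper: read a scheduler knob from
        # /proc/sys/kernel, if this kernel exposes it.
        path = os.path.join("/proc/sys/kernel", name)
        if not os.path.exists(path):
            return None
        with open(path) as f:
            return f.read().strip()

    print(read_knob("sched_autogroup_enabled"))  # "1" = TTY autogrouping on
    print(read_knob("sched_migration_cost"))     # in ns; 500000 = 0.5ms default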

Is there an "easy" way to tell what scheduler my OS is using?

-AJ


On 1/8/2013 2:32 PM, Shaun Thomas wrote:
On 01/08/2013 01:04 PM, Scott Marlowe wrote:

Assembly language on the brain. Of course I meant NOOP.

Ok, in that case, these are completely separate things. For IO scheduling, there's Completely Fair Queuing (CFQ), NOOP, Deadline, and so on.
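
(Incidentally, for the IO side there is an easy way to tell: each block device exposes its elevator in sysfs, with the active one shown in brackets. A minimal sketch, assuming a device named sda:)

    # The active IO scheduler is the bracketed entry,
    # e.g. "noop deadline [cfq]".
    with open("/sys/block/sda/queue/scheduler") as f:
        print(f.read().strip())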

For process scheduling, at least recently, there's the Completely Fair Scheduler (CFS) or nothing. So far as I can tell, there is no alternative process scheduler, just as I can't find an alternative memory manager that I can tell to stop flushing my freaking active file cache due to phantom memory pressure. ;)

The tweaks I was discussing in this thread effectively do two things:

1. Stop process grouping by TTY.

On servers, this really is a net performance loss, especially for heavily forking apps like PG. With grouping enabled, system % is about 5% lower since the scheduler is doing less work, but at the cost of spreading processes less evenly across available CPUs. Our systems see a 30% performance hit with grouping enabled; others may see more or less.
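
A minimal sketch of this tweak applied at runtime (requires root; persist it with kernel.sched_autogroup_enabled = 0 in /etc/sysctl.conf):

    # Tweak 1: stop the scheduler from grouping processes by TTY.
    with open("/proc/sys/kernel/sched_autogroup_enabled", "w") as f:
        f.write("0")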

2. Less aggressive process scheduling.

The O(log N) scheduler heuristics collapse at high process counts for some reason, causing the scheduler to spend more and more time planning CPU assignments until it spirals completely out of control. I've seen this behavior on kernels from 3.0 straight through 3.5, so it looks like an inherent weakness of CFS. By increasing the migration cost, we make the scheduler do less work less often, and that weird 70+% system CPU spike vanishes.

My guess is that the increased migration cost basically pushes out the point at which the scheduler would freak out. I've tested up to 2000 connections and it responds fine, whereas before we were seeing flaky results as early as 700 connections.
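
The corresponding sketch for this tweak (requires root; 5000000 ns is 5ms, a tenfold increase over the 0.5ms default, and is illustrative rather than a magic number):

    # Tweak 2: raise the migration cost so the scheduler reshuffles
    # processes less eagerly. Persist with
    # "kernel.sched_migration_cost = 5000000" in /etc/sysctl.conf.
    with open("/proc/sys/kernel/sched_migration_cost", "w") as f:
        f.write("5000000")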

My guess as to why this is? I think it's due to VSZ as perceived by the scheduler. When it switches processes, it also has to preload the L2 and L3 caches for the assigned process. As the number of PG connections increases, each with its own VSZ/RSS allocation, the scheduler has more thinking to do. At the point where the sum of VSZ/RSS eclipses the amount of available RAM, the scheduler loses nearly all decision-making ability and craps its pants.

This would also explain why I'm seeing something similar with memory. At high connection counts, %used is fine and we have over 40GB free for caching, yet VSZ/RSS are both way bigger than the available cache, so perceived memory pressure causes kswapd to continuously purge the active cache pool into inactive, and inactive into free, all while the device attempts to refill the active pool. It's an IO feedback loop, and it kicks in around the same number of connections that used to make the process scheduler die. Too much of a coincidence, in my opinion.

But unlike the process scheduler, there are no good knobs to turn that will fix the memory manager's behavior. At least, not in 3.0, 3.2, or 3.4 kernels.

But I freely admit I'm just speculating based on observed behavior. I know neither jack nor squat about internal kernel mechanics. Anyone who actually *isn't* talking out of his ass is free to interject. :)




