Re: Two Necessary Kernel Tweaks for Linux Systems

"Midge Brown" <midgems@xxxxxxxxxxxxx> · Tue, 8 Jan 2013 10:25:52 -0800

The kernel on our Linux system doesn't appear to have these two settings
according to the list provided by sysctl -a. Please pardon my ignorance,
but should I add them?

We have PostgreSQL 9.0 on
Linux 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Thanks,
Midge

  ----- Original Message -----
  From: Shaun Thomas
  To: pgsql-performance@xxxxxxxxxxxxxx
  Sent: Wednesday, January 02, 2013 1:46 PM
  Subject: Two Necessary Kernel Tweaks for Linux Systems

Hey everyone!

After much testing and hair-pulling, we've confirmed two kernel settings
that should always be modified in production Linux systems, especially
newer ones that use the Completely Fair Scheduler (CFS) as opposed to
the O(1) scheduler.

If you want to follow along, these are:

/proc/sys/kernel/sched_migration_cost
/proc/sys/kernel/sched_autogroup_enabled

These correspond to the sysctl settings:

kernel.sched_migration_cost
kernel.sched_autogroup_enabled
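
Not every kernel exposes both of these. A minimal sketch for checking,
assuming Python is available on the box; the function name and the
"base" parameter are only illustrative, and a missing file is reported
as None rather than treated as an error:

```python
from pathlib import Path

# The two scheduler settings discussed in this post.
SETTINGS = ("sched_migration_cost", "sched_autogroup_enabled")


def read_sched_settings(base="/proc/sys/kernel"):
    """Return {name: value-as-string, or None if the kernel has no such file}."""
    result = {}
    for name in SETTINGS:
        path = Path(base) / name
        try:
            result[name] = path.read_text().strip()
        except FileNotFoundError:
            # This kernel does not expose the setting under `base`.
            result[name] = None
    return result
```

Calling read_sched_settings() with no arguments looks under
/proc/sys/kernel; a None value means there is nothing to tune for that
setting on the running kernel.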

What do these settings do?
--------------------------

* sched_migration_cost

The migration cost is the total time the scheduler will consider a
migrated process "cache hot" and thus less likely to be re-migrated. By
default, this is 0.5ms (500000 ns), and as the process table grows, that
default eventually causes the scheduler to break down. On our systems,
after a smooth degradation with increasing connection count, system CPU
spiked from 20% to 70% sustained and TPS was cut by 5-10x once we
crossed some invisible connection count threshold. For us, that was a
pgbench with 900 or more clients.

The migration cost should be increased almost universally on server
systems with many processes. This means systems like PostgreSQL or
Apache would benefit from having higher migration costs. We've had good
luck with a setting of 5ms (5000000 ns) instead.
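
Since the setting is expressed in nanoseconds, it is easy to be off by a
factor of ten when typing it in. A small sketch of the unit conversion,
using only the two values quoted above (the helper name is made up for
illustration):

```python
def ms_to_ns(ms):
    """Convert milliseconds to whole nanoseconds."""
    return int(round(ms * 1_000_000))


default_ns = ms_to_ns(0.5)  # 500000, the default quoted above
tuned_ns = ms_to_ns(5)      # 5000000, the value that worked for us

# The tuned value is ten times the default.
ratio = tuned_ns // default_ns
```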

When the breakdown occurs, system CPU (as obtained from sar) increases
from 20% on a heavy pgbench (scale 3500 on a 72GB system) to over 70%,
and %nice/%user is cut by half or more. A higher migration cost
essentially eliminates this artificial throttle.

* sched_autogroup_enabled

This is a relatively new patch which Linus lauded back in late 2010. It
basically groups tasks by TTY so perceived responsiveness is improved.
But on server systems, large daemons like PostgreSQL are going to be
launched from the same pseudo-TTY, and be effectively choked out of CPU
cycles in favor of less important tasks.

The default setting is 1 (enabled) on some platforms. By setting this to
0 (disabled), we saw an outright 30% performance boost on the same
pgbench test. A fully cached scale 3500 database on a 72GB system went
from 67k TPS to 82k TPS with 900 client connections.
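
To keep both values across reboots, one common approach is to put them
in /etc/sysctl.conf and load them with sysctl -p. The sketch below only
renders the lines; it assumes the running kernel actually exposes both
keys, and the dictionary and function names are hypothetical:

```python
# Values taken from this post; adjust for your own workload.
RECOMMENDED = {
    "kernel.sched_migration_cost": 5000000,  # 5 ms, expressed in nanoseconds
    "kernel.sched_autogroup_enabled": 0,     # 0 = disabled
}


def render_sysctl_conf(settings):
    """Render settings as 'key = value' lines in sysctl.conf syntax."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


fragment = render_sysctl_conf(RECOMMENDED)
```

Printing fragment gives one "key = value" line per setting, which can be
appended to /etc/sysctl.conf by hand once you have confirmed the keys
exist on your kernel.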

Total Benefit
-------------

At higher connection counts, such as on systems that can't use pooling
or that make extensive use of prepared queries, these settings can
massively affect performance. At 900 connections, our test systems were
at 17k TPS unaltered, but 85k TPS after these two modifications. Even
with this performance boost, we still had 40% CPU free instead of 0%. In
effect, the logarithmic performance of the new scheduler is returned to
normal under large process tables.
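
For reference, the ratio implied by the two 900-connection figures above
(17k and 85k TPS) can be checked with a quick calculation; the function
name is only illustrative:

```python
def speedup(before_tps, after_tps):
    """Return how many times faster the 'after' throughput is."""
    return after_tps / before_tps


baseline_tps = 17_000  # unaltered kernel settings, 900 connections
tuned_tps = 85_000     # after both modifications, 900 connections

factor = speedup(baseline_tps, tuned_tps)
print(f"{factor:.1f}x")  # prints 5.0x
```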

Some systems will have a higher "cracking" point than others. The effect
is amplified when a system is under high memory pressure, hence a lot of
expensive queries on a high number of concurrent connections is the
easiest way to replicate these results.

Admins migrating from older systems (RHEL 5.x) may find this especially
shocking, because the old O(1) scheduler was too "stupid" to have these
advanced features, hence it was impossible to cause this kind of
behavior.

There's probably still a little room for improvement here, since 30-40%
CPU is still unclaimed in our larger tests. I'd like to see the total
performance drop (175k ideal TPS at 24 connections) decreased. But these
kernel tweaks are rarely discussed anywhere, it seems. There doesn't
seem to be any consensus on how these (and other) scheduler settings
should be modified under different usage scenarios.

I just figured I'd share, since we found this info so beneficial.

-- 
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@xxxxxxxxxxxxxxxx
