Newer Linux systems with lots of cores have a problem I've been running
into a lot more lately, and I wanted to share some initial notes on it.
By "newer" I mean running the 2.6.32 kernel or later, since I mostly
track "enterprise" Linux distributions like RHEL6 and Debian Squeeze.
The issue is around Linux's zone_reclaim feature. When it pops up,
turning that feature off helps a lot. Details on what I understand of
the problem are below, and as always things may have changed already in
even newer kernels.
zone_reclaim tries to optimize memory speed on NUMA systems with more
than one CPU socket. There are some banks of memory that can be
"closer" to a particular socket, as measured by transfer rate, because
of how the memory is routed to the various cores on each socket. There
is no true default for this setting. Linux checks the hardware and
turns this on or off based on what transfer rate it sees between NUMA
nodes. It gets turned on where there is more than one node and its test
shows enough distance between them.
You can tell if this is turned on like this:
cat /proc/sys/vm/zone_reclaim_mode
Where 1 means it's enabled. Install the numactl utility and you can see
why it's made that decision:
# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 73718 MB
node 0 free: 419 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 73728 MB
node 1 free: 30 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Note how the "distance" for a transfer from node 0->0 or 1->1 is 10
units, while 0->1 or 1->0 is 21. That's what's tested at boot time,
where the benchmarked speed is turned into this abstract distance
number. And if there is a large difference in cross-zone timing, then
zone reclaim is enabled.
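If you want to pull the largest distance out of that output in a script,
here is a minimal sketch. The function name max_node_distance is
something I made up for illustration, and it assumes the "node
distances:" layout shown above. What counts as a "large" difference
depends on the kernel, so the sketch only reports the number and leaves
the comparison to you.

```shell
#!/bin/sh
# Hypothetical helper: print the largest value found in the
# "node distances" matrix read from stdin (numactl --hardware output).
max_node_distance() {
    awk '
        /^node distances:/ { in_matrix = 1; next }
        # Matrix rows start with "N:"; the "node 0 1" header row is skipped.
        in_matrix && $1 ~ /^[0-9]+:$/ {
            for (i = 2; i <= NF; i++) if ($i + 0 > max) max = $i + 0
        }
        END { print max + 0 }
    '
}

# Usage on a live system (requires numactl to be installed):
#   numactl --hardware | max_node_distance
```

It prints 0 if no distance matrix is found in the input.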
Scott Marlowe has been griping about this on the mailing lists here for
a while now, and it's increasingly causing trouble on systems I've been
seeing lately too. This is a well known problem with MySQL:
http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
and NUMA issues have impacted Oracle too. On PostgreSQL, shared_buffers
isn't normally set as high as MySQL's buffer cache, making it a bit less
vulnerable to this class of problem. But it's surely still a big
problem for PostgreSQL on some systems.
I've taken to disabling /proc/sys/vm/zone_reclaim_mode on any Linux
system where it's turned on now. I'm still working through whether it
also makes sense in all cases to use the more complicated memory
interleaving suggestions that MySQL users have implemented, something
most people would need to push into their PostgreSQL server startup
scripts in /etc/init.d. (That will be a fun rpm/deb packaging issue to
deal with if this becomes more widespread.) Suggestions on whether that
is necessary, or if just disabling zone_reclaim is enough, are welcome
from anyone who wants to try and benchmark it.
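For anyone who wants to try it, here is a rough sketch of the disabling
step. The helper name disable_zone_reclaim is hypothetical, and it takes
the file path as an optional argument only so the logic can be tried on
a scratch file before touching /proc. The sysctl.conf line and the
numactl invocation in the comments are the usual ways to do these
things, but I haven't benchmarked the interleaving approach against
PostgreSQL, and the data directory path is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: write 0 to a zone_reclaim_mode file unless it is
# already 0. Defaults to the real /proc path when no argument is given.
disable_zone_reclaim() {
    mode_file="${1:-/proc/sys/vm/zone_reclaim_mode}"
    current=$(cat "$mode_file") || return 1
    if [ "$current" != "0" ]; then
        echo 0 > "$mode_file" || return 1
        echo "zone_reclaim_mode changed from $current to 0"
    else
        echo "zone_reclaim_mode already 0"
    fi
}

# As root on a live system:
#   disable_zone_reclaim
# To keep the setting across reboots, add this line to /etc/sysctl.conf:
#   vm.zone_reclaim_mode = 0
# To experiment with interleaving (placeholder data directory):
#   numactl --interleave=all pg_ctl start -D /path/to/data
```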
Note that this is all tricky to test because some of the bad behavior
only happens when the server runs this zone reclaim method, which isn't
a trivial situation to create at will. Servers that have this problem
tend to have it pop up intermittently: you'll see one incredibly slow
query periodically while most are fast. It all depends on exactly which
core is executing, where the memory it needs is, and whether the server
wants to reclaim memory as part of that (and just what that means is
its own complicated topic).
--
Greg Smith 2ndQuadrant US greg@xxxxxxxxxxxxxxx Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com