Linux memory zone reclaim

Greg Smith <greg@xxxxxxxxxxxxxxx> · Tue, 17 Jul 2012 21:52:11 -0400

Newer Linux systems with lots of cores have a problem I've been running 
into a lot more lately, and I wanted to share some initial notes on it.  
By "newer" I mean running the 2.6.32 kernel or later, since I mostly 
track "enterprise" Linux distributions like RHEL6 and Debian Squeeze.  
The issue is around Linux's zone_reclaim feature.  When it pops up, 
turning that feature off helps a lot.  Details on what I understand of 
the problem are below, and as always things may have changed already in 
even newer kernels.

zone_reclaim tries to optimize memory speed on NUMA systems with more 
than one CPU socket.  There are some banks of memory that can be 
"closer" to a particular socket, as measured by transfer rate, because 
of how the memory is routed to the various cores on each socket.  There 
is no true default for this setting.  Linux checks the hardware and 
turns this on/off based on what transfer rate it sees between NUMA 
nodes, where there is more than one and its test shows some distance 
between them.  You can tell if this is turned on like this:

cat /proc/sys/vm/zone_reclaim_mode

Where 1 means it's enabled.  Install the numactl utility and you can see 
why it's made that decision:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 73718 MB
node 0 free: 419 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 73728 MB
node 1 free: 30 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Note how the "distance" for a transfer from node 0->0 or 1->1 is 10 
units, while 0->1 or 1->0 is 21.  That's what's tested at boot time, 
where the benchmarked speed is turned into this abstract distance 
number.  And if there is a large difference in cross-zone timing, then 
zone reclaim is enabled.
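
If you'd rather not install numactl, the same distance numbers can be 
read from sysfs, one row per node.  Here's a minimal sketch that pulls 
out the largest one; max_node_distance is just a helper name made up for 
this example, and the sysfs path is assumed to be present on your kernel:

```shell
# Print the largest value in a NUMA distance matrix read from stdin.
# Each input line is one node's row, e.g. "10 21".
max_node_distance() {
    awk '{ for (i = 1; i <= NF; i++) if ($i + 0 > max + 0) max = $i }
         END { print max + 0 }'
}

# On a live system the rows come from sysfs, one file per node.
# Skipped quietly if this machine doesn't expose them.
if ls /sys/devices/system/node/node*/distance >/dev/null 2>&1; then
    cat /sys/devices/system/node/node*/distance | max_node_distance
fi
```

Fed the two rows from the numactl output above, it prints 21.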

Scott Marlowe has been griping about this on the mailing lists here for 
a while now, and it's increasingly been trouble for systems I've been 
seeing lately too.  This is a well known problem with MySQL:  
http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ 
and NUMA issues have impacted Oracle too.  On PostgreSQL, shared_buffers 
isn't normally set as high as MySQL's buffer cache, making it a bit less 
vulnerable to this class of problem.  But it's surely still a big 
problem for PostgreSQL on some systems.

I've taken to disabling /proc/sys/vm/zone_reclaim_mode on any Linux 
system where it's turned on now.  I'm still working through whether it 
also makes sense in all cases to use the more complicated memory 
interleaving suggestions that MySQL users have implemented, something 
most people would need to push into their PostgreSQL server startup 
scripts in /etc/init.d.  (That will be a fun rpm/deb packaging issue to 
deal with if this becomes more widespread.)  Suggestions on whether that 
is necessary, or if just disabling zone_reclaim is enough, are welcome 
from anyone who wants to try and benchmark it.
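
For anyone who wants to try it, here's a rough sketch of the disabling 
step.  The function name disable_zone_reclaim is made up for this 
example, and the optional path argument exists only so it can be tried 
on a scratch file first.  The numactl line in the comments is the MySQL 
interleaving idea translated to a PostgreSQL start command; that part is 
an untested assumption, not a recommendation.

```shell
# Write 0 to the zone_reclaim_mode control file if it is currently nonzero.
# $1 is an optional path, defaulting to the real control file under /proc.
disable_zone_reclaim() {
    ctl="${1:-/proc/sys/vm/zone_reclaim_mode}"
    [ -r "$ctl" ] || { echo "no $ctl on this system" >&2; return 1; }
    if [ "$(cat "$ctl")" != "0" ]; then
        echo 0 > "$ctl"          # needs root for the real /proc file
    fi
}

# Real usage, as root:
#   disable_zone_reclaim
# Equivalent one-liner with sysctl:
#   sysctl -w vm.zone_reclaim_mode=0
# Interleaving sketch (untested assumption; $PGDATA is your data directory):
#   numactl --interleave=all pg_ctl start -D "$PGDATA"
```

Either form only lasts until reboot.  To make it stick, the usual place 
is a "vm.zone_reclaim_mode = 0" line in /etc/sysctl.conf.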

Note that this is all tricky to test because some of the bad behavior 
only happens when the server runs this zone reclaim method, which isn't 
a trivial situation to create at will.  Servers that have this problem 
tend to have it pop up intermittently:  you'll see one incredibly slow 
query periodically while most are fast.  It all depends on exactly what 
core is executing, where the memory it needs is, and whether the server 
wants to reclaim memory (and just what that means is its own 
complicated topic) as part of that.
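
One way to at least line up slow queries with reclaim activity is to log 
when a kernel counter moves.  This is only a sketch:  the helper names 
are made up, the interval and sample count are arbitrary, and it assumes 
your kernel lists a counter such as zone_reclaim_failed in /proc/vmstat, 
which you should check before relying on it.

```shell
# Print the value of one named counter from vmstat-style "name value" lines.
# Reads /proc/vmstat by default; $2 can name another file for a dry run.
# Returns nonzero if the counter is not present.
vmstat_counter() {
    awk -v name="$1" '$1 == name { print $2; found = 1 }
                      END { if (!found) exit 1 }' "${2:-/proc/vmstat}"
}

# Log a timestamped line each time the counter changes between samples.
# $2 is the interval in seconds, $3 the number of samples to take.
watch_counter() {
    name="$1"; interval="${2:-5}"; samples="${3:-12}"
    prev=$(vmstat_counter "$name") || {
        echo "$name not present in /proc/vmstat" >&2; return 1; }
    while [ "$samples" -gt 0 ]; do
        sleep "$interval"
        cur=$(vmstat_counter "$name") || return 1
        [ "$cur" != "$prev" ] &&
            echo "$(date '+%Y-%m-%d %H:%M:%S') $name $prev -> $cur"
        prev=$cur
        samples=$((samples - 1))
    done
}
```

Something like "watch_counter zone_reclaim_failed 5 720" would sample 
every 5 seconds for an hour; the timestamps can then be compared against 
the slow entries in the PostgreSQL log.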

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
