Linux memory zone reclaim

Newer Linux systems with lots of cores have a problem that I've been running into a lot more lately, and I wanted to share some initial notes on it. By "newer" I mean running the 2.6.32 kernel or later, since I mostly track "enterprise" Linux distributions like RHEL6 and Debian Squeeze. The issue is around Linux's zone_reclaim feature. When it pops up, turning that feature off helps a lot. Details on what I understand of the problem are below, and as always things may have changed already in even newer kernels.

zone_reclaim tries to optimize memory speed on NUMA systems with more than one CPU socket. There are some banks of memory that can be "closer" to a particular socket, as measured by transfer rate, because of how the memory is routed to the various cores on each socket. There is no true default for this setting. Linux checks the hardware and turns this on or off based on what transfer rate it sees between NUMA nodes, enabling it when there is more than one node and its test shows some distance between them. You can tell if this is turned on like this:

cat /proc/sys/vm/zone_reclaim_mode

Where 1 means it's enabled. Install the numactl utility and you can see why it's made that decision:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 73718 MB
node 0 free: 419 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 73728 MB
node 1 free: 30 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Note how the "distance" for a transfer from node 0->0 or 1->1 is 10 units, while 0->1 or 1->0 is 21. That's what's tested at boot time, where the benchmarked speed is turned into this abstract distance number. If there is a large difference in cross-zone timing, then zone reclaim is enabled.
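To make that comparison easier to repeat across several servers, here is a minimal sketch that pulls the largest remote (off-diagonal) distance out of `numactl --hardware` output. The function name `max_remote_distance` is made up for illustration, and the parsing assumes the table layout shown above with nodes numbered contiguously from 0; it does not try to reproduce the kernel's own decision.

```shell
# Hypothetical helper: read `numactl --hardware` output on stdin and print
# the largest node-to-node distance, ignoring the local (diagonal) entries.
# Assumes nodes are numbered 0..N-1 in the same order as the columns.
max_remote_distance() {
    awk '
        /^node distances:/ { in_table = 1; next }
        # Data rows look like "  0:  10  21"; the header row starts with "node".
        in_table && $1 ~ /^[0-9]+:$/ {
            row = $1
            sub(/:/, "", row)
            for (i = 2; i <= NF; i++) {
                col = i - 2
                if (col != row + 0 && $i + 0 > max) max = $i + 0
            }
        }
        END { print max + 0 }
    '
}

# Usage (needs numactl installed):
#   numactl --hardware | max_remote_distance
```

On the two-node system above this prints 21. A single-node machine has no remote entries, so it prints 0.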

Scott Marlowe has been griping about this on the mailing lists here for a while now, and it's increasingly causing trouble on systems I've been seeing lately too. This is a well known problem with MySQL: http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ and NUMA issues have impacted Oracle too. On PostgreSQL, shared_buffers isn't normally set as high as MySQL's buffer cache, making it a bit less vulnerable to this class of problem. But it's surely still a big problem for PostgreSQL on some systems.

I've taken to disabling /proc/sys/vm/zone_reclaim_mode on any Linux system where it's turned on now. I'm still working through whether it also makes sense in all cases to use the more complicated memory interleaving suggestions that MySQL users have implemented, something most people would need to push into their PostgreSQL server startup scripts in /etc/init.d. (That will be a fun rpm/deb packaging issue to deal with if this becomes more widespread.) Suggestions on whether that is necessary, or if just disabling zone_reclaim is enough, are welcome from anyone who wants to try and benchmark it.
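For anyone who wants to try the same thing, here is a rough sketch of one way to turn the setting off and keep it off across reboots. The helper name `persist_zone_reclaim_off` is hypothetical, and it assumes a distribution that reads /etc/sysctl.conf at boot and has GNU sed available. The runtime commands need root, so they are left as comments.

```shell
# Hypothetical helper: make sure a sysctl.conf-style file sets
# vm.zone_reclaim_mode to 0, replacing an existing entry or appending one.
# The file path is a parameter; it defaults to /etc/sysctl.conf.
persist_zone_reclaim_off() {
    conf="${1:-/etc/sysctl.conf}"
    if grep -Eq '^[[:space:]]*vm\.zone_reclaim_mode[[:space:]]*=' "$conf" 2>/dev/null; then
        # An entry already exists: rewrite it in place (GNU sed syntax).
        sed -i -E 's/^[[:space:]]*vm\.zone_reclaim_mode[[:space:]]*=.*/vm.zone_reclaim_mode = 0/' "$conf"
    else
        # No entry yet: append one.
        echo 'vm.zone_reclaim_mode = 0' >> "$conf"
    fi
}

# Usage, as root:
#   sysctl -w vm.zone_reclaim_mode=0     # change the running kernel now
#   persist_zone_reclaim_off             # keep the change after a reboot
#
# The interleaving approach MySQL users describe wraps the server start in
# numactl. For PostgreSQL that might look like the line below, where the
# pg_ctl invocation and $PGDATA are placeholders for your own startup script:
#   numactl --interleave=all pg_ctl start -D "$PGDATA"
```

Whether the interleaving step is worth the trouble on top of disabling zone_reclaim is exactly the open question here, so treat that last line as something to benchmark rather than a recommendation.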

Note that this is all tricky to test because some of the bad behavior only happens when the server runs this zone reclaim method, which isn't a trivial situation to create at will. Servers that have this problem tend to have it pop up intermittently: you'll see one incredibly slow query periodically while most are fast. It all depends on exactly what core is executing, where the memory it needs is located, and whether the server wants to reclaim memory as part of that (and just what that means is its own complicated topic).
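Since the behavior is intermittent, one thing that may help when testing is to snapshot kernel counters before and after a slow period. This is only a sketch: `vmstat_counter` is a made-up helper, and it assumes your kernel exposes NUMA counters such as `numa_miss` and `zone_reclaim_failed` in /proc/vmstat, which depends on the kernel version and build options. A counter that isn't present simply produces no output.

```shell
# Hypothetical helper: print the value of one named counter from a
# vmstat-format file ("name value" per line). Returns non-zero if the
# counter is not present. The file defaults to /proc/vmstat.
vmstat_counter() {
    name="$1"
    file="${2:-/proc/vmstat}"
    awk -v key="$name" '
        $1 == key { print $2; found = 1 }
        END { exit !found }
    ' "$file"
}

# Usage: record values, run the workload, then compare.
#   before=$(vmstat_counter numa_miss)
#   ... run the test queries ...
#   after=$(vmstat_counter numa_miss)
#   echo "numa_miss grew by $((after - before))"
```

A growing counter doesn't prove that zone reclaim caused a particular slow query, but it at least shows whether cross-node allocation activity lines up with the times the slowdowns appear.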

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

