Re: Linux swapping with MySQL/InnoDB due to NUMA architecture imbalanced allocations?

Jeremy Cole <jeremy@xxxxxxxx> · Mon, 27 Sep 2010 18:58:44 -0700

Dave,

Thanks for your response.  This is helpful.  My testing with
"numactl --interleave=all" is going well: so far it indicates that
interleaving completely eliminates the swapping without incurring a
measurable performance penalty for my workload.

> Your situation sounds pretty familiar.  It happens a lot when
> applications are moved over to a NUMA system for the first time.  Your
> interleaving solution is a decent one, although teaching the database
> about NUMA is a much better long-term approach.

That's exactly my thought.  I've read through the NUMA API
documentation, and it looks like it's not at all insurmountable to, at
the very least, ensure that the big cache allocations (InnoDB buffer
pool, MyISAM key buffer, etc.) are done interleaved, while leaving
most of the rest alone.  My biggest worry with using "numactl
--interleave=all" is that all of the small buffers allocated for the
use of a single thread (for instance query text buffer, sorting
buffers, etc.) will get spread around.  I'm still working to
completely understand the implications of this on performance, but I
don't think it will be terribly bad -- much better than the current
swapping situation, certainly.
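
To make the idea of interleaving only the big allocations concrete, here is a minimal sketch that requests one interleaved region from libnuma and leaves everything else under the default policy.  It is an illustration only, not how MySQL does or would do it: it assumes libnuma is installed and calls it through Python's ctypes, and the function name and the 1 MiB default size are made up for the example.  numa_available(), numa_alloc_interleaved() and numa_free() are the libnuma calls being exercised.

```python
import ctypes
import ctypes.util


def interleaved_alloc_demo(size=1 << 20):
    """Allocate, touch, and free `size` bytes interleaved across NUMA nodes.

    Hypothetical helper for illustration. Returns True if the interleaved
    allocation succeeded, False if libnuma or NUMA support is unavailable.
    """
    name = ctypes.util.find_library("numa")
    if name is None:
        return False
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        return False

    lib.numa_available.restype = ctypes.c_int
    if lib.numa_available() < 0:
        return False  # libnuma reports that the NUMA API cannot be used here

    lib.numa_alloc_interleaved.restype = ctypes.c_void_p
    lib.numa_alloc_interleaved.argtypes = [ctypes.c_size_t]
    lib.numa_free.restype = None
    lib.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    # Only this region gets the interleave policy; ordinary allocations
    # made elsewhere in the process keep the default (local) policy.
    ptr = lib.numa_alloc_interleaved(size)
    if not ptr:
        return False
    ctypes.memset(ptr, 0, size)  # touch the pages so they are faulted in
    lib.numa_free(ptr, size)
    return True


if __name__ == "__main__":
    print("interleaved allocation worked:", interleaved_alloc_demo())
```

A buffer pool would be the kind of region allocated this way, while per-thread buffers would continue to use ordinary allocation.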

> As far as the decisions about running reclaim or swapping versus going
> to another node for an allocation, take a look at the
> "zone_reclaim_mode" bits in Documentation/sysctl/vm.txt .  It does a
> decent job of explaining what we do.

I had read about zone_reclaim_mode, and I've also been testing
different settings for it, but I don't think it actually completely
solves the situation here.  It seems to be primarily concerned with
allocations that *could* happen anywhere, whereas I think what we're
often seeing is that memory for whatever reason (which is not
completely obvious to me) *must* be allocated on Node X, but Node X
has no free memory and no caches to free.

Nonetheless, I have to admit that I don't completely understand the
documentation for zone_reclaim_mode in its current form.  Perhaps you
could answer a few questions?  I feel that the documentation could be
updated with some important answers, which are missing now:

1. What "zone reclaim" actually means.  My understanding is that "zone
reclaim" is the practice of freeing memory on the specific node where
memory was preferentially requested (due to NUMA memory allocation
policy, by default "local") rather than satisfying the allocation
using free memory from wherever it is currently available.

2. It isn't terribly clear what the default (0) policy is, and it
could use an explanation.  Here's my take on it:

When zone_reclaim_mode = 0, programs requesting memory to be allocated
on a particular node will only receive memory on the requested node if
free memory is available.  If no free memory is available on the
requested node, but free memory is available on a different node, the
allocation will be made there unless policy forbids it.  If no free
memory is available on any node, then the normal cache freeing and
paging out policies will apply to make free memory available on any
node to satisfy the allocation. [Is there any preference for which
node caches are freed from in this case?]

Is this correct?
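
The take on zone_reclaim_mode = 0 above can be written down as a small decision function.  This is only a model of the understanding stated in this message, not the kernel's actual logic: the function name, the dictionary of free page counts, and the choice to try fallback nodes in numeric order are all assumptions made for the sketch.

```python
def pick_node_mode0(requested, free_pages, allowed=None):
    """Model which node serves an allocation when zone_reclaim_mode = 0.

    requested:  node the allocation would prefer
    free_pages: dict mapping node id -> number of free pages (toy data)
    allowed:    set of nodes the memory policy permits, or None for all

    Returns the chosen node, or None when no permitted node has free
    memory, in which case normal cache freeing and paging would be needed.
    """
    if allowed is None:
        allowed = set(free_pages)

    # Step 1: use the requested node if it has free memory.
    if requested in allowed and free_pages.get(requested, 0) > 0:
        return requested

    # Step 2: otherwise fall back to any other permitted node with free
    # memory.  Numeric order is an arbitrary choice for this sketch.
    for node in sorted(allowed):
        if node != requested and free_pages.get(node, 0) > 0:
            return node

    # Step 3: nothing free anywhere that policy permits.
    return None


if __name__ == "__main__":
    print(pick_node_mode0(0, {0: 0, 1: 128}))             # falls back to node 1
    print(pick_node_mode0(0, {0: 0, 1: 128}, allowed={0}))  # policy forbids fallback
```

The second call models the situation described earlier, where memory must come from one node and that node has nothing free.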

3. I found the descriptions of the possible values a bit too terse
to be usable.  Here are some efforts to refine the definitions:

  a. "1 = Zone reclaim on" -- This means that cache pages will be
freed to make free memory to satisfy the request only if they are not
dirty.

  b. "2 = Zone reclaim writes dirty pages out" -- This means that
dirty cache pages will be written out and then freed if no clean pages
are available to be freed.  This incurs additional cost due to disk
I/O.

  c. "4 = Zone reclaim swaps pages" -- This means that anonymous pages
may be swapped out to disk and then freed if no clean pages are
available to be freed and (if the "2" value is also set) no dirty
cache pages are available to be written out and freed.  This incurs
additional cost due to swap I/O.

Do those refinements make sense and are they correct?
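
Since the three values are ORed together, a small decoder makes it easier to see what a given setting enables.  The sketch below uses the bit values and descriptions quoted from the documentation above; the function names are made up for illustration, and it assumes the setting is exposed at /proc/sys/vm/zone_reclaim_mode.

```python
from pathlib import Path

# Values and short descriptions as quoted from Documentation/sysctl/vm.txt.
RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(value):
    """Return the descriptions of every bit set in a zone_reclaim_mode value."""
    if value == 0:
        return ["zone reclaim off"]
    return [text for bit, text in RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read the current setting, or return None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_zone_reclaim_mode()
    if current is None:
        print("zone_reclaim_mode is not readable on this system")
    else:
        print(current, describe_zone_reclaim_mode(current))
```

For example, a value of 5 decodes to "zone reclaim on" plus "zone reclaim swaps pages", without the dirty-page writeout bit.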

4. How is it determined that "pages from remote zones will cause a
measurable performance reduction"?  My understanding is that this is
based on whether the node distance, as reported by "numactl
--hardware", is greater than RECLAIM_DISTANCE (defined as 20 by
default).  If it is, the kernel sets zone_reclaim_mode to 1 by
default, meaning cache pages may be freed on a particular node to
make free memory, so that programs requesting memory on that node
are preferentially allocated there.
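
The distance check described in this item can be sketched as follows.  It encodes only the understanding stated here, which is unconfirmed: the threshold of 20 is the default named above, the function names are made up, and the sketch assumes the per-node distance rows are readable from /sys/devices/system/node/nodeN/distance (the same numbers "numactl --hardware" prints).

```python
from pathlib import Path

# Default named in this message; treated here as an assumption.
RECLAIM_DISTANCE = 20


def read_node_distances(base="/sys/devices/system/node"):
    """Return {node id: [distance to node 0, distance to node 1, ...]}."""
    distances = {}
    for entry in Path(base).glob("node[0-9]*"):
        try:
            node = int(entry.name[len("node"):])
            row = [int(x) for x in (entry / "distance").read_text().split()]
        except (OSError, ValueError):
            continue  # skip anything that is not a readable node entry
        distances[node] = row
    return distances


def exceeds_reclaim_distance(distances, threshold=RECLAIM_DISTANCE):
    """True if any node-to-node distance is greater than the threshold."""
    return any(d > threshold for row in distances.values() for d in row)


if __name__ == "__main__":
    table = read_node_distances()
    print("distances:", table)
    print("any distance > RECLAIM_DISTANCE:", exceeds_reclaim_distance(table))
```

On a two-node machine reporting distances of 10 (local) and 21 (remote), this returns True, which under the reading above would correspond to zone_reclaim_mode defaulting to 1.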

5. I cannot parse/understand this statement at all: "Allowing regular
swap effectively restricts allocations to the local node unless
explicitly overridden by memory policies or cpuset configurations." --
Could this be rephrased and/or explained?

Thanks, again, everyone.

Regards,

Jeremy
