On 01/12/2015 07:47 AM, Dan van der Ster wrote:
(resending to list) Hi Kyle, I'd like to +10 this old proposal of yours. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for >30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that's used for Cinder RBD volumes (and doesn't suffer from the freezing OSD problem):

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem.

In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem. It seems that scrubbing creates quite some memory pressure (especially via the inode cache), so I started testing different vfs_cache_pressure values (1, 10, 1000, 10000). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!).

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says:

    zone_reclaim_mode is disabled by default. For file servers or workloads
    that benefit from having their data cached, zone_reclaim_mode should be
    left disabled as the caching effect is likely to be more important than
    data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all <ceph command> on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.
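If anyone wants to check the same settings on their own OSD hosts, something like this quick sketch (just reading the standard procfs paths; adjust for your distro/kernel as needed) will print them:

    #!/usr/bin/env python
    # Print the VM settings discussed above. A quick sketch: the procfs
    # paths are the standard ones, but treat this as illustrative only.

    SETTINGS = [
        "/proc/sys/vm/zone_reclaim_mode",   # 0 recommended for OSD hosts
        "/proc/sys/vm/vfs_cache_pressure",  # kernel default is 100
    ]

    for path in SETTINGS:
        try:
            with open(path) as f:
                print("%s = %s" % (path, f.read().strip()))
        except IOError as e:
            print("%s: unreadable (%s)" % (path, e))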
Moving forward, I think it would be good for Ceph to at least document this behaviour, but better still would be to detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disabled it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."
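Something along these lines (a rough illustrative sketch, not actual Ceph code) would be all that such a startup check needs:

    #!/usr/bin/env python
    # Rough sketch of the MongoDB-style startup warning proposed above.
    # Not Ceph code; just illustrates the check.
    import sys

    ZONE_RECLAIM = "/proc/sys/vm/zone_reclaim_mode"

    def warn_if_zone_reclaim_enabled():
        try:
            with open(ZONE_RECLAIM) as f:
                mode = int(f.read().strip())
        except (IOError, ValueError):
            return  # sysctl absent or unreadable: nothing to warn about
        if mode != 0:
            sys.stderr.write(
                "WARNING: %s is %d; this can cause long stalls under memory "
                "pressure. 0 is recommended for OSD hosts.\n"
                % (ZONE_RECLAIM, mode))

    if __name__ == "__main__":
        warn_if_zone_reclaim_enabled()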
Interestingly, I was seeing behavior that looked like this as well, though it manifested as OSDs going down with internal heartbeat timeouts during heavy 4K read/write benchmarks. I was able to observe major page faults on the OSD nodes correlated with the "frozen" periods, despite significant amounts of memory being used for buffer cache. I also went down the vfs_cache_pressure path, and changing it to 1 fixed the issue. I didn't think to go back and look at zone_reclaim_mode though!

Since then the system has been upgraded to Fedora 21 with a 3.17 kernel and the issue no longer manifests; I suspect now that this is due to the new default behavior. Perhaps this solves the puzzle. What is interesting is that, at least on our test node, this behavior didn't occur prior to firefly. It may be that some change we made exacerbated the problem. Anyway, thank you for the excellent analysis Dan!
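If anyone wants to watch for the same symptom, something like the following (a quick sketch that samples majflt out of /proc/<pid>/stat for running ceph-osd processes; illustrative only, not a polished tool) should show whether the OSDs are taking major faults during a freeze:

    #!/usr/bin/env python
    # Sample major page faults (majflt, field 12 of /proc/<pid>/stat per
    # proc(5)) for ceph-osd processes over a 10s window. Sketch only.
    import glob, time

    def osd_majflt():
        counts = {}
        for stat in glob.glob("/proc/[0-9]*/stat"):
            try:
                with open(stat) as f:
                    data = f.read()
            except IOError:
                continue  # process exited between glob and open
            # comm is wrapped in parentheses; split after the closing one
            comm = data[data.index("(") + 1:data.rindex(")")]
            if comm != "ceph-osd":
                continue
            fields = data[data.rindex(")") + 2:].split()
            pid = stat.split("/")[2]
            counts[pid] = int(fields[9])  # majflt
        return counts

    if __name__ == "__main__":
        before = osd_majflt()
        time.sleep(10)
        after = osd_majflt()
        for pid in sorted(after):
            print("osd pid %s: %d major faults in 10s"
                  % (pid, after[pid] - before.get(pid, 0)))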
Mark