(resending to list)

Hi Kyle,

I'd like to +10 this old proposal of yours. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logging I saw that when these OSDs are getting marked down, even the osd tick message is not printed for >30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin was running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that's used for Cinder RBD volumes (and doesn't suffer from the freezing OSD problem):

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing is creating significant memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1, 10, 1000, 10000). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale as these OSDs fill up (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed, all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm documentation now says:

  "zone_reclaim_mode is disabled by default. For file servers or workloads
   that benefit from having their data cached, zone_reclaim_mode should be
   left disabled as the caching effect is likely to be more important than
   data locality."

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all <ceph command> on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.
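For reference, the knobs involved look roughly like this (the sysctl.d path and the ceph-osd invocation below are only illustrative -- adjust them for your distro and init setup):

    # check the current setting (1 means zone reclaim is enabled)
    cat /proc/sys/vm/zone_reclaim_mode

    # disable it at runtime
    sysctl -w vm.zone_reclaim_mode=0

    # make it persistent across reboots
    echo "vm.zone_reclaim_mode = 0" > /etc/sysctl.d/99-disable-zone-reclaim.conf

    # optionally also interleave an OSD's memory across all NUMA nodes
    numactl --interleave=all ceph-osd -i 0 --cluster ceph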
Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO:

"On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader <kyle.bader@xxxxxxxxx> wrote:
> It seems that NUMA can be problematic for ceph-osd daemons in certain
> circumstances. Namely, it seems that if a NUMA zone is running out of
> memory due to uneven allocation, it is possible for that zone to
> enter reclaim mode when threads/processes are scheduled on a core in
> that zone and those processes request memory allocations greater
> than the zone's remaining memory. In order for the kernel to satisfy
> the memory allocation for those processes, it needs to page out some of
> the contents of the contended zone, which can have dramatic
> performance implications due to cache misses, etc. I see two ways an
> operator could alleviate these issues:
>
> Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
> ceph-osd daemons with "numactl --interleave=all". This should probably
> be activated by a flag in /etc/default/ceph and modifying the
> ceph-osd.conf upstart script, along with adding a dependency on the
> "numactl" package in the ceph package's debian/rules file.
>
> The alternative is to use a cgroup for each ceph-osd daemon, pinning
> each one to cores in the same NUMA zone using cpuset.cpus and
> cpuset.mems. This would probably also live in /etc/default/ceph and
> the upstart scripts.
>
> --
> Kyle
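For completeness, a minimal sketch of the cpuset alternative Kyle describes, using a cgroup v1 cpuset (the cgroup name, core range and OSD_PID placeholder below are only examples -- check your topology with numactl --hardware first):

    # create a cpuset cgroup bound to the cores and memory of NUMA node 0
    mkdir /sys/fs/cgroup/cpuset/ceph-osd-0
    echo 0-7 > /sys/fs/cgroup/cpuset/ceph-osd-0/cpuset.cpus
    echo 0   > /sys/fs/cgroup/cpuset/ceph-osd-0/cpuset.mems

    # move the daemon into it (replace OSD_PID with the ceph-osd pid)
    echo OSD_PID > /sys/fs/cgroup/cpuset/ceph-osd-0/tasks

In practice this would be wired into /etc/default/ceph and the upstart scripts, as Kyle suggests.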