On 12/12/2013 04:30 PM, Kyle Bader wrote:
It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running low on memory due to uneven allocation, that zone can enter reclaim mode when threads/processes scheduled on a core in it request memory allocations larger than the zone's remaining memory. To satisfy those allocations, the kernel needs to page out some of the contents of the contended zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues:
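For context, a rough way to check whether this is happening on a given box (assuming the numactl package is installed for numastat; the commands themselves are standard):

    # Non-zero means the kernel prefers reclaiming pages in the local
    # node over allocating from a remote node.
    cat /proc/sys/vm/zone_reclaim_mode

    # Per-node allocation counters; growing numa_miss/numa_foreign
    # counts suggest uneven allocation across nodes.
    numastat

    # Per-node free/total memory breakdown.
    numactl --hardware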
Yes, quite possibly I think, though I'd be curious to see what impact testing would show on modern dual-socket Intel boxes. I suspect this could be an issue on quad-socket AMD boxes in particular, especially Magny-Cours era ones.
Set the vm.zone_reclaim_mode sysctl to 0 and prefix the ceph-osd daemons with "numactl --interleave=all". This should probably be activated by a flag in /etc/default/ceph and by modifying the ceph-osd.conf upstart script, along with adding a dependency on the "numactl" package to the ceph package's debian/rules file. The alternative is to use a cgroup for each ceph-osd daemon, pinning each one to cores in the same NUMA zone using cpuset.cpus and cpuset.mems. This would probably also live in /etc/default/ceph and the upstart scripts. A rough sketch of both approaches is below.
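Roughly what I have in mind, done by hand rather than via the packaging (the sysctl.d file name, the cpuset mount point, the core range 0-7 for node 0, and $OSD_PID are all placeholders):

    # Option 1: disable zone reclaim and interleave OSD allocations
    # across all NUMA nodes.
    echo "vm.zone_reclaim_mode = 0" > /etc/sysctl.d/60-ceph-numa.conf
    sysctl -p /etc/sysctl.d/60-ceph-numa.conf
    numactl --interleave=all /usr/bin/ceph-osd -i 0

    # Option 2: pin one OSD to a single NUMA node with a cpuset cgroup
    # (cgroup v1 paths; assumes node 0 owns cores 0-7).
    mkdir -p /sys/fs/cgroup/cpuset/ceph-osd.0
    echo 0-7 > /sys/fs/cgroup/cpuset/ceph-osd.0/cpuset.cpus
    echo 0   > /sys/fs/cgroup/cpuset/ceph-osd.0/cpuset.mems
    echo $OSD_PID > /sys/fs/cgroup/cpuset/ceph-osd.0/cgroup.procs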
Seems reasonable unless we are testing OSDs that we (eventually?) want to have utilize cores on multiple sockets. If possible, pinning the OSD to whichever CPU hosts the associated PCIe bus and NIC would be ideal, though there's no really good automated way to do that yet afaik.
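The NUMA node a NIC hangs off can at least be read from sysfs, so a manual version of that pinning might look something like this (eth0, node 0, and OSD id 0 are just examples; -1 from sysfs means no affinity is reported):

    # Which NUMA node owns the NIC's PCIe slot.
    cat /sys/class/net/eth0/device/numa_node

    # Bind an OSD's CPUs and memory allocations to that node.
    numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0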
Mark