On 12/12/2013 04:30 PM, Kyle Bader wrote:
It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running low on memory due to uneven allocation, that zone can enter reclaim mode when threads/processes scheduled on a core in it request memory allocations larger than the zone's remaining memory. To satisfy those allocations, the kernel needs to page out some of the contents of the contended zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues:
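For context, a rough way to check whether this is happening on a given box (assuming the numactl package is installed for numastat; the commands themselves are standard):

    # Non-zero means the kernel prefers reclaiming pages in the local
    # node over allocating from a remote node.
    cat /proc/sys/vm/zone_reclaim_mode

    # Per-node allocation counters; growing numa_miss/numa_foreign
    # counts suggest uneven allocation across nodes.
    numastat

    # Per-node free/total memory breakdown.
    numactl --hardware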
Yes, quite possibly I think, though I'd be curious to see what impact testing would show on modern dual-socket Intel boxes. I suspect this could be an issue on quad-socket AMD boxes in particular, especially Magny-Cours era ones.
Set the vm.zone_reclaim_mode sysctl to 0 and prefix the ceph-osd daemons with "numactl --interleave=all". This should probably be activated by a flag in /etc/default/ceph and by modifying the ceph-osd.conf upstart script, along with adding a dependency on the "numactl" package to the ceph package's debian/rules file. The alternative is to use a cgroup for each ceph-osd daemon, pinning each one to cores in the same NUMA zone using cpuset.cpus and cpuset.mems. This would probably also live in /etc/default/ceph and the upstart scripts. A rough sketch of both approaches is below.
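Roughly what I have in mind, done by hand rather than via the packaging (the sysctl.d file name, the cpuset mount point, the core range 0-7 for node 0, and $OSD_PID are all placeholders):

    # Option 1: disable zone reclaim and interleave OSD allocations
    # across all NUMA nodes.
    echo "vm.zone_reclaim_mode = 0" > /etc/sysctl.d/60-ceph-numa.conf
    sysctl -p /etc/sysctl.d/60-ceph-numa.conf
    numactl --interleave=all /usr/bin/ceph-osd -i 0

    # Option 2: pin one OSD to a single NUMA node with a cpuset cgroup
    # (cgroup v1 paths; assumes node 0 owns cores 0-7).
    mkdir -p /sys/fs/cgroup/cpuset/ceph-osd.0
    echo 0-7 > /sys/fs/cgroup/cpuset/ceph-osd.0/cpuset.cpus
    echo 0   > /sys/fs/cgroup/cpuset/ceph-osd.0/cpuset.mems
    echo $OSD_PID > /sys/fs/cgroup/cpuset/ceph-osd.0/cgroup.procs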
Seems reasonable unless we are testing OSDs that we (eventually?) want to have utilize cores on multiple sockets. If possible, pinning the OSD to whichever CPU hosts the associated PCIe bus and NIC would be ideal, though there's no really good automated way to do that yet afaik.
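The NUMA node a NIC hangs off can at least be read from sysfs, so a manual version of that pinning might look something like this (eth0, node 0, and OSD id 0 are just examples; -1 from sysfs means no affinity is reported):

    # Which NUMA node owns the NIC's PCIe slot.
    cat /sys/class/net/eth0/device/numa_node

    # Bind an OSD's CPUs and memory allocations to that node.
    numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0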
Mark