Re: OSD servers swapping despite having free memory capacity

Warren Wang <Warren.Wang@xxxxxxxxxxx> · Tue, 23 Jan 2018 23:44:32 +0000

Check /proc/buddyinfo for memory fragmentation. We have some pretty severe memory fragmentation issues with Ceph, to the point where we keep an excessive vm.min_free_kbytes configured (8GB) and are starting to order more memory than we actually need. If you have a lot of objects, you may find that you need to raise vm.vfs_cache_pressure as well, for example back to the default of 100 if it has been tuned lower.
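For reference, a minimal Python sketch of how those two settings could be written out as sysctl lines. The helper names (gib_to_kbytes, sysctl_lines) are made up for illustration, and the values are simply the ones mentioned above (8GB and 100), not a recommendation for any particular cluster.

```python
# Sketch only: build sysctl.conf-style lines for the two knobs mentioned above.
# The helper names are hypothetical; the values come from this message.

def gib_to_kbytes(gib):
    """Convert GiB to the kilobyte unit that vm.min_free_kbytes expects."""
    return gib * 1024 * 1024


def sysctl_lines(min_free_gib, cache_pressure):
    """Return the two settings as 'key = value' strings."""
    return [
        "vm.min_free_kbytes = %d" % gib_to_kbytes(min_free_gib),
        "vm.vfs_cache_pressure = %d" % cache_pressure,
    ]


for line in sysctl_lines(8, 100):
    print(line)
```

The printed lines could go into a file under /etc/sysctl.d/ and be loaded with `sysctl --system`, assuming that is how sysctl settings are managed on the node.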

In your buddyinfo, each column is the number of free contiguous blocks of a given size, doubling from left to right (4K, 8K, 16K and so on with 4K pages). So if you only see non-zero numbers in the first 2 columns, you only have 4K and 8K blocks available, and any allocation larger than that will fail. The problem is so severe for us that we have stopped using jumbo frames, due to dropped packets as a result of not being able to DMA map pages that will fit 9K frames.
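A rough Python sketch of reading buddyinfo that way. It assumes 4K base pages and the usual "Node N, zone NAME count count ..." line layout; the SAMPLE line is invented for illustration, not taken from a real node, and the function names are hypothetical.

```python
# Sketch: summarise /proc/buddyinfo-style text. Column N holds the count of
# free blocks of order N, i.e. runs of 2**N contiguous pages.
PAGE_KB = 4  # assumption: 4 KiB base pages

# Invented sample: free blocks only in the first two columns (4K and 8K).
SAMPLE = "Node 0, zone   Normal   5000   1200   0   0   0   0   0   0   0   0   0\n"


def parse_buddyinfo(text):
    """Return a list of (node, zone, counts) tuples, one per zone line."""
    zones = []
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        counts = [int(c) for c in parts[4:]]
        zones.append((node, parts[3], counts))
    return zones


def largest_free_block_kb(counts):
    """Size in KiB of the largest free contiguous block, 0 if none."""
    for order in range(len(counts) - 1, -1, -1):
        if counts[order] > 0:
            return PAGE_KB * (2 ** order)
    return 0


def free_blocks_at_least(counts, size_kb):
    """Number of free blocks whose size is at least size_kb."""
    return sum(n for order, n in enumerate(counts)
               if PAGE_KB * (2 ** order) >= size_kb)


for node, zone, counts in parse_buddyinfo(SAMPLE):
    print("node", node, "zone", zone,
          "largest free block:", largest_free_block_kb(counts), "KiB,",
          "blocks >= 16 KiB:", free_blocks_at_least(counts, 16))
```

On a live node, pass open("/proc/buddyinfo").read() instead of SAMPLE. The 16 KiB threshold is there because a buffer that has to hold a 9K frame in one contiguous piece would need the next block size up from 8K.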

In short, you might have enough memory, but not enough of it contiguous. It's even worse on RGW nodes.

Warren Wang

On 1/23/18, 2:56 PM, "ceph-users on behalf of Samuel Taylor Liston" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of sam.liston@xxxxxxxx> wrote:

    We have a 9-node cluster (16 8TB OSDs per node) running Jewel on CentOS 7.4.  The OSDs are configured with encryption.  The cluster is accessed via two RGWs and there are 3 mon servers.  The data pool is using 6+3 erasure coding.

    About 2 weeks ago I found two of the nine servers wedged and had to hard power cycle them to get them back.  After this hard reboot, 22 OSDs came back with either corrupted encryption or corrupted data partitions.  These OSDs were removed and recreated, and the resultant rebalance moved along just fine for about a week.  At the end of that week two different nodes were unresponsive, complaining of page allocation failures.  This is when I realized the nodes were heavily into swap.  These nodes were configured with 64GB of RAM as a cost saving, going against the 1GB per 1TB recommendation.  We have since doubled the RAM in each of the nodes, giving each of them more than the 1GB per 1TB ratio.

    The issue I am running into is that these nodes are still swapping a lot and, over time, becoming unresponsive or throwing page allocation failures.  As an example, “free” will show 15GB of RAM usage (out of 128GB) and 32GB of swap.  I have configured swappiness to 0 and also turned up vm.min_free_kbytes to 4GB to try to keep the kernel happy, and yet I am still filling up swap.  It only occurs when the OSDs have mounted partitions and ceph-osd daemons active.

    Anyone have an idea where this swap usage might be coming from? 
    Thanks for any insight,

    Sam Liston (sam.liston@xxxxxxxx)
    ====================================
    Center for High Performance Computing
    155 S. 1452 E. Rm 405
    Salt Lake City, Utah 84112 (801)232-6932
    ====================================

    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
