CephFS log jam prevention

Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD-backed CephFS pool.

Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running a mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and clients.

$ ceph versions
{
    "mon": {
        "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 74
    },
    "mds": {
        "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 2
    },
    "overall": {
        "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 82
    }
}

HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing to respond to cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub flag(s) set; application not enabled on 1 pool(s); 242 slow requests are blocked > 32 sec; 769378 stuck requests are blocked > 4096 sec
MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
    mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by clients, 1 stray files
MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache pressure
    mdsdb(mds.0): Many clients (37) failing to respond to cache pressureclient_count: 37
MDS_TRIM 1 MDSs behind on trimming
    mdsdb(mds.0): Behind on trimming (36252/30)max_segments: 30, num_segments: 36252
OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
REQUEST_SLOW 242 slow requests are blocked > 32 sec
    236 ops are blocked > 2097.15 sec
    3 ops are blocked > 1048.58 sec
    2 ops are blocked > 524.288 sec
    1 ops are blocked > 32.768 sec
REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
    91 ops are blocked > 67108.9 sec
    121258 ops are blocked > 33554.4 sec
    308189 ops are blocked > 16777.2 sec
    251586 ops are blocked > 8388.61 sec
    88254 ops are blocked > 4194.3 sec
    osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
    osds 4,7,9,10,11,14,18,20 have stuck requests > 33554.4 sec
    osd.13 has stuck requests > 67108.9 sec
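
For what it's worth, the stuck ops themselves can be dumped from the admin socket of one of the flagged OSDs (osd.13 being the worst offender above). A rough sketch, run on the node hosting that OSD and assuming the default admin socket path:

# Ops currently in flight on osd.13, with the event each one is waiting on
$ ceph daemon osd.13 dump_ops_in_flight
# Recently completed slow ops, with per-event timestamps
$ ceph daemon osd.13 dump_historic_ops

The per-event timestamps should show whether these ops are sitting queued for the journal/filestore or waiting on sub-ops from peer OSDs.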

This is across 8 nodes, holding 3x 8TB HDDs each, all backed by Intel P3600 NVMe drives for journaling.
The SSD OSDs are omitted from the tree below for brevity.

$ ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                         STATUS REWEIGHT PRI-AFF
-13        87.28799 root ssd
 -1       174.51500 root default
-10       174.51500     rack default.rack2
-55        43.62000         chassis node2425
 -2        21.81000             host node24
  0   hdd   7.26999                 osd.0                 up  1.00000 1.00000
  8   hdd   7.26999                 osd.8                 up  1.00000 1.00000
 16   hdd   7.26999                 osd.16                up  1.00000 1.00000
 -3        21.81000             host node25
  1   hdd   7.26999                 osd.1                 up  1.00000 1.00000
  9   hdd   7.26999                 osd.9                 up  1.00000 1.00000
 17   hdd   7.26999                 osd.17                up  1.00000 1.00000
-56        43.63499         chassis node2627
 -4        21.81999             host node26
  2   hdd   7.27499                 osd.2                 up  1.00000 1.00000
 10   hdd   7.26999                 osd.10                up  1.00000 1.00000
 18   hdd   7.27499                 osd.18                up  1.00000 1.00000
 -5        21.81499             host node27
  3   hdd   7.26999                 osd.3                 up  1.00000 1.00000
 11   hdd   7.26999                 osd.11                up  1.00000 1.00000
 19   hdd   7.27499                 osd.19                up  1.00000 1.00000
-57        43.62999         chassis node2829
 -6        21.81499             host node28
  4   hdd   7.26999                 osd.4                 up  1.00000 1.00000
 12   hdd   7.26999                 osd.12                up  1.00000 1.00000
 20   hdd   7.27499                 osd.20                up  1.00000 1.00000
 -7        21.81499             host node29
  5   hdd   7.26999                 osd.5                 up  1.00000 1.00000
 13   hdd   7.26999                 osd.13                up  1.00000 1.00000
 21   hdd   7.27499                 osd.21                up  1.00000 1.00000
-58        43.62999         chassis node3031
 -8        21.81499             host node30
  6   hdd   7.26999                 osd.6                 up  1.00000 1.00000
 14   hdd   7.26999                 osd.14                up  1.00000 1.00000
 22   hdd   7.27499                 osd.22                up  1.00000 1.00000
 -9        21.81499             host node31
  7   hdd   7.26999                 osd.7                 up  1.00000 1.00000
 15   hdd   7.26999                 osd.15                up  1.00000 1.00000
 23   hdd   7.27499                 osd.23                up  1.00000 1.00000

I'm trying to figure out what in my configuration is off, because I'm told that CephFS should be able to throttle client requests to match the underlying storage medium rather than create such an extensive log jam.

[mds]
mds_cache_size = 0
mds_cache_memory_limit = 8589934592

[osd]
osd_op_threads = 4
filestore max sync interval = 30
osd_max_backfills = 10
osd_recovery_max_active = 16
osd_op_thread_suicide_timeout = 600
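
For completeness, the values the running daemons actually picked up can be double-checked over the admin socket. This has to run on the node hosting the daemon; osd.0 here is just an example:

# Confirm a single option on the running OSD
$ ceph daemon osd.0 config get osd_max_backfills
# Or dump the whole running config and filter for the options above
$ ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|filestore_max_sync_interval'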

I originally had mds_cache_size set to 10000000, carried over from Jewel, but read that it's better to zero that out and set the limit via mds_cache_memory_limit instead. So I set that to 8GB to see if it helped any.
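
As a sanity check, the limit the running MDS is actually using can be read back over its admin socket (mdsdb being the active MDS from the health output), and a new value can be injected without a restart, assuming the option is picked up at runtime:

# On the MDS host: confirm the running cache limit
$ ceph daemon mds.mdsdb config get mds_cache_memory_limit
# Inode/dentry/RSS counters live in the mds_mem section of the perf counters
$ ceph daemon mds.mdsdb perf dump
# Push a new value to the running MDS (may still need a restart if not observed)
$ ceph tell mds.mdsdb injectargs '--mds_cache_memory_limit=8589934592'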

Because, as far as I've seen, nothing earlier than roughly the 4.13 kernel ships a Luminous-capable CephFS kernel driver, everything here is using Jewel capabilities for CephFS.

$ ceph features
{
    "mon": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 3
        }
    },
    "mds": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 2
        }
    },
    "osd": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 74
        }
    },
    "client": {
        "group": {
            "features": "0x107b84a842aca",
            "release": "hammer",
            "num": 2
        },
        "group": {
            "features": "0x40107b86a842ada",
            "release": "jewel",
            "num": 39
        },
        "group": {
            "features": "0x7010fb86aa42ada",
            "release": "jewel",
            "num": 1
        },
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 189
        }
    }
}
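
If it helps to narrow down which hosts the hammer/jewel entries correspond to (and which clients are sitting on the most caps during the rsync), the sessions on the active MDS can be listed; again a sketch over the admin socket on the MDS host:

# Per-client session info: address, client metadata, and cap counts
$ ceph daemon mds.mdsdb session ls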

Any help is appreciated.

Thanks,

Reed

