MDS cache is too large and crashes

Sake Ceph <ceph@xxxxxxxxxxx> · Fri, 21 Jul 2023 09:42:17 +0200 (CEST)

At 01:27 this morning I received the first alert email reporting that the MDS cache is too large (the mailer runs every 15 minutes and sends a mail if something has happened). Looking into it, it was once again a standby-replay host that had stopped working.

At 01:00 a few rsync processes start in parallel on a client machine. They copy data from an NFS share to a CephFS share to sync the latest changes (we want to switch to CephFS in the near future).

The standby-replay MDS has crashed like this a couple of times now, so I think it would be good to get some help. Where should I look next?

Some CephFS information
----------------------------------
# ceph fs status
atlassian-opl - 8 clients
=============
RANK      STATE                     MDS                    ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      atlassian-opl.mds5.zsxfep  Reqs:    0 /s  7830   7803    635   3706
0-s   standby-replay  atlassian-opl.mds6.svvuii  Evts:    0 /s  3139   1924    461      0
           POOL              TYPE     USED  AVAIL
cephfs.atlassian-opl.meta  metadata  2186M  1161G
cephfs.atlassian-opl.data    data    23.0G  1161G
atlassian-prod - 12 clients
==============
RANK      STATE                      MDS                    ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      atlassian-prod.mds1.msydxf  Reqs:    0 /s  2703k  2703k   905k  1585
 1        active      atlassian-prod.mds2.oappgu  Reqs:    0 /s   961k   961k   317k   622
 2        active      atlassian-prod.mds3.yvkjsi  Reqs:    0 /s  2083k  2083k   670k   443
0-s   standby-replay  atlassian-prod.mds4.qlvypn  Evts:    0 /s   352k   352k   102k     0
1-s   standby-replay  atlassian-prod.mds5.egsdfl  Evts:    0 /s   873k   873k   277k     0
2-s   standby-replay  atlassian-prod.mds6.ghonso  Evts:    0 /s  2317k  2316k   679k     0
           POOL               TYPE     USED  AVAIL
cephfs.atlassian-prod.meta  metadata  58.8G  1161G
cephfs.atlassian-prod.data    data    5492G  1161G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

When looking at the log on the MDS server, I've got the following:
2023-07-21T01:21:01.942+0000 7f668a5e0700 -1 received  signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2023-07-21T01:23:13.856+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5671 from mon.1
2023-07-21T01:23:18.369+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5672 from mon.1
2023-07-21T01:23:31.719+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5673 from mon.1
2023-07-21T01:23:35.769+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5674 from mon.1
2023-07-21T01:28:23.764+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5675 from mon.1
2023-07-21T01:29:13.657+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5676 from mon.1
2023-07-21T01:33:43.886+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5677 from mon.1
(and another 20 lines about updating MDS map)
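
To see how quickly the MDS map was changing, the "Updating MDS map" lines can be reduced to a list of versions with the time elapsed since the previous update. The following is only a minimal Python sketch: the helper name `map_updates`, the regular expression, and the timestamp format are assumptions derived from the log lines quoted above, not part of any Ceph tooling.

```python
import re
from datetime import datetime

# Timestamp layout assumed from the quoted log, e.g. 2023-07-21T01:23:13.856+0000
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

# Assumed field order: timestamp, thread id, log level, daemon name, message.
MAP_RE = re.compile(
    r"^(\S+) +\S+ +-?\d+ +\S+ +Updating MDS map to version (\d+)",
    re.MULTILINE,
)

def map_updates(log_text):
    """Return [(map_version, timestamp)] for every 'Updating MDS map' line."""
    return [
        (int(m.group(2)), datetime.strptime(m.group(1), TS_FORMAT))
        for m in MAP_RE.finditer(log_text)
    ]

# Three lines copied from the log excerpt above.
sample_log = """\
2023-07-21T01:23:13.856+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5671 from mon.1
2023-07-21T01:23:18.369+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5672 from mon.1
2023-07-21T01:23:31.719+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5673 from mon.1
"""

updates = map_updates(sample_log)
# Pair each update with its predecessor and print the gap between them.
for (_, prev_ts), (version, ts) in zip(updates, updates[1:]):
    gap = (ts - prev_ts).total_seconds()
    print(f"map version {version}: {gap:.3f} s after the previous update")
```

Feeding the full MDS log into `map_updates` would show whether the map updates cluster around the time of the cache warnings.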

Alert mailings:
Mail at 01:27
----------------------------------
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files

=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files

Mail at 03:27
----------------------------------
HEALTH_OK

--- Cleared ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (14GB/9GB); 0 inodes in use by clients, 0 stray files

=== Full health status ===

Mail at 04:12
----------------------------------
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files

=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files
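
To follow how far the cache overshoots its limit from one mail to the next, the warning lines can be parsed and compared. This is a minimal Python sketch, not a Ceph utility: the function name `parse_oversized`, the regular expression, and the unit table are assumptions based solely on the message format shown in the mails above.

```python
import re

# Unit multipliers; treating the suffixes as binary multiples is an assumption.
# The used/limit ratio does not depend on it when both values share a unit.
UNIT = {"B": 1, "kB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}

# Pattern assumed from the quoted health message:
#   mds.<name>(mds.<rank>): MDS cache is too large (<used>/<limit>); ...
OVERSIZED_RE = re.compile(
    r"(mds\.\S+?)\(mds\.(\d+)\): MDS cache is too large "
    r"\((\d+)(B|kB|MB|GB|TB)/(\d+)(B|kB|MB|GB|TB)\)"
)

def parse_oversized(text):
    """Return (daemon, rank, used_bytes, limit_bytes) for each warning line."""
    results = []
    for m in OVERSIZED_RE.finditer(text):
        daemon, rank, used, used_unit, limit, limit_unit = m.groups()
        results.append(
            (daemon, int(rank), int(used) * UNIT[used_unit], int(limit) * UNIT[limit_unit])
        )
    return results

# Warning lines copied from the 01:27 and 04:12 mails above.
sample = """\
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files
"""

for daemon, rank, used, limit in parse_oversized(sample):
    print(f"{daemon} (rank {rank}): cache at {used / limit:.0%} of its limit")
```

For the two sample lines this reports 144% and 167% of the limit, so the standby-replay cache kept growing between the 01:27 and 04:12 mails.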

Best regards,
Sake
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx