Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

Hi,
We have upgraded one Ceph cluster from 17.2.7 to 18.2.0. Since then we have been having CephFS issues.
For example this morning:
“””
[root@naret-monitor01 ~]# ceph -s
  cluster:
    id:     63334166-d991-11eb-99de-40a6b72108d0
    health: HEALTH_WARN
            1 filesystem is degraded
            3 clients failing to advance oldest client/flush tid
            3 MDSs report slow requests
            6 pgs not scrubbed in time
            29 daemons have recently crashed
…
“””

The ceph orch, ceph crash, and ceph fs status commands were hanging.

After a “ceph mgr fail” those commands started responding again.
I then noticed that one MDS was reporting most of the slow operations:

“””
[WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
    mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
    mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
    mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
“””
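
(In case it helps, I can dump the blocked operations on that MDS with something like the commands below; this is only a sketch, the daemon name is the one from the warning above, and under cephadm the second form needs to run inside a cephadm shell on the daemon's host.)

“””
# Dump the in-flight / blocked ops on the slow MDS (mds.1)
ceph tell mds.cephfs.naret-monitor01.uvevbf ops
# Or via the admin socket on the host running that daemon
ceph daemon mds.cephfs.naret-monitor01.uvevbf dump_ops_in_flight
“””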

Then I tried to restart it with:

“””
[root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
“””

After that, the CephFS entered this state:
“””
[root@naret-monitor01 ~]# ceph fs status
cephfs - 198 clients
======
RANK     STATE                   MDS                  ACTIVITY     DNS    INOS   DIRS   CAPS
0       active     cephfs.naret-monitor01.nuakzo  Reqs:    0 /s  17.2k  16.2k  1892   14.3k
1       active     cephfs.naret-monitor02.ztdghf  Reqs:    0 /s  28.1k  10.3k   752   6881
2    clientreplay  cephfs.naret-monitor02.exceuo                 63.0k  6491    541     66
3       active     cephfs.naret-monitor03.lqppte  Reqs:    0 /s  16.7k  13.4k  8233    990
          POOL              TYPE     USED  AVAIL
   cephfs.cephfs.meta     metadata  5888M  18.5T
   cephfs.cephfs.data       data     119G   215T
cephfs.cephfs.data.e_4_2    data    2289G  3241T
cephfs.cephfs.data.e_8_3    data    9997G   470T
         STANDBY MDS
cephfs.naret-monitor03.eflouf
cephfs.naret-monitor01.uvevbf
MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
“””

The file system is completely unresponsive (we can mount it on client nodes, but any operation, even a simple ls, hangs).
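
Since rank 2 is stuck in clientreplay, what I am considering next is to look at the sessions that rank is still waiting on, and possibly evict the stuck clients, roughly like this (a sketch only, I have not run the evict yet; the client id is just a placeholder):

“””
# List the client sessions held by the stuck rank (mds.2)
ceph tell mds.cephfs.naret-monitor02.exceuo session ls
# If one client is the one failing to advance its oldest client/flush tid,
# evicting it might unblock replay (the id below is a placeholder)
ceph tell mds.cephfs.naret-monitor02.exceuo client evict id=12345
“””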

During the night we had a lot of MDS crashes; I can share the crash reports if that helps.
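
For the crash reports, I am collecting them with the standard ceph crash commands (the crash id below is a placeholder):

“””
# List crashes that have not been archived yet, then dump one in full
ceph crash ls-new
ceph crash info <crash-id>
“””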

Does anybody have an idea on how to tackle this problem?

Best,

Giuseppe



