Hi Giuseppe,

There are likely one or two clients whose op is blocking the reconnect/replay. If you increase debug_mds you can perhaps find the guilty client and disconnect it / block it from mounting. Or, for a more disruptive recovery, you can try the "deny all reconnect to clients" option:
https://docs.ceph.com/en/reef/cephfs/troubleshooting/#avoiding-recovery-roadblocks

Hope that helps,

Dan

On Mon, Nov 27, 2023 at 1:08 AM Lo Re Giuseppe <giuseppe.lore@xxxxxxx> wrote:
>
> Hi,
>
> We have upgraded one Ceph cluster from 17.2.7 to 18.2.0, and since then we have been having CephFS issues.
> For example, this morning:
>
> """
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             3 clients failing to advance oldest client/flush tid
>             3 MDSs report slow requests
>             6 pgs not scrubbed in time
>             29 daemons have recently crashed
> ...
> """
>
> The "ceph orch", "ceph crash" and "ceph fs status" commands were hanging.
>
> After a "ceph mgr fail" those commands started to respond.
> Then I noticed that one MDS had most of the slow operations:
>
> """
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>     mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> """
>
> So I tried to restart it with:
>
> """
> [root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
> """
>
> After that, the CephFS entered this state:
>
> """
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ======
> RANK     STATE                    MDS                   ACTIVITY     DNS    INOS   DIRS   CAPS
>  0       active         cephfs.naret-monitor01.nuakzo  Reqs: 0 /s   17.2k  16.2k  1892   14.3k
>  1       active         cephfs.naret-monitor02.ztdghf  Reqs: 0 /s   28.1k  10.3k   752   6881
>  2       clientreplay   cephfs.naret-monitor02.exceuo               63.0k   6491   541     66
>  3       active         cephfs.naret-monitor03.lqppte  Reqs: 0 /s   16.7k  13.4k  8233    990
>           POOL              TYPE      USED   AVAIL
>   cephfs.cephfs.meta       metadata   5888M  18.5T
>   cephfs.cephfs.data       data        119G   215T
> cephfs.cephfs.data.e_4_2   data       2289G  3241T
> cephfs.cephfs.data.e_8_3   data       9997G   470T
> STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> """
>
> The file system is totally unresponsive: we can mount it on client nodes, but any operation, even a simple "ls", hangs.
>
> During the night we had a lot of MDS crashes; I can share their content.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
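
[Editor's note] The approach Dan describes can be sketched as a sequence of Ceph CLI steps. This is a hedged sketch, not a verified recovery recipe: the rank number (`mds.1`, the rank with 1683 slow requests in the thread) and the client ID `1234` are placeholders, and the debug level and config options should be reverted once recovery completes.

```shell
# Raise MDS debug logging to help identify the client whose op is
# blocking reconnect/replay (revert to the default afterwards).
ceph config set mds debug_mds 10

# Inspect in-flight ops and connected sessions on the stuck rank
# to find the guilty client.
ceph tell mds.1 ops
ceph tell mds.1 session ls

# Evict the offending client (1234 is a placeholder session/client ID).
ceph tell mds.1 client evict id=1234

# More disruptive alternative from the linked troubleshooting page:
# deny all client reconnects during MDS recovery, then revert.
ceph config set mds mds_deny_all_reconnect true
# ... let the MDS finish recovery, then:
ceph config set mds mds_deny_all_reconnect false
```

Evicted clients may be blocklisted and need to remount; see the linked "avoiding recovery roadblocks" section before using the deny-all-reconnect option.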