Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

Hi Giuseppe,

Could it be that you have clients heavily loading the MDS with concurrent
access to the same trees?
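One way to check for that (a sketch against a live cluster; the rank number is an example, substitute the rank reporting the slow requests) is to dump the in-flight operations on the busy rank and look for many requests touching the same directory trees:

```shell
# Dump the operations currently in flight on MDS rank 1; many ops
# with paths under the same subtree point at a hot directory tree.
ceph tell mds.1 dump_ops_in_flight

# List the client sessions on that rank; a few sessions with very high
# request counts suggest concurrent clients hammering shared trees.
ceph tell mds.1 session ls
```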

It may also be worth checking the stability of all your clients (even if
there are many), e.g. with dmesg -T.
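Concretely, something like this on each client and on the cluster (a sketch, assuming kernel CephFS clients):

```shell
# On each client: kernel log with human-readable timestamps; look for
# libceph messages about lost MDS connections or hung tasks.
dmesg -T | grep -iE 'libceph|ceph|hung task'

# On the cluster: health detail names the client sessions behind the
# "failing to advance oldest client/flush tid" warning.
ceph health detail
```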

How are your 4 active MDS daemons configured (any pinning)?
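For reference, explicit subtree pinning is set via an extended attribute on a directory (a sketch; /mnt/cephfs/projects is a hypothetical mount path):

```shell
# Pin this directory tree to MDS rank 2 (use -v -1 to unpin)
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/projects

# Verify which rank the tree is pinned to
getfattr -n ceph.dir.pin /mnt/cephfs/projects
```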

Probably nothing to do with it, but is it normal that 2 MDS daemons run on
the same host "monitor-02"?
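The placement can be checked, and spread over distinct hosts if needed, through the orchestrator (a sketch; the daemon count and host names below are examples, adapt them to your service spec):

```shell
# Where are the MDS daemons currently running?
ceph orch ps --daemon-type mds

# Respread the mds service for filesystem "cephfs" across named hosts
ceph orch apply mds cephfs \
    --placement "6 naret-monitor01 naret-monitor02 naret-monitor03"
```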

________________________________________________________

Regards,

*David CASIER*

________________________________________________________



On Mon, 27 Nov 2023 at 10:09, Lo Re Giuseppe <giuseppe.lore@xxxxxxx>
wrote:

> Hi,
> We have upgraded one ceph cluster from 17.2.7 to 18.2.0. Since then we are
> having CephFS issues.
> For example this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             3 clients failing to advance oldest client/flush tid
>             3 MDSs report slow requests
>             6 pgs not scrubbed in time
>             29 daemons have recently crashed
>
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I have noticed that there was one mds with most of the slow
> operations,
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>     mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> “””
>
> Then I tried to restart it with
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
> “””
>
> After the cephfs entered into this situation:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ======
> RANK     STATE                   MDS                 ACTIVITY     DNS    INOS   DIRS   CAPS
>  0       active      cephfs.naret-monitor01.nuakzo  Reqs: 0 /s   17.2k  16.2k  1892   14.3k
>  1       active      cephfs.naret-monitor02.ztdghf  Reqs: 0 /s   28.1k  10.3k   752    6881
>  2    clientreplay   cephfs.naret-monitor02.exceuo               63.0k   6491   541      66
>  3       active      cephfs.naret-monitor03.lqppte  Reqs: 0 /s   16.7k  13.4k  8233     990
>           POOL              TYPE     USED  AVAIL
>    cephfs.cephfs.meta     metadata  5888M  18.5T
>    cephfs.cephfs.data       data     119G   215T
> cephfs.cephfs.data.e_4_2    data    2289G  3241T
> cephfs.cephfs.data.e_8_3    data    9997G   470T
>          STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> “””
>
> The file system is totally unresponsive (we can mount it on client nodes,
> but any operation, even a simple ls, hangs).
>
> During the night we had a lot of MDS crashes; I can share the crash reports.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>



