Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

Hi David,
Thanks a lot for your reply.
Yes, we have heavy client load on the same subtree. We have multiple MDSs that were set up in the hope of distributing the load among them, but that is not really happening: in moments of high load, most of it lands on a single MDS.
We don't use pinning today.
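In case it is useful, here is a minimal sketch of what explicit subtree pinning might look like on our side; the directory paths below are placeholders, not our real tree. Setting the ceph.dir.pin xattr from a client mount binds a subtree to a given MDS rank:

“””
# pin two top-level directories to different MDS ranks (run on a client mount)
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projectA
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projectB

# -v -1 removes the pin, the subtree goes back to the default balancer
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/projectA
“””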
We have placed more than one MDS on the same servers, since we noticed that memory consumption allowed it.
Right now I have scaled the MDS service down to a single active rank, as I have learnt that using multiple active MDSs may have been a risky move, although it worked reasonably well until we upgraded from 17.2.5 to 17.2.7 and now to 18.2.0.
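For reference, this is roughly what I ran to scale down (a sketch; cephfs is our file system name):

“””
# reduce the number of active MDS ranks to one
ceph fs set cephfs max_mds 1

# check that the extra ranks have stopped and only rank 0 stays active
ceph fs get cephfs | grep max_mds
ceph fs status cephfs
“””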
I'll look more into client stability, as per your suggestion.

Best,

Giuseppe

On 27.11.2023, 10:41, "David C." <david.casier@xxxxxxxx> wrote:


Hi Giuseppe,


Could it be that you have clients that put heavy load on the MDS with concurrent
access to the same trees?


Perhaps also look at the stability of all your clients (even if there are
many) [dmesg -T, ...].
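For example, something along these lines on the kernel clients (a rough sketch; the grep pattern and the debugfs path may need adjusting to your setup):

“””
# look for ceph/libceph errors, MDS reconnects, hung tasks on each client
dmesg -T | grep -iE 'ceph|libceph|hung task'

# on kernel clients with debugfs mounted, pending MDS requests are visible here
cat /sys/kernel/debug/ceph/*/mdsc
“””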


How are your 4 active MDSs configured (any pinning)?


Probably unrelated, but is it intentional that 2 MDSs are on the same host
"monitor-02"?


________________________________________________________


Kind regards,


*David CASIER*


________________________________________________________






On Mon, 27 Nov 2023 at 10:09, Lo Re Giuseppe <giuseppe.lore@xxxxxxx> wrote:


> Hi,
> We have upgraded one of our Ceph clusters from 17.2.7 to 18.2.0. Since then
> we have been having CephFS issues.
> For example this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             3 clients failing to advance oldest client/flush tid
>             3 MDSs report slow requests
>             6 pgs not scrubbed in time
>             29 daemons have recently crashed
>
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I noticed that one MDS had most of the slow operations:
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>     mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> “””
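> For reference, the blocked operations can also be dumped from the affected MDS
> itself; a sketch (run on the host where that daemon lives, e.g. via cephadm
> shell on naret-monitor01; dump_blocked_ops may not exist in every version):
>
> “””
> # list the operations currently in flight / blocked on this MDS
> ceph daemon mds.cephfs.naret-monitor01.uvevbf dump_ops_in_flight
> ceph daemon mds.cephfs.naret-monitor01.uvevbf dump_blocked_ops
> “””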
>
> Then I tried to restart it with
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
> “””
>
> After that, CephFS entered this state:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ======
> RANK      STATE                   MDS                    ACTIVITY     DNS    INOS   DIRS   CAPS
>  0        active       cephfs.naret-monitor01.nuakzo  Reqs:    0 /s  17.2k  16.2k  1892   14.3k
>  1        active       cephfs.naret-monitor02.ztdghf  Reqs:    0 /s  28.1k  10.3k   752   6881
>  2    clientreplay     cephfs.naret-monitor02.exceuo                 63.0k  6491    541     66
>  3        active       cephfs.naret-monitor03.lqppte  Reqs:    0 /s  16.7k  13.4k  8233    990
>            POOL              TYPE      USED   AVAIL
>    cephfs.cephfs.meta      metadata   5888M   18.5T
>    cephfs.cephfs.data        data      119G    215T
> cephfs.cephfs.data.e_4_2     data     2289G   3241T
> cephfs.cephfs.data.e_8_3     data     9997G    470T
>         STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> “””
>
> The file system is completely unresponsive (we can mount it on client nodes,
> but any operation, even a simple ls, hangs).
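> The warning about clients failing to advance the oldest client/flush tid at
> least points at candidate clients to investigate; roughly (a sketch, the
> client id below is a placeholder):
>
> “””
> # MDS_CLIENT_OLDEST_TID details show the offending client ids
> ceph health detail
>
> # list sessions on the stuck rank to map ids to hostnames/mounts
> ceph tell mds.2 client ls
>
> # last resort: evict a misbehaving client (placeholder id)
> ceph tell mds.2 client evict id=1234567
> “””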
>
> During the night we had a lot of MDS crashes; I can share their content.
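> (They are collected by the crash module; something like the following lists
> the new reports and prints a single backtrace, the crash id being a
> placeholder:)
>
> “””
> ceph crash ls-new
> ceph crash info <crash-id>
> “””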
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



