Hi Giuseppe,

Could you have clients that put a heavy load on the MDS through concurrent access to the same trees?

It may also be worth checking the stability of all your clients, even if there are many of them (dmesg -T, ...).

How are your 4 active MDSs configured (any pinning?)?

Probably nothing to act on, but is it normal that 2 MDSs sit on the same host, "monitor-02"?

(A few rough command sketches follow the quoted message below.)
________________________________________________________

Regards,

*David CASIER*
________________________________________________________

On Mon, Nov 27, 2023 at 10:09, Lo Re Giuseppe <giuseppe.lore@xxxxxxx> wrote:

> Hi,
> We have upgraded one Ceph cluster from 17.2.7 to 18.2.0. Since then we
> have been having CephFS issues.
> For example, this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             3 clients failing to advance oldest client/flush tid
>             3 MDSs report slow requests
>             6 pgs not scrubbed in time
>             29 daemons have recently crashed
> …
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I noticed that one MDS had most of the slow operations:
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>     mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> “””
>
> Then I tried to restart it with:
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
> “””
>
> After that, the CephFS entered this situation:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ======
> RANK  STATE         MDS                            ACTIVITY    DNS    INOS   DIRS   CAPS
>  0    active        cephfs.naret-monitor01.nuakzo  Reqs: 0 /s  17.2k  16.2k  1892   14.3k
>  1    active        cephfs.naret-monitor02.ztdghf  Reqs: 0 /s  28.1k  10.3k   752   6881
>  2    clientreplay  cephfs.naret-monitor02.exceuo              63.0k  6491    541     66
>  3    active        cephfs.naret-monitor03.lqppte  Reqs: 0 /s  16.7k  13.4k  8233    990
>           POOL            TYPE      USED   AVAIL
> cephfs.cephfs.meta        metadata  5888M  18.5T
> cephfs.cephfs.data        data       119G   215T
> cephfs.cephfs.data.e_4_2  data      2289G  3241T
> cephfs.cephfs.data.e_8_3  data      9997G   470T
> STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> “””
>
> The file system is totally unresponsive: we can mount it on client
> nodes, but any operation, even a simple ls, hangs.
>
> During the night we had a lot of MDS crashes; I can share the crash reports.
>
> Does anybody have any ideas on how to tackle this problem?
>
> Best,
>
> Giuseppe
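To make the client-stability check concrete, here is a minimal sketch; the hostnames are placeholders and the grep pattern is only a starting point:

“””
# Scan the kernel log of each CephFS client for signs of trouble
# (session resets, blocklisting, hung tasks).
# client01..client03 are placeholder hostnames.
for host in client01 client02 client03; do
    echo "== ${host} =="
    ssh "${host}" "dmesg -T | grep -Ei 'ceph|libceph|hung task|blocklist'" | tail -n 20
done
“””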
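On the pinning question: a quick way to see whether a tree carries an explicit pin, or to pin a hot tree to a single rank so the balancer stops migrating it, is the ceph.dir.pin extended attribute. A sketch; /mnt/cephfs/hot-tree is a placeholder path on a node where the filesystem is mounted:

“””
# Inspect an existing pin on a directory (errors out if no pin is set).
getfattr -n ceph.dir.pin /mnt/cephfs/hot-tree

# Pin the tree to rank 1.
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/hot-tree

# A value of -1 removes the pin again.
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/hot-tree
“””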
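On the two-MDSs-on-one-host point: with cephadm you can at least control where the daemons run through a service spec. Note this only places the daemons; the monitors still decide which of them hold the active ranks. A sketch, assuming the three hosts from the thread and 4 active + 2 standby daemons:

“””
# Hypothetical cephadm service spec spreading 6 MDS daemons over
# the three hosts named in the thread.
cat > mds-spec.yaml <<'EOF'
service_type: mds
service_id: cephfs
placement:
  count: 6
  hosts:
    - naret-monitor01
    - naret-monitor02
    - naret-monitor03
EOF
ceph orch apply -i mds-spec.yaml
“””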
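On Giuseppe's side: before restarting an MDS that reports thousands of slow requests, it is usually worth dumping what those requests are stuck on. A sketch against the daemon named in the thread; with a cephadm deployment the "ceph daemon" form has to run on that daemon's host (e.g. inside "cephadm shell"):

“””
# Dump in-flight and blocked operations on the rank that reported
# 1683 slow requests; run on naret-monitor01, where that daemon lives.
ceph daemon mds.cephfs.naret-monitor01.uvevbf dump_ops_in_flight
ceph daemon mds.cephfs.naret-monitor01.uvevbf dump_blocked_ops
“””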
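For the "3 clients failing to advance oldest client/flush tid" warning, ceph health detail names the offending clients, and the sessions can be listed per rank. Eviction is a last resort, since it blocklists the client and forces a remount; the session id below is a placeholder:

“””
# Identify the clients behind the oldest client/flush tid warning.
ceph health detail
ceph tell mds.cephfs:0 session ls

# Last resort: evict a stuck session (blocklists the client).
ceph tell mds.cephfs:0 client evict id=12345
“””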
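And for the "29 daemons have recently crashed" warning, the crash module shows which daemons died and with what backtrace; sharing one of those backtraces here would help a lot:

“””
# List crashes that have not been triaged yet, inspect one, and
# archive them once understood so the health warning clears.
ceph crash ls-new
ceph crash info <crash-id>   # <crash-id> copied from the output above
ceph crash archive-all
“””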