Hi Giuseppe,

There are likely one or two clients whose op is blocking the reconnect/replay. If you increase debug_mds you can perhaps find the guilty client and disconnect it / block it from mounting. Or, for a more disruptive recovery, you can try the "deny all reconnect to clients" option:
https://docs.ceph.com/en/reef/cephfs/troubleshooting/#avoiding-recovery-roadblocks

Hope that helps,

Dan

On Mon, Nov 27, 2023 at 1:08 AM Lo Re Giuseppe <giuseppe.lore@xxxxxxx> wrote:
>
> Hi,
>
> We have upgraded one Ceph cluster from 17.2.7 to 18.2.0, and since then we have been having CephFS issues.
> For example, this morning:
>
> """
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             3 clients failing to advance oldest client/flush tid
>             3 MDSs report slow requests
>             6 pgs not scrubbed in time
>             29 daemons have recently crashed
> ...
> """
>
> The "ceph orch", "ceph crash" and "ceph fs status" commands were hanging.
>
> After a "ceph mgr fail" those commands started to respond.
> Then I noticed that one MDS had most of the slow operations:
>
> """
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>     mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> """
>
> So I tried to restart it with:
>
> """
> [root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
> """
>
> After that, the CephFS entered this state:
>
> """
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ======
> RANK     STATE                    MDS                   ACTIVITY     DNS    INOS   DIRS   CAPS
>  0       active         cephfs.naret-monitor01.nuakzo  Reqs: 0 /s   17.2k  16.2k  1892   14.3k
>  1       active         cephfs.naret-monitor02.ztdghf  Reqs: 0 /s   28.1k  10.3k   752   6881
>  2       clientreplay   cephfs.naret-monitor02.exceuo               63.0k   6491   541     66
>  3       active         cephfs.naret-monitor03.lqppte  Reqs: 0 /s   16.7k  13.4k  8233    990
>           POOL              TYPE      USED   AVAIL
>   cephfs.cephfs.meta       metadata   5888M  18.5T
>   cephfs.cephfs.data       data        119G   215T
> cephfs.cephfs.data.e_4_2   data       2289G  3241T
> cephfs.cephfs.data.e_8_3   data       9997G   470T
> STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> """
>
> The file system is totally unresponsive: we can mount it on client nodes, but any operation, even a simple "ls", hangs.
>
> During the night we had a lot of MDS crashes; I can share their content.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
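
[Editor's note] The approach Dan describes can be sketched as a sequence of Ceph CLI steps. This is a hedged sketch, not a verified recovery recipe: the rank number (`mds.1`, the rank with 1683 slow requests in the thread) and the client ID `1234` are placeholders, and the debug level and config options should be reverted once recovery completes.

```shell
# Raise MDS debug logging to help identify the client whose op is
# blocking reconnect/replay (revert to the default afterwards).
ceph config set mds debug_mds 10

# Inspect in-flight ops and connected sessions on the stuck rank
# to find the guilty client.
ceph tell mds.1 ops
ceph tell mds.1 session ls

# Evict the offending client (1234 is a placeholder session/client ID).
ceph tell mds.1 client evict id=1234

# More disruptive alternative from the linked troubleshooting page:
# deny all client reconnects during MDS recovery, then revert.
ceph config set mds mds_deny_all_reconnect true
# ... let the MDS finish recovery, then:
ceph config set mds mds_deny_all_reconnect false
```

Evicted clients may be blocklisted and need to remount; see the linked "avoiding recovery roadblocks" section before using the deny-all-reconnect option.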