Hi Alessandro,

What is the state of your PGs? Inactive PGs have blocked CephFS recovery
on our cluster before. I'd try to clear any blocked ops and see if the
MDSes recover.

--Lincoln

On Mon, 2018-01-08 at 17:21 +0100, Alessandro De Salvo wrote:
> Hi,
>
> I'm running Ceph Luminous 12.2.2 and my CephFS suddenly degraded.
>
> I have 2 active MDS instances and 1 standby. All the active instances
> are now in replay state and show the same error in the logs:
>
> ---- mds1 ----
>
> 2018-01-08 16:04:15.765637 7fc2e92451c0  0 ceph version 12.2.2
> (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process
> (unknown), pid 164
> starting mds.mds1 at -
> 2018-01-08 16:04:15.785849 7fc2e92451c0  0 pidfile_write: ignore empty
> --pid-file
> 2018-01-08 16:04:20.168178 7fc2e1ee1700  1 mds.mds1 handle_mds_map standby
> 2018-01-08 16:04:20.278424 7fc2e1ee1700  1 mds.1.20635 handle_mds_map i
> am now mds.1.20635
> 2018-01-08 16:04:20.278432 7fc2e1ee1700  1 mds.1.20635 handle_mds_map
> state change up:boot --> up:replay
> 2018-01-08 16:04:20.278443 7fc2e1ee1700  1 mds.1.20635 replay_start
> 2018-01-08 16:04:20.278449 7fc2e1ee1700  1 mds.1.20635 recovery set is 0
> 2018-01-08 16:04:20.278458 7fc2e1ee1700  1 mds.1.20635 waiting for
> osdmap 21467 (which blacklists prior instance)
>
> ---- mds2 ----
>
> 2018-01-08 16:04:16.870459 7fd8456201c0  0 ceph version 12.2.2
> (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process
> (unknown), pid 295
> starting mds.mds2 at -
> 2018-01-08 16:04:16.881616 7fd8456201c0  0 pidfile_write: ignore empty
> --pid-file
> 2018-01-08 16:04:21.274543 7fd83e2bc700  1 mds.mds2 handle_mds_map standby
> 2018-01-08 16:04:21.314438 7fd83e2bc700  1 mds.0.20637 handle_mds_map i
> am now mds.0.20637
> 2018-01-08 16:04:21.314459 7fd83e2bc700  1 mds.0.20637 handle_mds_map
> state change up:boot --> up:replay
> 2018-01-08 16:04:21.314479 7fd83e2bc700  1 mds.0.20637 replay_start
> 2018-01-08 16:04:21.314492 7fd83e2bc700  1 mds.0.20637 recovery set is 1
> 2018-01-08 16:04:21.314517 7fd83e2bc700  1 mds.0.20637 waiting for
> osdmap 21467 (which blacklists prior instance)
> 2018-01-08 16:04:21.393307 7fd837aaf700  0 mds.0.cache creating system
> inode with ino:0x100
> 2018-01-08 16:04:21.397246 7fd837aaf700  0 mds.0.cache creating system
> inode with ino:0x1
>
> The cluster is recovering as we are replacing some of the OSDs, and
> there are a few slow/stuck requests, but I'm not sure whether that is
> the cause, as there is apparently no data loss (so far).
>
> How can I force the MDSes to quit the replay state?
>
> Thanks for any help,
>
> Alessandro
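
In case it helps, here's roughly what I'd run to look for inactive PGs
and blocked requests. This is from memory against Luminous-era syntax,
and the <id> / <metadata-pool> bits are placeholders for your own OSD
IDs and CephFS metadata pool name, so adjust as needed:

  # Overall health, PG states and any slow/blocked request warnings
  ceph health detail
  ceph status

  # PGs that are not active or stale (these can stall MDS journal replay)
  ceph pg dump_stuck inactive
  ceph pg dump_stuck stale

  # Check whether any of the stuck PGs belong to the CephFS metadata pool
  ceph pg ls-by-pool <metadata-pool>

  # Which OSDs are holding up peering
  ceph osd blocked-by

  # Blocked ops on a suspect OSD (run on the host where osd.<id> lives)
  ceph daemon osd.<id> dump_blocked_ops

  # If blocked ops don't clear on their own, restarting the OSD holding
  # them usually kicks peering along (assumes a systemd deployment)
  systemctl restart ceph-osd@<id>

If the metadata pool PGs come back active, the MDSes should work through
replay on their own rather than needing to be forced out of it.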