Hi Alessandro,

What is the state of your PGs? Inactive PGs have blocked CephFS recovery
on our cluster before. I'd try to clear any blocked ops and see if the
MDSes recover.

--Lincoln

On Mon, 2018-01-08 at 17:21 +0100, Alessandro De Salvo wrote:
> Hi,
>
> I'm running Ceph Luminous 12.2.2 and my CephFS suddenly degraded.
>
> I have 2 active MDS instances and 1 standby. All the active instances
> are now in replay state and show the same error in the logs:
>
> ---- mds1 ----
>
> 2018-01-08 16:04:15.765637 7fc2e92451c0  0 ceph version 12.2.2
> (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process
> (unknown), pid 164
> starting mds.mds1 at -
> 2018-01-08 16:04:15.785849 7fc2e92451c0  0 pidfile_write: ignore empty
> --pid-file
> 2018-01-08 16:04:20.168178 7fc2e1ee1700  1 mds.mds1 handle_mds_map standby
> 2018-01-08 16:04:20.278424 7fc2e1ee1700  1 mds.1.20635 handle_mds_map i
> am now mds.1.20635
> 2018-01-08 16:04:20.278432 7fc2e1ee1700  1 mds.1.20635 handle_mds_map
> state change up:boot --> up:replay
> 2018-01-08 16:04:20.278443 7fc2e1ee1700  1 mds.1.20635 replay_start
> 2018-01-08 16:04:20.278449 7fc2e1ee1700  1 mds.1.20635 recovery set is 0
> 2018-01-08 16:04:20.278458 7fc2e1ee1700  1 mds.1.20635 waiting for
> osdmap 21467 (which blacklists prior instance)
>
> ---- mds2 ----
>
> 2018-01-08 16:04:16.870459 7fd8456201c0  0 ceph version 12.2.2
> (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process
> (unknown), pid 295
> starting mds.mds2 at -
> 2018-01-08 16:04:16.881616 7fd8456201c0  0 pidfile_write: ignore empty
> --pid-file
> 2018-01-08 16:04:21.274543 7fd83e2bc700  1 mds.mds2 handle_mds_map standby
> 2018-01-08 16:04:21.314438 7fd83e2bc700  1 mds.0.20637 handle_mds_map i
> am now mds.0.20637
> 2018-01-08 16:04:21.314459 7fd83e2bc700  1 mds.0.20637 handle_mds_map
> state change up:boot --> up:replay
> 2018-01-08 16:04:21.314479 7fd83e2bc700  1 mds.0.20637 replay_start
> 2018-01-08 16:04:21.314492 7fd83e2bc700  1 mds.0.20637 recovery set is 1
> 2018-01-08 16:04:21.314517 7fd83e2bc700  1 mds.0.20637 waiting for
> osdmap 21467 (which blacklists prior instance)
> 2018-01-08 16:04:21.393307 7fd837aaf700  0 mds.0.cache creating system
> inode with ino:0x100
> 2018-01-08 16:04:21.397246 7fd837aaf700  0 mds.0.cache creating system
> inode with ino:0x1
>
> The cluster is recovering as we are replacing some of the OSDs, and
> there are a few slow/stuck requests, but I'm not sure whether that is
> the cause, as there is apparently no data loss (so far).
>
> How can I force the MDSes to quit the replay state?
>
> Thanks for any help,
>
> Alessandro
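
In case it helps, here's roughly what I'd run to look for inactive PGs
and blocked requests. This is from memory against Luminous-era syntax,
and the <id> / <metadata-pool> bits are placeholders for your own OSD
IDs and CephFS metadata pool name, so adjust as needed:

  # Overall health, PG states and any slow/blocked request warnings
  ceph health detail
  ceph status

  # PGs that are not active or stale (these can stall MDS journal replay)
  ceph pg dump_stuck inactive
  ceph pg dump_stuck stale

  # Check whether any of the stuck PGs belong to the CephFS metadata pool
  ceph pg ls-by-pool <metadata-pool>

  # Which OSDs are holding up peering
  ceph osd blocked-by

  # Blocked ops on a suspect OSD (run on the host where osd.<id> lives)
  ceph daemon osd.<id> dump_blocked_ops

  # If blocked ops don't clear on their own, restarting the OSD holding
  # them usually kicks peering along (assumes a systemd deployment)
  systemctl restart ceph-osd@<id>

If the metadata pool PGs come back active, the MDSes should work through
replay on their own rather than needing to be forced out of it.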