On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
<jasper.siero at target-holding.nl> wrote:
> Hi all,
>
> We have a small ceph cluster running version 0.80.1 with cephfs on five
> nodes.
> Last week some osd's were full and shut themselves down. To help the
> osd's start again I added some extra osd's and moved some placement group
> directories on the full osd's (which have a copy on another osd) to
> another place on the node (as mentioned in
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
> After clearing some space on the full osd's I started them again. After a
> lot of deep scrubbing and two pg inconsistencies which needed to be
> repaired, everything looked fine except the mds, which is still in the
> replay state and stays that way.
> The log below says that the mds needs osdmap epoch 1833 and has 1832.
>
> 2014-08-18 12:29:22.268248 7fa786182700 1 mds.-1.0 handle_mds_map standby
> 2014-08-18 12:29:22.273995 7fa786182700 1 mds.0.25 handle_mds_map i am now
> mds.0.25
> 2014-08-18 12:29:22.273998 7fa786182700 1 mds.0.25 handle_mds_map state
> change up:standby --> up:replay
> 2014-08-18 12:29:22.274000 7fa786182700 1 mds.0.25 replay_start
> 2014-08-18 12:29:22.274014 7fa786182700 1 mds.0.25 recovery set is
> 2014-08-18 12:29:22.274016 7fa786182700 1 mds.0.25 need osdmap epoch 1833,
> have 1832
> 2014-08-18 12:29:22.274017 7fa786182700 1 mds.0.25 waiting for osdmap 1833
> (which blacklists prior instance)
>
> # ceph status
>     cluster c78209f5-55ea-4c70-8968-2231d2b05560
>      health HEALTH_WARN mds cluster is degraded
>      monmap e3: 3 mons at
> {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0},
> election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>      mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>      osdmap e1951: 12 osds: 12 up, 12 in
>       pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
>             124 GB used, 175 GB / 299 GB avail
>                  492 active+clean
>
> # ceph osd tree
> # id    weight  type name               up/down reweight
> -1      0.2399  root default
> -2      0.05997         host th1-osd001
> 0       0.01999                 osd.0   up      1
> 1       0.01999                 osd.1   up      1
> 2       0.01999                 osd.2   up      1
> -3      0.05997         host th1-osd002
> 3       0.01999                 osd.3   up      1
> 4       0.01999                 osd.4   up      1
> 5       0.01999                 osd.5   up      1
> -4      0.05997         host th1-mon003
> 6       0.01999                 osd.6   up      1
> 7       0.01999                 osd.7   up      1
> 8       0.01999                 osd.8   up      1
> -5      0.05997         host th1-mon002
> 9       0.01999                 osd.9   up      1
> 10      0.01999                 osd.10  up      1
> 11      0.01999                 osd.11  up      1
>
> What is the way to get the mds up and running again?
>
> I still have all the placement group directories which I moved from the
> full osds which were down to create disk space.

Try just restarting the MDS daemon. This sounds a little familiar, so I
think it's a known MDS bug which may be fixed in a later dev or point
release; in any case it's a soft-state rather than an on-disk state issue.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
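
P.S. For reference, a minimal sketch of the restart, assuming a
sysvinit-managed Firefly (0.80.x) install and the MDS name th1-mon001
taken from the mdsmap above (adjust for your init system and daemon name):

# /etc/init.d/ceph restart mds.th1-mon001

or, on an Upstart-based (Ubuntu) install:

# restart ceph-mds id=th1-mon001

Then watch "ceph mds stat" or "ceph -w" to see whether the daemon picks up
the newer osdmap and moves past up:replay.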