After restarting your MDS, it still says it has epoch 1832 and needs
epoch 1833? I think you didn't really restart it. If the epoch numbers
have changed, can you restart it with "debug mds = 20", "debug
objecter = 20", "debug ms = 1" in the ceph.conf and post the resulting
log file somewhere?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero
<jasper.siero at target-holding.nl> wrote:
> Unfortunately that doesn't help. I restarted both the active and standby
> mds but that doesn't change the state of the mds. Is there a way to force
> the mds to look at the 1832 epoch (or earlier) instead of 1833 (need
> osdmap epoch 1833, have 1832)?
>
> Thanks,
>
> Jasper
> ________________________________________
> From: Gregory Farnum [greg at inktank.com]
> Sent: Tuesday, August 19, 2014 19:49
> To: Jasper Siero
> CC: ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>
> On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
> <jasper.siero at target-holding.nl> wrote:
>> Hi all,
>>
>> We have a small ceph cluster running version 0.80.1 with cephfs on five
>> nodes.
>> Last week some osd's were full and shut themselves down. To help the
>> osd's start again I added some extra osd's and moved some placement group
>> directories on the full osd's (which have a copy on another osd) to
>> another place on the node (as mentioned in
>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
>> After clearing some space on the full osd's I started them again. After a
>> lot of deep scrubbing and two pg inconsistencies which needed to be
>> repaired, everything looked fine except the mds, which is still in the
>> replay state and stays that way.
>> The log below says that the mds needs osdmap epoch 1833 and has 1832.
>>
>> 2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
>> 2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now mds.0.25
>> 2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state change up:standby --> up:replay
>> 2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
>> 2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25 recovery set is
>> 2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25 need osdmap epoch 1833, have 1832
>> 2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25 waiting for osdmap 1833 (which blacklists prior instance)
>>
>> # ceph status
>>     cluster c78209f5-55ea-4c70-8968-2231d2b05560
>>      health HEALTH_WARN mds cluster is degraded
>>      monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>>      mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>>      osdmap e1951: 12 osds: 12 up, 12 in
>>       pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
>>             124 GB used, 175 GB / 299 GB avail
>>                  492 active+clean
>>
>> # ceph osd tree
>> # id    weight    type name          up/down  reweight
>> -1      0.2399    root default
>> -2      0.05997       host th1-osd001
>> 0       0.01999           osd.0      up       1
>> 1       0.01999           osd.1      up       1
>> 2       0.01999           osd.2      up       1
>> -3      0.05997       host th1-osd002
>> 3       0.01999           osd.3      up       1
>> 4       0.01999           osd.4      up       1
>> 5       0.01999           osd.5      up       1
>> -4      0.05997       host th1-mon003
>> 6       0.01999           osd.6      up       1
>> 7       0.01999           osd.7      up       1
>> 8       0.01999           osd.8      up       1
>> -5      0.05997       host th1-mon002
>> 9       0.01999           osd.9      up       1
>> 10      0.01999           osd.10     up       1
>> 11      0.01999           osd.11     up       1
>>
>> What is the way to get the mds up and running again?
>>
>> I still have all the placement group directories which I moved from the
>> full osds which were down to create disk space.
>
> Try just restarting the MDS daemon. This sounds a little familiar, so I
> think it's a known bug which may be fixed in a later dev or point
> release on the MDS, but it's a soft-state rather than a disk-state
> issue.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
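
For anyone who finds this thread later, here is a minimal sketch of how the
debug settings Greg asks for could be turned on, assuming a Firefly (0.80.x)
node managed by sysvinit; the daemon name mds.th1-mon001 is taken from the
mdsmap above, and the config and log paths are only the usual defaults, so
both may differ on your distribution:

    # add to /etc/ceph/ceph.conf on the node running the active MDS
    [mds]
        debug mds = 20
        debug objecter = 20
        debug ms = 1

    # restart the daemon so it comes up with the new log levels
    # (the command differs under upstart or systemd)
    service ceph restart mds.th1-mon001

    # or, if supported on your release, raise the levels on the running daemon
    ceph tell mds.th1-mon001 injectargs '--debug-mds 20 --debug-objecter 20 --debug-ms 1'

    # the log to post is then, by default,
    # /var/log/ceph/ceph-mds.th1-mon001.log

The blacklist referred to in the "waiting for osdmap 1833 (which blacklists
prior instance)" line can be listed with "ceph osd blacklist ls", and the
cluster's current osdmap epoch checked with "ceph osd stat"; if the daemon
really was restarted, the epochs in the "need osdmap epoch ..., have ..."
line should change, which is the check Greg describes above.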