mds isn't working anymore after osds ran full

jasper.siero@xxxxxxxxxxxxxxxxx (Jasper Siero) · Mon, 18 Aug 2014 13:56:34 +0000

Hi all,

We have a small ceph cluster running version 0.80.1 with cephfs on five nodes.
Last week some osds filled up and shut themselves down. To help the osds start again I added some extra osds and moved some placement group directories from the full osds (each of which has a copy on another osd) to another location on the same node (as described in http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
After clearing some space on the full osds I started them again. After a lot of deep scrubbing and the repair of two inconsistent pgs, everything looked fine except the mds, which is still in the replay state and stays that way.
The log below says that the mds needs osdmap epoch 1833 but has 1832.

2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now mds.0.25
2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state change up:standby --> up:replay
2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25  recovery set is
2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25  need osdmap epoch 1833, have 1832
2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25  waiting for osdmap 1833 (which blacklists prior instance)
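
To check whether the mds keeps reporting the same epoch gap over time, the relevant log line can be pulled out with a small script. This is only a minimal sketch: the function name `find_osdmap_wait` and the `SAMPLE_LOG` constant are made up for illustration, and the pattern assumes the log wording stays exactly as shown above.

```python
import re

# Matches the mds log line that reports which osdmap epoch it is waiting for.
# Assumption: the wording is exactly as in the log excerpt above.
EPOCH_RE = re.compile(r"need osdmap epoch (\d+), have (\d+)")


def find_osdmap_wait(log_text):
    """Return (needed, have) from the last matching log line, or None."""
    result = None
    for line in log_text.splitlines():
        match = EPOCH_RE.search(line)
        if match:
            result = (int(match.group(1)), int(match.group(2)))
    return result


# Two lines copied from the mds log above.
SAMPLE_LOG = """\
2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25  need osdmap epoch 1833, have 1832
2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25  waiting for osdmap 1833 (which blacklists prior instance)
"""

if __name__ == "__main__":
    needed, have = find_osdmap_wait(SAMPLE_LOG)
    print(f"mds needs osdmap epoch {needed}, has {have}")
```

Running it against the full mds log after each restart would show whether the "have" epoch ever moves past 1832.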

# ceph status
    cluster c78209f5-55ea-4c70-8968-2231d2b05560
     health HEALTH_WARN mds cluster is degraded
     monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
     mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
     osdmap e1951: 12 osds: 12 up, 12 in
      pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
            124 GB used, 175 GB / 299 GB avail
                 492 active+clean
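
Note that the status output reports osdmap e1951, which is well past the epoch 1833 that the mds log says it is waiting for. The sketch below extracts the two relevant values from the status text so they can be compared with the log. It is an illustration only: `parse_status` and `SAMPLE_STATUS` are hypothetical names, and the parsing assumes the plain-text layout printed by this version of `ceph status`.

```python
import re


def parse_status(status_text):
    """Return (osdmap_epoch, {rank: (name, state)}) from `ceph status` text.

    Assumption: the text layout matches the status output shown above.
    """
    osd = re.search(r"osdmap e(\d+):", status_text)
    # Entries inside the braces of the mdsmap line look like
    # 0=th1-mon001=up:replay
    mds = re.search(r"mdsmap e\d+:.*?\{(.*?)\}", status_text)
    states = {}
    if mds:
        for entry in mds.group(1).split(","):
            rank, name, state = entry.strip().split("=")
            states[int(rank)] = (name, state)
    epoch = int(osd.group(1)) if osd else None
    return epoch, states


# Two lines copied from the status output above.
SAMPLE_STATUS = """\
     mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
     osdmap e1951: 12 osds: 12 up, 12 in
"""

if __name__ == "__main__":
    epoch, states = parse_status(SAMPLE_STATUS)
    print("cluster osdmap epoch:", epoch)
    for rank, (name, state) in sorted(states.items()):
        print(f"mds rank {rank}: {name} is {state}")
```

With the values from this post, the cluster is at epoch 1951 while the mds reports having 1832, so the mds appears not to be receiving newer osdmaps rather than waiting for one that does not exist yet.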

# ceph osd tree
# id    weight    type name    up/down    reweight
-1    0.2399    root default
-2    0.05997        host th1-osd001
0    0.01999            osd.0    up    1
1    0.01999            osd.1    up    1
2    0.01999            osd.2    up    1
-3    0.05997        host th1-osd002
3    0.01999            osd.3    up    1
4    0.01999            osd.4    up    1
5    0.01999            osd.5    up    1
-4    0.05997        host th1-mon003
6    0.01999            osd.6    up    1
7    0.01999            osd.7    up    1
8    0.01999            osd.8    up    1
-5    0.05997        host th1-mon002
9    0.01999            osd.9    up    1
10    0.01999            osd.10    up    1
11    0.01999            osd.11    up    1
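
As a sanity check on the tree above (every osd up, host weights equal to the sum of their osd weights), the output can be grouped per host. This is a rough sketch that assumes the whitespace-separated column layout shown above; `parse_osd_tree` and `SAMPLE_TREE` are hypothetical names.

```python
def parse_osd_tree(tree_text):
    """Group osds under their host from `ceph osd tree` text output.

    Assumption: columns are whitespace separated as in the output above
    (id, weight, then either "host <name>" or "osd.N <up/down> <reweight>").
    """
    hosts = {}
    current = None
    for line in tree_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the header
        if len(fields) >= 4 and fields[2] == "host":
            current = fields[3]
            hosts[current] = {"weight": float(fields[1]), "osds": []}
        elif len(fields) >= 4 and fields[2].startswith("osd.") and current:
            # (osd name, crush weight, up/down)
            hosts[current]["osds"].append((fields[2], float(fields[1]), fields[3]))
    return hosts


# Copied from the osd tree output above.
SAMPLE_TREE = """\
# id    weight    type name    up/down    reweight
-1    0.2399    root default
-2    0.05997        host th1-osd001
0    0.01999            osd.0    up    1
1    0.01999            osd.1    up    1
2    0.01999            osd.2    up    1
-3    0.05997        host th1-osd002
3    0.01999            osd.3    up    1
4    0.01999            osd.4    up    1
5    0.01999            osd.5    up    1
-4    0.05997        host th1-mon003
6    0.01999            osd.6    up    1
7    0.01999            osd.7    up    1
8    0.01999            osd.8    up    1
-5    0.05997        host th1-mon002
9    0.01999            osd.9    up    1
10    0.01999            osd.10    up    1
11    0.01999            osd.11    up    1
"""

if __name__ == "__main__":
    for host, info in parse_osd_tree(SAMPLE_TREE).items():
        down = [name for name, _, status in info["osds"] if status != "up"]
        total = sum(weight for _, weight, _ in info["osds"])
        print(f"{host}: {len(info['osds'])} osds, weight sum {total:.5f}, down: {down}")
```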

What is the way to get the mds up and running again?

I still have all the placement group directories that I moved off the full osds, which were down, to free up disk space.

Kind regards,

Jasper Siero