After restarting your MDS, it still says it has epoch 1832 and needs
epoch 1833? I think you didn't really restart it. If the epoch numbers
have changed, can you restart it with "debug mds = 20", "debug
objecter = 20", "debug ms = 1" in the ceph.conf and post the resulting
log file somewhere?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero
<jasper.siero at target-holding.nl> wrote:
> Unfortunately that doesn't help. I restarted both the active and standby
> mds but that doesn't change the state of the mds. Is there a way to force
> the mds to look at the 1832 epoch (or earlier) instead of 1833 (need
> osdmap epoch 1833, have 1832)?
>
> Thanks,
>
> Jasper
> ________________________________________
> From: Gregory Farnum [greg at inktank.com]
> Sent: Tuesday, August 19, 2014 19:49
> To: Jasper Siero
> CC: ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>
> On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
> <jasper.siero at target-holding.nl> wrote:
>> Hi all,
>>
>> We have a small ceph cluster running version 0.80.1 with cephfs on five
>> nodes.
>> Last week some osd's were full and shut themselves down. To help the
>> osd's start again I added some extra osd's and moved some placement group
>> directories on the full osd's (which have a copy on another osd) to
>> another place on the node (as mentioned in
>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
>> After clearing some space on the full osd's I started them again. After a
>> lot of deep scrubbing and two pg inconsistencies which needed to be
>> repaired, everything looked fine except the mds, which is still in the
>> replay state and stays that way.
>> The log below says that the mds needs osdmap epoch 1833 and has 1832.
>>
>> 2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
>> 2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now mds.0.25
>> 2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state change up:standby --> up:replay
>> 2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
>> 2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25 recovery set is
>> 2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25 need osdmap epoch 1833, have 1832
>> 2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25 waiting for osdmap 1833 (which blacklists prior instance)
>>
>> # ceph status
>>     cluster c78209f5-55ea-4c70-8968-2231d2b05560
>>      health HEALTH_WARN mds cluster is degraded
>>      monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>>      mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>>      osdmap e1951: 12 osds: 12 up, 12 in
>>       pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
>>             124 GB used, 175 GB / 299 GB avail
>>                  492 active+clean
>>
>> # ceph osd tree
>> # id    weight    type name          up/down  reweight
>> -1      0.2399    root default
>> -2      0.05997       host th1-osd001
>> 0       0.01999           osd.0      up       1
>> 1       0.01999           osd.1      up       1
>> 2       0.01999           osd.2      up       1
>> -3      0.05997       host th1-osd002
>> 3       0.01999           osd.3      up       1
>> 4       0.01999           osd.4      up       1
>> 5       0.01999           osd.5      up       1
>> -4      0.05997       host th1-mon003
>> 6       0.01999           osd.6      up       1
>> 7       0.01999           osd.7      up       1
>> 8       0.01999           osd.8      up       1
>> -5      0.05997       host th1-mon002
>> 9       0.01999           osd.9      up       1
>> 10      0.01999           osd.10     up       1
>> 11      0.01999           osd.11     up       1
>>
>> What is the way to get the mds up and running again?
>>
>> I still have all the placement group directories which I moved from the
>> full osds which were down to create disk space.
>
> Try just restarting the MDS daemon. This sounds a little familiar, so I
> think it's a known bug which may be fixed in a later dev or point
> release on the MDS, but it's a soft-state rather than a disk-state
> issue.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
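
For anyone who finds this thread later, here is a minimal sketch of how the
debug settings Greg asks for could be turned on, assuming a Firefly (0.80.x)
node managed by sysvinit; the daemon name mds.th1-mon001 is taken from the
mdsmap above, and the config and log paths are only the usual defaults, so
both may differ on your distribution:

    # add to /etc/ceph/ceph.conf on the node running the active MDS
    [mds]
        debug mds = 20
        debug objecter = 20
        debug ms = 1

    # restart the daemon so it comes up with the new log levels
    # (the command differs under upstart or systemd)
    service ceph restart mds.th1-mon001

    # or, if supported on your release, raise the levels on the running daemon
    ceph tell mds.th1-mon001 injectargs '--debug-mds 20 --debug-objecter 20 --debug-ms 1'

    # the log to post is then, by default,
    # /var/log/ceph/ceph-mds.th1-mon001.log

The blacklist referred to in the "waiting for osdmap 1833 (which blacklists
prior instance)" line can be listed with "ceph osd blacklist ls", and the
cluster's current osdmap epoch checked with "ceph osd stat"; if the daemon
really was restarted, the epochs in the "need osdmap epoch ..., have ..."
line should change, which is the check Greg describes above.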