Re: Down monitors after adding mds node

Even with the wip-fixup-mds-standby-init branch, the ceph-mons don't
seem to start up correctly. I disabled all mds servers before starting
the monitors, so it would seem the pending mdsmap update is already in
durable storage. Now that the mds servers are down, can we clear the
mdsmap of active and standby servers while initializing the mons? I
would hope that, now that all the versions are in sync, a bad
standby_for_fscid would no longer be possible when new mds servers
start.
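
To be concrete, here is roughly what I have in mind once the mons come
up; the filesystem name "cephfs" and the systemd unit name below are
just illustrative, and I'm only guessing that "fs reset" would drop the
bad standby entries:

    # stop every mds so nothing can re-register (run on each mds host)
    systemctl stop ceph-mds.target

    # once a quorum forms, see what the map still claims is up/standby
    ceph mds stat

    # if bogus entries remain, maybe reset the filesystem map
    # (the data and metadata pools themselves are untouched)
    ceph fs reset cephfs --yes-i-really-mean-it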

--
Adam

On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>> Hello all,
>>
>> Not sure if this went through before or not, as I can't check the
>> mailing list archives.
>>
>> I've gotten myself into a bit of a bind. I was prepping to add a new
>> mds node to my ceph cluster. e.g. ceph-deploy mds create mormo
>>
>> Unfortunately, it started the mds server before I was ready. My
>> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
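
In hindsight, a quick version check before letting ceph-deploy start
the daemon would have caught the mismatch, e.g.:

    ceph-mon --version            # on an existing monitor host
    ssh mormo ceph-mds --version  # on the host about to run the new mds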
>>
>> This caused 3 of my 5 monitors to crash. Since I immediately realized
>> the mds was a newer version, I took the opportunity to upgrade my
>> monitors to 10.2.3. Three of the 5 monitors continue to crash, and it
>> looks like they are crashing while trying to apply a pending mdsmap
>> update.
>>
>> The log is available here:
>> http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz
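
For anyone digging into that log: the mon can be run in the foreground
with debugging turned up to get more detail, roughly:

    ceph-mon -i hobbit01 -d --debug-mon 20 --debug-ms 1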
>>
>> I have attempted (making backups, of course) to extract the monmap
>> from a working monitor and inject it into a broken one. No luck, so I
>> restored the backup.
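
For reference, the usual way to do that, with both mon daemons stopped
first (the mon IDs below are only illustrative):

    ceph-mon -i hobbit02 --extract-monmap /tmp/monmap   # from a working mon
    ceph-mon -i hobbit01 --inject-monmap /tmp/monmap    # into a broken one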
>>
>> Since I had 2 working monitors, I backed up their monitor stores,
>> updated the monmaps to remove the broken monitors, and then tried to
>> restart the "working" ones. They failed in the same way, so I've now
>> restored my backups of those monitors.
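
The monmap surgery itself would be the usual monmaptool edit, roughly
as below; the mon IDs are only illustrative, and all mon daemons were
stopped first:

    ceph-mon -i hobbit04 --extract-monmap /tmp/monmap
    monmaptool /tmp/monmap --rm hobbit01
    monmaptool /tmp/monmap --rm hobbit02
    monmaptool /tmp/monmap --rm hobbit03
    ceph-mon -i hobbit04 --inject-monmap /tmp/monmap
    ceph-mon -i hobbit05 --inject-monmap /tmp/monmap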
>>
>> I need to get these monitors back up post-haste.
>>
>> If you've got any ideas, I would be grateful.
>
> I'm not sure, but it looks like it's now too late to keep the problem
> out of durable storage. If you try again, make sure you turn off the
> MDS first.
>
> It sort of looks like you've managed to get a failed MDS with an
> invalid fscid (i.e., a cephfs filesystem ID).
>
> ...or maybe just a terrible coding mistake. As mentioned on irc,
> wip-fixup-mds-standby-init should fix it. I've created a ticket as
> well: http://tracker.ceph.com/issues/17466
> -Greg
>
>
>>
>> --
>> Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


