On Sat, Oct 1, 2016 at 7:19 PM, Adam Tygart <mozes@xxxxxxx> wrote:
> The wip-fixup-mds-standby-init branch doesn't seem to allow the
> ceph-mons to start up correctly. I disabled all mds servers before
> starting the monitors up, so it would seem the pending mdsmap update
> is in durable storage. Now that the mds servers are down, can we clear
> the mdsmap of active and standby servers while initializing the mons?
> I would hope that, now that all the versions are in sync, a bad
> standby_for_fscid would not be possible with new mds servers starting.

Looks like my first guess about the run-time initialization being
confused was wrong. :( Given that, we're pretty befuddled. But I
commented on irc:

> if you've still got a core dump, can you go up a frame (to
> MDSMonitor::maybe_promote_standby) and check the values of
> target_role.rank and target_role.fscid, and how that compares to
> info.standby_for_fscid, info.legacy_client_fscid, and
> info.standby_for_rank?

That might pop up something and isn't accessible in the log you
posted. We also can't see an osdmap or dump; if you could either
extract and print that, or get a log which includes it, that might
show up something. I don't think we changed the mds<->mon protocol or
anything in the point releases, so the different package version
*shouldn't* matter... right, John? ;)
-Greg

>
> --
> Adam
>
> On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>> Hello all,
>>>
>>> Not sure if this went through before or not, as I can't check the
>>> mailing list archives.
>>>
>>> I've gotten myself into a bit of a bind. I was prepping to add a new
>>> mds node to my ceph cluster, e.g. ceph-deploy mds create mormo.
>>>
>>> Unfortunately, it started the mds server before I was ready. My
>>> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
>>>
>>> This caused 3 of my 5 monitors to crash. Since I immediately realized
>>> the mds was a newer version, I took that opportunity to upgrade my
>>> monitors to 10.2.3. Three of the 5 monitors continue to crash, and it
>>> looks like they are crashing when trying to apply a pending mdsmap
>>> update.
>>>
>>> The log is available here:
>>> http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz
>>>
>>> I have attempted (making backups, of course) to extract the monmap
>>> from a working monitor and insert it into a broken one. No luck, and
>>> the backup was restored.
>>>
>>> Since I had 2 working monitors, I backed up the monitor stores,
>>> updated the monmaps to remove the broken ones, and tried to restart
>>> them. I then tried to restart the "working" ones. They then failed in
>>> the same way. I've now restored my backups of those monitors.
>>>
>>> I need to get these monitors back up post-haste.
>>>
>>> If you've got any ideas, I would be grateful.
>>
>> I'm not sure, but it looks like it's now too late to keep the problem
>> out of the durable storage; if you try again, make sure you turn off
>> the MDS first.
>>
>> It sort of looks like you've managed to get a failed MDS with an
>> invalid fscid (i.e., a cephfs filesystem ID).
>>
>> ...or maybe just a terrible coding mistake. As mentioned on irc,
>> wip-fixup-mds-standby-init should fix it.
>> I've created a ticket as well: http://tracker.ceph.com/issues/17466
>> -Greg
>>
>>>
>>> --
>>> Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
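
For reference, the core-dump inspection Greg suggests above would look
roughly like this in gdb. This is only a sketch: the core file path and
the frame number are assumptions (walk the backtrace until you reach
MDSMonitor::maybe_promote_standby), and the matching ceph debuginfo
packages need to be installed for the variable names to resolve.

    $ gdb /usr/bin/ceph-mon /path/to/ceph-mon.core   # core path is an assumption
    (gdb) bt                 # locate the MDSMonitor::maybe_promote_standby frame
    (gdb) frame 1            # frame number is a guess; use "up" until you reach it
    (gdb) print target_role.rank
    (gdb) print target_role.fscid
    (gdb) print info.standby_for_fscid
    (gdb) print info.legacy_client_fscid
    (gdb) print info.standby_for_rank

Likewise, the monmap extract/insert that Adam describes is usually done
with ceph-mon's --extract-monmap/--inject-monmap options while the
daemons are stopped; the monitor names below are placeholders, not
taken from the thread.

    # on a healthy monitor (stopped), pull out the current monmap
    $ systemctl stop ceph-mon@good-mon
    $ ceph-mon -i good-mon --extract-monmap /tmp/monmap
    $ monmaptool --print /tmp/monmap
    # optionally drop a broken monitor from the map
    $ monmaptool /tmp/monmap --rm broken-mon
    # inject the map into the broken monitor's store, then restart it
    $ ceph-mon -i broken-mon --inject-monmap /tmp/monmap
    $ systemctl start ceph-mon@broken-mon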