On Sat, Oct 1, 2016 at 7:19 PM, Adam Tygart <mozes@xxxxxxx> wrote:
> The wip-fixup-mds-standby-init branch doesn't seem to allow the
> ceph-mons to start up correctly. I disabled all mds servers before
> starting the monitors up, so it would seem the pending mdsmap update
> is in durable storage. Now that the mds servers are down, can we clear
> the mdsmap of active and standby servers while initializing the mons?
> I would hope that, now that all the versions are in sync, a bad
> standby_for_fscid would not be possible with new mds servers starting.

Looks like my first guess about the run-time initialization being
confused was wrong. :( Given that, we're pretty befuddled. But I
commented on irc:

> if you've still got a core dump, can you go up a frame (to
> MDSMonitor::maybe_promote_standby) and check the values of
> target_role.rank and target_role.fscid, and how that compares to
> info.standby_for_fscid, info.legacy_client_fscid, and
> info.standby_for_rank?

That might pop up something and isn't accessible in the log you
posted. We also can't see an osdmap or dump; if you could either
extract and print that, or get a log which includes it, that might
show up something. I don't think we changed the mds<->mon protocol or
anything in the point releases, so the different package version
*shouldn't* matter... right, John? ;)
-Greg

>
> --
> Adam
>
> On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>> Hello all,
>>>
>>> Not sure if this went through before or not, as I can't check the
>>> mailing list archives.
>>>
>>> I've gotten myself into a bit of a bind. I was prepping to add a new
>>> mds node to my ceph cluster, e.g. ceph-deploy mds create mormo.
>>>
>>> Unfortunately, it started the mds server before I was ready. My
>>> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
>>>
>>> This caused 3 of my 5 monitors to crash. Since I immediately realized
>>> the mds was a newer version, I took that opportunity to upgrade my
>>> monitors to 10.2.3. Three of the 5 monitors continue to crash, and it
>>> looks like they are crashing when trying to apply a pending mdsmap
>>> update.
>>>
>>> The log is available here:
>>> http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz
>>>
>>> I have attempted (making backups, of course) to extract the monmap
>>> from a working monitor and insert it into a broken one. No luck, and
>>> the backup was restored.
>>>
>>> Since I had 2 working monitors, I backed up the monitor stores,
>>> updated the monmaps to remove the broken ones, and tried to restart
>>> them. I then tried to restart the "working" ones. They then failed in
>>> the same way. I've now restored my backups of those monitors.
>>>
>>> I need to get these monitors back up post-haste.
>>>
>>> If you've got any ideas, I would be grateful.
>>
>> I'm not sure, but it looks like it's now too late to keep the problem
>> out of the durable storage; if you try again, make sure you turn off
>> the MDS first.
>>
>> It sort of looks like you've managed to get a failed MDS with an
>> invalid fscid (i.e., a cephfs filesystem ID).
>>
>> ...or maybe just a terrible coding mistake. As mentioned on irc,
>> wip-fixup-mds-standby-init should fix it.
>> I've created a ticket as well: http://tracker.ceph.com/issues/17466
>> -Greg
>>
>>>
>>> --
>>> Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
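
For reference, the core-dump inspection Greg suggests above would look
roughly like this in gdb. This is only a sketch: the core file path and
the frame number are assumptions (walk the backtrace until you reach
MDSMonitor::maybe_promote_standby), and the matching ceph debuginfo
packages need to be installed for the variable names to resolve.

    $ gdb /usr/bin/ceph-mon /path/to/ceph-mon.core   # core path is an assumption
    (gdb) bt                 # locate the MDSMonitor::maybe_promote_standby frame
    (gdb) frame 1            # frame number is a guess; use "up" until you reach it
    (gdb) print target_role.rank
    (gdb) print target_role.fscid
    (gdb) print info.standby_for_fscid
    (gdb) print info.legacy_client_fscid
    (gdb) print info.standby_for_rank

Likewise, the monmap extract/insert that Adam describes is usually done
with ceph-mon's --extract-monmap/--inject-monmap options while the
daemons are stopped; the monitor names below are placeholders, not
taken from the thread.

    # on a healthy monitor (stopped), pull out the current monmap
    $ systemctl stop ceph-mon@good-mon
    $ ceph-mon -i good-mon --extract-monmap /tmp/monmap
    $ monmaptool --print /tmp/monmap
    # optionally drop a broken monitor from the map
    $ monmaptool /tmp/monmap --rm broken-mon
    # inject the map into the broken monitor's store, then restart it
    $ ceph-mon -i broken-mon --inject-monmap /tmp/monmap
    $ systemctl start ceph-mon@broken-mon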