The wip-fixup-mds-standby-init branch doesn't seem to allow the
ceph-mons to start up correctly. I disabled all mds servers before
starting the monitors, so it would seem the pending mdsmap update is in
durable storage. Now that the mds servers are down, can we clear the
mdsmap of active and standby servers while initializing the mons? I
would hope that, now that all the versions are in sync, a bad
standby_for_fscid would not be possible with new mds servers starting.
--
Adam

On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>> Hello all,
>>
>> Not sure if this went through before or not, as I can't check the
>> mailing list archives.
>>
>> I've gotten myself into a bit of a bind. I was prepping to add a new
>> mds node to my ceph cluster, e.g. ceph-deploy mds create mormo
>>
>> Unfortunately, it started the mds server before I was ready. My
>> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
>>
>> This caused 3 of my 5 monitors to crash. Since I immediately realized
>> the mds was a newer version, I took that opportunity to upgrade my
>> monitors to 10.2.3. Three of the 5 monitors continue to crash, and it
>> looks like they are crashing when trying to apply a pending mdsmap
>> update.
>>
>> The log is available here:
>> http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz
>>
>> I have attempted (making backups of course) to extract the monmap
>> from a working monitor and inject it into a broken one. No luck, and
>> the backup was restored.
>>
>> Since I had 2 working monitors, I backed up the monitor stores,
>> updated the monmaps to remove the broken ones, and tried to restart
>> them. I then tried to restart the "working" ones. They then failed in
>> the same way. I've now restored my backups of those monitors.
>>
>> I need to get these monitors back up post-haste.
>>
>> If you've got any ideas, I would be grateful.
>
> I'm not sure, but it looks like it's now too late to keep the problem
> out of durable storage; if you try again, make sure you turn off the
> MDS first.
>
> It sort of looks like you've managed to get a failed MDS with an
> invalid fscid (i.e., a CephFS filesystem ID).
>
> ...or maybe just a terrible coding mistake. As mentioned on IRC,
> wip-fixup-mds-standby-init should fix it. I've created a ticket as
> well: http://tracker.ceph.com/issues/17466
> -Greg
>
>
>>
>> --
>> Adam
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
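
[Archive note] The monmap extract/edit/inject attempt described in the
thread is usually done along these lines. This is a hedged sketch, not
the poster's exact commands: the mon IDs (hobbit01, hobbit02) and the
/tmp/monmap path are placeholders, the affected monitor must be stopped
first, and the store should be backed up beforehand as Adam did.

```shell
# Back up the broken monitor's store before touching it.
cp -a /var/lib/ceph/mon/ceph-hobbit02 /var/lib/ceph/mon/ceph-hobbit02.bak

# Extract the monmap from a known-good (stopped) monitor's store.
ceph-mon -i hobbit01 --extract-monmap /tmp/monmap

# Inspect the map, and optionally remove a broken monitor from it.
monmaptool --print /tmp/monmap
monmaptool /tmp/monmap --rm hobbit02

# Inject the edited monmap into the broken monitor, then start it.
ceph-mon -i hobbit02 --inject-monmap /tmp/monmap
```

As the thread shows, this only helps when the problem lives in the
monmap itself; here the crash came from a pending mdsmap update already
committed to the mon store, so restoring the backups was the right call.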