Hi everyone,

there are a couple of bug reports about this in Redmine, but only one (unanswered) mailing list message[1] that I could find. So I figured I'd raise the issue here again and copy the original reporters of the bugs (they are BCC'd, because in case they are no longer subscribed it wouldn't be appropriate to share their email addresses with the list).

This is about https://tracker.ceph.com/issues/40029 and https://tracker.ceph.com/issues/39978 (the latter of which was recently closed as a duplicate of the former).

In short, it appears that at least in luminous and mimic (I haven't tried nautilus yet), it's possible to crash a mon when adding a new OSD, at the point where the OSD tries to inject itself into the crush map under its host bucket and that host bucket does not exist yet.

What's worse is that once the OSD's "ceph osd new" process has crashed the leader mon in this way, a new leader is elected, and if the "ceph osd new" process is still running on the OSD node, it promptly connects to that mon and kills it too. This continues until sufficiently many mons have died for quorum to be lost.

The recovery steps appear to involve

- killing the "ceph osd new" process,
- restarting mons until you regain quorum,
- and then running "ceph osd purge" to drop the problematic OSD entry from the crushmap and osdmap.

The issue can apparently be worked around by adding the host buckets to the crushmap manually before adding the new OSDs (a rough sketch follows at the end of this message), but surely this isn't intended to be a prerequisite, at least not to the point of mons crashing otherwise?

I am also guessing that this is some weird corner case rooted in an unusual combination of contributing factors, because otherwise more people would presumably have been bitten by this problem.

Anyone able to share their thoughts on this one? Have more people run into this?

Cheers,
Florian

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034880.html — interestingly, I could find this message in the pipermail archive but not in the one that my MUA keeps for me. So perhaps that message wasn't delivered to all subscribers, which might be why it has gone unanswered.
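
P.S.: For anyone wanting to try the workaround, this is roughly what I mean by pre-creating the host buckets. The hostname "osd-node-1" is just a placeholder, and you'd adjust the parent to wherever the host belongs in your own crush hierarchy; I can't promise this covers every variant of the problem, it's simply the workaround as I understand it:

    # create the host bucket before "ceph osd new" tries to inject the OSD under it
    ceph osd crush add-bucket osd-node-1 host

    # place the new bucket under its intended parent (here: the default root)
    ceph osd crush move osd-node-1 root=default

    # verify the bucket is where you expect it
    ceph osd tree

And for the recovery, once the runaway "ceph osd new" process has been killed and the mons are back in quorum, it's the usual purge of the half-created OSD (replace <id> with its numeric id):

    ceph osd purge <id> --yes-i-really-mean-it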