We have had multiple clusters experience the following situation over the past few months, on both 14.2.6 and 14.2.11. In a few instances it seemed random; in a second case we had a temporary networking disruption; in a third we accidentally made some OSD changes that pushed certain OSDs over the hard limit on PGs-per-OSD and left those PGs stuck inactive. Regardless of the scenario, one monitor always falls out of quorum, seemingly cannot rejoin, and its logs contain the following:

2020-11-18 05:34:54.113 7f14286d9700 1 mon.a2plcephmon01@2(probing) e30 handle_auth_request failed to assign global_id
2020-11-18 05:34:54.295 7f14286d9700 -1 mon.a2plcephmon01@2(probing) e30 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2020-11-18 05:34:55.397 7f14286d9700 1 mon.a2plcephmon01@2(probing) e30 handle_auth_request failed to assign global_id

Ultimately, we take the quick route of rebuilding the mon: we wipe its DB, re-do a mkfs, and rejoin the quorum by starting it with:

    sudo -u ceph ceph-mon -i a2plcephmon01 --public-addr 10.1.1.1:3300

Luckily we have never dropped below 50% in quorum so far, but we are very interested in preventing this from happening going forward.

Inspecting "ceph mon dump" on the affected clusters, I see that all of the rebuilt mons use only msgr v2 on 3300, while all of the mons that never required rebuilding are using both v1 and v2 addressing.

So my questions are:

- Does this monitor failure sound familiar?
- Is the manner in which we are rebuilding mons problematic, in that it lands them on msgr v2 only, or should we be using v2 only anyway on an all-Nautilus cluster?

Thanks for any insight.

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
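P.S. In case it is relevant, here is roughly the full sequence we run when rebuilding one of these mons. The paths, cluster name, and systemd unit name below are the defaults on our hosts and may differ elsewhere:

    # On the broken mon host; assumes the default cluster name "ceph"
    # and the default mon data path.
    systemctl stop ceph-mon@a2plcephmon01
    mv /var/lib/ceph/mon/ceph-a2plcephmon01 /var/lib/ceph/mon/ceph-a2plcephmon01.old

    # Pull the current monmap and mon. keyring from the surviving quorum.
    ceph mon getmap -o /tmp/monmap
    ceph auth get mon. -o /tmp/mon.keyring

    # Recreate the mon store and start the daemon back up.
    sudo -u ceph ceph-mon -i a2plcephmon01 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    sudo -u ceph ceph-mon -i a2plcephmon01 --public-addr 10.1.1.1:3300

If dual addressing is in fact what we should have, I assume we could restore it on a rebuilt mon with something along these lines, but we have not tried it yet:

    ceph mon set-addrs a2plcephmon01 "[v2:10.1.1.1:3300,v1:10.1.1.1:6789]"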