I pushed a wip-osd-hb branch that vastly simplifies the OSD heartbeats.

The problem was that ages ago I went for a model with asymmetric heartbeats (at the time, replicas -> primaries) because it was elegant and seemed more efficient. The reality was that the asynchrony between osdmap versions on different nodes made this a huge pain to make robust, particularly when it came to managing the persistence of sessions between nodes that are going up/down.

The new branch throws that all out and uses a simple ping/reply model. The retry behavior is simple and robust, and all the failure issues go away. The downside is that there are more messages moving around, but they are tiny, so who cares.

This should address the problems Wido was seeing in #2116.

sage