On Mon, Mar 19, 2018 at 5:26 AM, xsong682@xxxxxxxx <xsong682@xxxxxxxx> wrote:
> Thanks for your reply.
>
> I think there are 2 scenarios:
> 1. Peering for non-primary OSD down
> For example, PG [1, 3, 5] -> [1, 3, 6]: osd 1 is always the primary OSD, so its data should be correct (barring data corruption), and we can read data directly from osd 1 before the PG has peered.
> I think we can check "primary of last interval == primary of current interval"; if they are equal, the primary hasn't changed and we are in this scenario. Calculating the last interval and the current
> interval is much faster than peering, so we can calculate them first and then compare them.

If you try and draw out how to do this detection, I think you'll realize that identifying if the primary OSD has changed (from only that primary's local state) is exactly as difficult as peering when you lose the primary OSD.

That is: given a cluster of 5 nodes, a PG "pg" with acting OSDs 1,2,3, and a starting osdmap epoch 42, how does osd 1 differentiate between the global states
1) osdmap 43 has been issued marking down osd 3, and
2) immediately afterwards, osdmap 44 was issued marking down osd 1?
The only way it can do that is by contacting osd.2 and making sure that it forms an acting set in which all osds agree "1 and 2 are acting from epoch 43". And that's exactly what peering does.

We *have* discussed in the past trying to shortcut through some of the more expensive parts of peering. Right now, for reliability and reproducibility, the primary always takes the exact same steps, which involves resetting all state and rebuilding it via message passing. But if the primary doesn't change, it could avoid all that (since if it remained up the whole time, it knows the state of its peers which remained up the whole time). That's a significantly more complicated bit of work, though.
-Greg

>
> 2. Peering for primary OSD down
> If the primary OSD is down, "primary of last interval != primary of current interval", so we need to wait for peering.
>
> -Xiaobing
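
To make the epoch 42/43/44 example in the reply concrete, here is a toy Python model. It is only a sketch and not Ceph code: the OsdMap class, the helper names, and the rule "first up OSD in the PG's list is primary" are all assumptions made up for illustration. It shows that when osd 1 has only seen maps up to epoch 43, its local view is identical in both global states, so the proposed "primary of last interval == primary of current interval" check gives the same answer in each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OsdMap:
    """Hypothetical, heavily simplified osdmap: an epoch plus the set of up OSDs."""
    epoch: int
    up: frozenset


def acting_primary(osdmap, pg_osds):
    """Assumed rule for this toy: the first OSD in the PG's list that is up."""
    for osd in pg_osds:
        if osd in osdmap.up:
            return osd
    return None


def local_view(history, last_seen_epoch):
    """The maps an OSD knows about if it has received nothing newer."""
    return [m for m in history if m.epoch <= last_seen_epoch]


def primary_unchanged(view, pg_osds):
    """The proposed shortcut: compare primaries of the last two known maps."""
    return acting_primary(view[-2], pg_osds) == acting_primary(view[-1], pg_osds)


PG_OSDS = (1, 2, 3)
epoch42 = OsdMap(42, frozenset({1, 2, 3, 4, 5}))
epoch43 = OsdMap(43, frozenset({1, 2, 4, 5}))  # osd 3 marked down
epoch44 = OsdMap(44, frozenset({2, 4, 5}))     # osd 1 marked down

history_a = [epoch42, epoch43]           # global state 1
history_b = [epoch42, epoch43, epoch44]  # global state 2

# osd 1 has seen nothing newer than epoch 43 in either case.
view_a = local_view(history_a, 43)
view_b = local_view(history_b, 43)

print("local views identical:", view_a == view_b)
print("shortcut says unchanged (a):", primary_unchanged(view_a, PG_OSDS))
print("shortcut says unchanged (b):", primary_unchanged(view_b, PG_OSDS))
print("true primary (a):", acting_primary(history_a[-1], PG_OSDS))
print("true primary (b):", acting_primary(history_b[-1], PG_OSDS))
```

In this model the only way for osd 1 to tell the two cases apart is to learn about epoch 44 from someone else (for example osd 2), which is the message exchange that peering already performs.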