Re: Reply: Reading data before peered to improve performance

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 2 Apr 2018 11:02:41 -0700

On Mon, Mar 19, 2018 at 5:26 AM, xsong682@xxxxxxxx <xsong682@xxxxxxxx> wrote:
> Thanks for your reply.
>
> I think there are 2 scenarios:
> 1. Peering because a non-primary OSD went down
> For example, PG [1, 3, 5] -> [1, 3, 6]: osd 1 is the primary OSD throughout, so its data should be correct barring data corruption, and we can read data directly from osd 1 before the PG has peered.
> I think we can check "primary of last interval == primary of current interval"; if they are equal, the primary has not changed and we are in this scenario. Calculating the last interval and the current
> interval is much faster than peering, so we can calculate them first and then compare them.

If you try and draw out how to do this detection, I think you'll
realize that identifying if the primary OSD has changed (from only
that primary's local state) is exactly as difficult as peering when
you lose the primary OSD.

That is:
Given a cluster of 5 nodes, a PG "pg" with acting OSDs 1,2,3, and a
starting osdmap epoch 42, how does osd 1 differentiate between the
global states
1) osdmap 43 has been issued marking down osd 3, and
2) osdmap 43 was issued as above and, immediately afterwards, osdmap
44 was issued marking down osd 1?

The only way it can do that is by contacting osd.2 and making sure
that it forms an acting set in which all osds agree "1 and 2 are
acting from epoch 43". And that's exactly what peering does.
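
To make the ambiguity concrete, here is a toy Python model of the two
global states above. It is not Ceph code: the "first up OSD in the
original acting order is primary" rule and the names primary_at and
local_view are simplifying assumptions for illustration only (real
primary selection goes through CRUSH and the OSDMap). It shows that
osd 1's local view is identical in both states even though the actual
primary differs.

```python
# Toy model, not Ceph code: a history maps osdmap epoch -> set of OSDs marked up.
ACTING_ORDER = [1, 2, 3]  # acting set of "pg" at epoch 42, primary first


def primary_at(up_osds):
    """Return the first up OSD in the original acting order (assumed rule)."""
    for osd in ACTING_ORDER:
        if osd in up_osds:
            return osd
    return None


def local_view(history, last_epoch_seen):
    """The maps an OSD has actually received (epochs <= last_epoch_seen)."""
    return {e: up for e, up in history.items() if e <= last_epoch_seen}


# Global state 1: only osdmap 43 exists (osd 3 marked down).
state_1 = {42: {1, 2, 3}, 43: {1, 2}}
# Global state 2: osdmap 44 also exists (osd 1 marked down),
# but osd 1 has not received it.
state_2 = {42: {1, 2, 3}, 43: {1, 2}, 44: {2}}

view_1 = local_view(state_1, last_epoch_seen=43)
view_2 = local_view(state_2, last_epoch_seen=43)

# osd 1 cannot tell the two states apart from what it holds locally.
same_local_view = view_1 == view_2
believed_primary = primary_at(view_1[max(view_1)])

# The cluster-wide truth differs between the two states.
true_primary_1 = primary_at(state_1[max(state_1)])
true_primary_2 = primary_at(state_2[max(state_2)])

print(same_local_view, believed_primary, true_primary_1, true_primary_2)
```

In both states osd 1 believes it is still primary, which is only true
in the first one. Nothing in its local state can resolve that; it has
to hear from osd.2.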

We *have* discussed in the past trying to shortcut through some of the
more expensive parts of peering. Right now, for reliability and
reproducibility, the primary always takes the exact same steps, which
involve resetting all state and rebuilding it via message passing.
But if the primary doesn't change, it could avoid all that (since if
it remained up the whole time, it knows the state of its peers which
remained up the whole time). That's a significantly more complicated
bit of work, though.
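
As a rough sketch of what such a shortcut might decide (purely
hypothetical, not how Ceph is implemented): assuming the primary
already knows, through agreement with its peers, which OSDs stayed up
across the interval change, it would only need to rebuild state for
the peers it does not already know about. The function name
peers_needing_requery and the up_whole_time argument are invented for
this example.

```python
def peers_needing_requery(prev_acting, cur_acting, up_whole_time):
    """Hypothetical helper, not a Ceph API.

    prev_acting / cur_acting: acting sets as lists, primary first.
    up_whole_time: set of OSDs assumed to have stayed up across the change.

    Returns the replicas whose state must be rebuilt via message passing,
    or None when the shortcut does not apply and full peering is needed.
    """
    if not prev_acting or not cur_acting:
        return None
    primary = cur_acting[0]
    # The shortcut only makes sense if the same primary stayed up throughout.
    if prev_acting[0] != primary or primary not in up_whole_time:
        return None
    # Replicas that are new, or that went away and came back, are unknown.
    return [
        osd
        for osd in cur_acting[1:]
        if osd not in prev_acting or osd not in up_whole_time
    ]


# The example from the quoted mail: PG [1, 3, 5] -> [1, 3, 6], osd 1 and 3 stay up.
only_new_replica = peers_needing_requery([1, 3, 5], [1, 3, 6], {1, 3})

# Primary changed ([1, 2, 3] -> [2, 3]): no shortcut, do full peering.
primary_changed = peers_needing_requery([1, 2, 3], [2, 3], {2, 3})

print(only_new_replica, primary_changed)
```

Note that this does not remove the need to confirm with the other
OSDs that the primary really is still primary; it only illustrates
which state could be kept once that is established.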
-Greg

>
> 2. Peering because the primary OSD went down
> If the primary osd goes down, then "primary of last interval != primary of current interval", so we need to wait for peering.
>
> -Xiaobing