On Sat, Dec 10, 2016 at 6:11 AM, zhong-yan.gu <guzy126@xxxxxxx> wrote:
> Hi Jason,
> sorry to bother you. A question about IO consistency in the OSD-down case:
> 1. a write op arrives at primary OSD A
> 2. OSD A does its local write and sends out replica writes to OSDs B and C
> 3. B finishes the write and returns an ACK to A. However, C is down and has
> no chance to send out an ACK.
>
> In this case A will not reply with an ACK to the client. After a while the
> cluster detects that C is down and enters peering. After peering, how will
> the previous write op be processed?

AFAIK, assuming you have a replica size of 3 and a minimum replica size
(min_size) of 2, losing one OSD within the PG's acting set won't be enough
to stall the write operation, assuming it wasn't the primary PG OSD that
went offline. Quorum was available, so both online OSDs were able to log the
transaction to help recover the offline OSD when it becomes available again.
Once the offline OSD comes back, it can replay the log received from its
peers to get back in sync. There is actually a lot of documentation
available on this process [1].

> Does the client still have a chance to receive the ACK?

Yup, the client will receive the ACK as soon as <min_size> OSDs have safely
committed the IO (two short python-rados sketches illustrating this are
appended below).

> The next time the client reads the corresponding data, is it updated or not?

If an IO has been ACKed back to the client, future reads of that extent will
return the committed data (we don't want to go backwards in time).

> For the case that both B and C are down before the ack is replied to A, is
> there any difference?

Assuming you have <min_size> = 2 (the default), your IO would be blocked
until those OSDs come back online or until the mons have detected that those
OSDs are dead and have remapped the affected PGs to new (online) OSDs.

> Is there any case in which Ceph finishes writes silently but sends no ack
> to clients?

Sure, if your client dies before it receives the ACK from the OSDs. However,
your data is still crash consistent.

> Zhongyan

[1] https://github.com/ceph/ceph/blob/master/doc/dev/osd_internals/log_based_pg.rst

--
Jason
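
For anyone who wants to watch the ack/commit distinction from the client
side, here is a minimal sketch using the python-rados bindings. The conffile
path, the pool name ('rbd'), and the object name are assumptions for
illustration only, not anything taken from the thread. As I understand the
FileStore-era bindings, the oncomplete callback fires when the write has
been acknowledged by the acting set, and the onsafe callback fires once it
has been committed to disk, which is the point the thread refers to as the
client receiving its ACK.

    import rados

    # Assumed names for illustration: adjust conffile, pool and object.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    def on_ack(completion):
        # Called when the write has been acknowledged by the acting set.
        print('write acked, rc=%d' % completion.get_return_value())

    def on_commit(completion):
        # Called once the write has been durably committed.
        print('write committed')

    comp = ioctx.aio_write_full('consistency-demo', b'hello world',
                                oncomplete=on_ack, onsafe=on_commit)
    comp.wait_for_safe()  # block until the commit notification arrives

    ioctx.close()
    cluster.shutdown()

Against a healthy pool this should print the ack line followed by the commit
line; if enough OSDs are down to drop the PG below min_size, wait_for_safe()
simply blocks, which matches the behaviour described above.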
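
Similarly, if you want to double-check which size/min_size a pool is
actually running with, the same bindings expose the monitor command
interface. This is only a sketch (again assuming a pool named 'rbd') and is
intended to be equivalent to running 'ceph osd pool get rbd size' and
'ceph osd pool get rbd min_size' from the CLI:

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for the pool's replication settings.
    for var in ('size', 'min_size'):
        cmd = json.dumps({'prefix': 'osd pool get', 'pool': 'rbd',
                          'var': var, 'format': 'json'})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        if ret == 0:
            print(json.loads(outbuf.decode('utf-8')))
        else:
            print('mon_command failed: %s' % outs)

    cluster.shutdown()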