Re: A question about io consistency in osd down case

Shinobu Kinjo <skinjo@xxxxxxxxxx> · Tue, 13 Dec 2016 08:26:16 +0900

On Sat, Dec 10, 2016 at 11:00 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> I should clarify that if the OSD has silently failed (e.g. the TCP
> connection wasn't reset and packets are just silently being dropped /
> not being acked), IO will pause for up to "osd_heartbeat_grace" before

That value ("osd_heartbeat_grace", 20 seconds by default) is how long
an OSD waits for a heartbeat reply from a peer OSD before reporting to
the MONs that the peer is not responding.
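
To make the timing concrete, here is a minimal sketch of the grace-period
check in Python. It is only an illustration, not Ceph's actual code: the
PeerMonitor class and its method names are hypothetical, and the 20 second
value is assumed to match the default "osd_heartbeat_grace".

```python
# Toy model of heartbeat grace handling; not taken from the Ceph source.
HEARTBEAT_GRACE = 20.0  # seconds; assumed default of osd_heartbeat_grace


class PeerMonitor:
    """Hypothetical tracker of when each peer OSD last answered a ping."""

    def __init__(self, grace=HEARTBEAT_GRACE):
        self.grace = grace
        self.last_reply = {}  # osd id -> timestamp of last heartbeat reply

    def record_reply(self, osd_id, now):
        # Called whenever a heartbeat reply arrives from a peer.
        self.last_reply[osd_id] = now

    def peers_to_report(self, now):
        # Peers silent for longer than the grace period would be
        # reported to the MONs as not responding.
        return sorted(
            osd for osd, seen in self.last_reply.items()
            if now - seen > self.grace
        )
```

Until a peer shows up in peers_to_report(), nothing tells the MONs about
it, which is why IO on the affected PGs can pause for up to the grace
period.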

> IO can proceed again.
>
> On Sat, Dec 10, 2016 at 8:46 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>> On Sat, Dec 10, 2016 at 6:11 AM, zhong-yan.gu <guzy126@xxxxxxx> wrote:
>>> Hi Jason,
>>> Sorry to bother you. I have a question about IO consistency when an
>>> OSD goes down:
>>> 1. A write op arrives at primary OSD A.
>>> 2. OSD A does a local write and sends replica writes to OSDs B and C.
>>> 3. B finishes the write and returns an ACK to A. However, C is down and
>>> has no chance to send an ACK.
>>>
>>> In this case A will not send an ACK to the client. After a while the
>>> cluster detects that C is down and enters peering. After peering, how
>>> will the previous write op be processed?
>>
>> AFAIK, assuming you have a replica size of 3 and a minimum replica
>> size of 2, losing one OSD within the PG set won't be enough to stall
>> the write operation, assuming it wasn't the PG's primary OSD that went
>> offline. Quorum was available so both online OSDs were able to log the
>> transaction to help recover the offline OSD when it becomes available
>> again. Once the offline OSD comes back, it can replay the log received
>> from its peers to get back in sync. There is actually lots of
>> available documentation on this process [1].
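
As a rough illustration of the log-based catch-up described above, the
sketch below works out which objects a returning OSD has missed by
comparing the PG log against the last version that OSD had applied. This
is a toy model under stated assumptions: the (version, object) tuples and
the objects_to_recover() helper are hypothetical and do not reflect
Ceph's real log format or API.

```python
def objects_to_recover(pg_log, last_version_on_replica):
    """Return {object_name: newest_version} for log entries the replica missed.

    pg_log is assumed to be a list of (version, object_name) tuples in
    increasing version order, as kept by the OSDs that stayed online.
    """
    missing = {}
    for version, obj in pg_log:
        if version > last_version_on_replica:
            # Keep only the newest version seen for each object.
            missing[obj] = max(version, missing.get(obj, version))
    return missing


# Example: the replica went offline after applying version 2.
log = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
print(objects_to_recover(log, 2))
```

An OSD that is already at the head of the log gets an empty result and
needs no recovery.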
>>
>>> Does the client still have a chance to receive the ACK?
>>
>> Yup, the client will receive the ACK as soon as <min_size> OSDs have
>> safely committed the IO.
>>
>>> The next time the client reads the corresponding data, is it updated or not?
>>
>> If an IO has been ACKed back to the client, future reads to that
>> extent will return the committed data (we don't want to go backwards
>> in time).
>>
>>> For the case where both B and C go down before sending an ACK to A, is
>>> there any difference?
>>
>> Assuming you have <min_size> = 2 (the default), your IO would be
>> blocked until those OSDs come back online or until the mons have
>> detected that those OSDs are dead and have remapped the affected PGs to new
>> (online) OSDs.
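
The two failure cases in this thread (one replica down versus two) can be
put side by side with a small sketch. It is a simplification that ignores
peering and remapping, and the pg_accepts_io() helper is hypothetical
rather than anything in Ceph itself; it only models the rule that a PG
serves client IO while at least min_size of its OSDs are up.

```python
def pg_accepts_io(acting_osds_up, min_size=2):
    """True if enough OSDs in the PG's acting set are up to serve client IO.

    Assumes a replicated pool with size 3 and min_size 2, as in the thread.
    """
    return len(acting_osds_up) >= min_size


# C is down: A and B remain, so the write can still complete.
print(pg_accepts_io({"A", "B"}))

# B and C are both down: only A remains, so IO blocks until OSDs
# return or the PG is remapped to other online OSDs.
print(pg_accepts_io({"A"}))
```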
>>
>>>
>>> Is there any case in which Ceph completes a write silently without
>>> sending an ACK to the client?
>>
>> Sure, if your client dies before it receives the ACK from the OSDs.
>> However, your data is still crash consistent.
>>
>>> Zhongyan
>>>
>>
>> [1] https://github.com/ceph/ceph/blob/master/doc/dev/osd_internals/log_based_pg.rst
>>
>> --
>> Jason
>
>
>
> --
> Jason
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com