On Sun, Dec 11, 2016 at 10:48 PM, zhong-yan.gu <guzy126@xxxxxxx> wrote:
> Hi Jason,
> After reviewing the code, my understanding is that:
> case 1: Primary OSD A, replica OSD B, replica OSD C, min_size is 2;
> during IO, C goes down.
> 1. A and B journal writes are done; C failed and is down.
> 2. IO waits for osd_heartbeat_grace.
> 3. The monitor detects C's failure; peering starts.
> 4. During peering, start_peering_interval() is called; in
> ReplicatedBackend::on_change(), in_progress_ops is erased and
> on_commit() is deleted, so the client will never receive an ACK unless
> it retries this IO. If it retries this IO, do_op() will find the IO in
> the PG log and return it as already_complete.
> Is this right? Does the client-side librbd/librados have an IO retry
> mechanism?

While I cannot comment on your steps (OSD internals is not my area of
expertise), the answer to your question is that yes, the clients will
resend ops on a PG map update (and other conditions).

> case 2: Primary OSD A, replica OSD B, replica OSD C, min_size is 2;
> during IO, B and C go down.
> The same thing happens: the client still does not receive an ACK.
> The difference is that client IO will be blocked because the PG is
> below min_size. After peering and recovery, because the IO was applied
> successfully on A before, this IO will also be recovered to A's new
> peers. Is this right?

Again, I am not an OSD guy, so take my answer with a grain of salt.
Just to clarify, OSD A should not have updated its object until its
peers acked the op. You should be able to run a small three-OSD test
cluster to see what the OSDs are doing in each case when you down one
or two OSDs.

> Zhongyan
>
> At 2016-12-10 22:00:20, "Jason Dillaman" <jdillama@xxxxxxxxxx> wrote:
>> I should clarify that if the OSD has silently failed (e.g. the TCP
>> connection wasn't reset and packets are just silently being dropped /
>> not being acked), IO will pause for up to "osd_heartbeat_grace" before
>> IO can proceed again.
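The retry path described in case 1 can be sketched as a toy model (this is NOT Ceph code; the class, reqid tuple, and version numbers are all illustrative assumptions): an interval change drops the in-flight ack callback, and the resent op with the same reqid is answered from the PG log as already_complete rather than re-executed.

```python
# Toy model of the case-1 retry path.  PG, reqid, and versions are
# illustrative stand-ins, not actual Ceph structures.

class PG:
    def __init__(self):
        self.log = {}            # reqid -> version: stands in for the PG log
        self.in_progress = {}    # reqid -> ack callback (in_progress_ops)

    def submit(self, reqid, on_commit):
        if reqid in self.log:
            # do_op() finds the reqid in the log: already_complete,
            # so reply immediately without re-executing the write
            on_commit(self.log[reqid])
            return
        # journal writes land on the primary and surviving replica, so
        # the op is logged, but the ACK waits on the commit callback
        self.log[reqid] = len(self.log) + 1
        self.in_progress[reqid] = on_commit

    def on_change(self):
        # start_peering_interval(): in-flight ack callbacks are dropped;
        # the client must resend (librados does so on the new map)
        self.in_progress.clear()

acks = []
pg = PG()
pg.submit(("client.1", 42), acks.append)  # first attempt; C dies, no commit
pg.on_change()                            # peering: callback discarded
assert acks == []                         # client never saw the first ACK
pg.submit(("client.1", 42), acks.append)  # client resend, same reqid
print(acks)                               # [1]: answered from the PG log
```

The point of the sketch is the dedup key: because the resend carries the same reqid, the op is acknowledged from the log instead of being applied twice.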
>>
>> On Sat, Dec 10, 2016 at 8:46 AM, Jason Dillaman <jdillama@xxxxxxxxxx>
>> wrote:
>>> On Sat, Dec 10, 2016 at 6:11 AM, zhong-yan.gu <guzy126@xxxxxxx> wrote:
>>>> Hi Jason,
>>>> sorry to bother you. A question about IO consistency in an OSD down
>>>> case:
>>>> 1. a write op arrives at primary OSD A
>>>> 2. OSD A does its local write and sends out replica writes to OSDs B
>>>> and C
>>>> 3. B finishes the write and returns an ACK to A. However, C is down
>>>> and has no chance to send out an ACK.
>>>>
>>>> In this case A will not reply with an ACK to the client. After a
>>>> while the cluster detects C is down and enters peering. After
>>>> peering, how will the previous write op be processed?
>>>
>>> AFAIK, assuming you have a replica size of 3 and a minimum replica
>>> size of 2, losing one OSD within the PG set won't be enough to stall
>>> the write operation, assuming it wasn't the primary PG OSD that went
>>> offline. Quorum was available, so both online OSDs were able to log
>>> the transaction to help recover the offline OSD when it becomes
>>> available again. Once the offline OSD comes back, it can replay the
>>> log received from its peers to get back in sync. There is actually
>>> lots of available documentation on this process [1].
>>>
>>>> Does the client still have a chance to receive the ACK?
>>>
>>> Yup, the client will receive the ACK as soon as <min size> PGs have
>>> safely committed the IO.
>>>
>>>> The next time the client reads the corresponding data, is it updated
>>>> or not?
>>>
>>> If an IO has been ACKed back to the client, future reads of that
>>> extent will return the committed data (we don't want to go backwards
>>> in time).
>>>
>>>> For the case that both B and C are down before the ACK is replied to
>>>> A, is there any difference?
>>>
>>> Assuming you have <min_size> = 2 (the default), your IO would be
>>> blocked until those OSDs come back online or until the mons have
>>> detected those OSDs are dead and have remapped the affected PGs to
>>> new (online) OSDs.
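The min_size gating discussed above can be condensed into a tiny predicate (an assumption-laden toy, not Ceph code): with size=3 and min_size=2 the PG keeps serving IO with one OSD down, but blocks IO with two down until recovery or remapping restores the acting set.

```python
# Toy sketch of min_size gating; osd names and the default of 2 follow
# the thread above, everything else is illustrative.

def pg_accepts_io(acting, min_size=2):
    """A PG blocks IO when its acting set shrinks below min_size; it
    resumes once recovery or remapping brings the set back up."""
    return len(acting) >= min_size

assert pg_accepts_io(["osd.A", "osd.B", "osd.C"])   # healthy: IO flows
assert pg_accepts_io(["osd.A", "osd.B"])            # case 1: C down, active
assert not pg_accepts_io(["osd.A"])                 # case 2: B+C down, blocked
print("min_size gating matches the two cases")
```

This is why the two cases in the thread diverge: one failure leaves the PG at min_size and writable, while two failures drop it below min_size and stall client IO until the mons remap or the OSDs return.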
>>>
>>>> Is there any case in which Ceph finishes a write silently but sends
>>>> no ACK to the client?
>>>
>>> Sure, if your client dies before it receives the ACK from the OSDs.
>>> However, your data is still crash consistent.
>>>
>>>> Zhongyan
>>>
>>> [1] https://github.com/ceph/ceph/blob/master/doc/dev/osd_internals/log_based_pg.rst
>>>
>>> --
>>> Jason
>>
>> --
>> Jason

--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
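The crash-consistency answer near the end of the thread can be illustrated with one more toy (not Ceph code; the object name and payloads are made up): a write can be fully committed cluster-side even though the client died before the ACK arrived, so a later read sees the new data.

```python
# Toy illustration of "write committed, ACK lost": the cluster state is
# still consistent with the write having happened.

store = {"rbd_data.1": b"v1"}   # stands in for the committed object store
acks_seen = []

def replicated_write(obj, data, on_ack, client_alive=True):
    store[obj] = data           # all replicas commit the write...
    if client_alive:            # ...but the ACK only reaches a live client
        on_ack(obj)

# Client crashes mid-flight: no ACK is ever observed...
replicated_write("rbd_data.1", b"v2", acks_seen.append, client_alive=False)
assert acks_seen == []
# ...yet a read after the client restarts sees the committed data
print(store["rbd_data.1"])      # b'v2'
```

In other words, "silent" here only means the acknowledgement was lost, not the durability: the cluster never rolls an acknowledged-or-committed write backwards.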