On Sun, Dec 11, 2016 at 10:48 PM, zhong-yan.gu <guzy126@xxxxxxx> wrote:
> Hi Jason,
> After reviewing the code, my understanding is that:
> case 1: Primary OSD A, replica OSD B, replica OSD C, min_size is 2;
> during IO, C goes down.
> 1. A and B journal writes are done; C failed and is down.
> 2. IO waits for osd_heartbeat_grace.
> 3. The monitor detects C's failure; peering starts.
> 4. During peering, start_peering_interval() is called; in
> ReplicatedBackend::on_change(), in_progress_ops is erased and
> on_commit() is deleted, so the client will never receive an ACK unless
> it retries this IO. If it retries this IO, do_op() will find the IO in
> the PG log and return it as already_complete.
> Is this right? Does the client-side librbd/librados have an IO retry
> mechanism?

While I cannot comment on your steps (OSD internals is not my area of
expertise), the answer to your question is that yes, the clients will
resend ops on a PG map update (and other conditions).

> case 2: Primary OSD A, replica OSD B, replica OSD C, min_size is 2;
> during IO, B and C go down.
> The same thing happens: the client still does not receive an ACK.
> The difference is that client IO will be blocked because the PG is
> below min_size. After peering and recovery, because the IO was applied
> successfully on A before, this IO will also be recovered to A's new
> peers. Is this right?

Again, I am not an OSD guy, so take my answer with a grain of salt.
Just to clarify, OSD A should not have updated its object until its
peers acked the op. You should be able to run a small three-OSD test
cluster to see what the OSDs are doing in each case when you down one
or two OSDs.

> Zhongyan
>
> At 2016-12-10 22:00:20, "Jason Dillaman" <jdillama@xxxxxxxxxx> wrote:
>> I should clarify that if the OSD has silently failed (e.g. the TCP
>> connection wasn't reset and packets are just silently being dropped /
>> not being acked), IO will pause for up to "osd_heartbeat_grace" before
>> IO can proceed again.
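The retry path described in case 1 can be sketched as a toy model (this is NOT Ceph code; the class, reqid tuple, and version numbers are all illustrative assumptions): an interval change drops the in-flight ack callback, and the resent op with the same reqid is answered from the PG log as already_complete rather than re-executed.

```python
# Toy model of the case-1 retry path.  PG, reqid, and versions are
# illustrative stand-ins, not actual Ceph structures.

class PG:
    def __init__(self):
        self.log = {}            # reqid -> version: stands in for the PG log
        self.in_progress = {}    # reqid -> ack callback (in_progress_ops)

    def submit(self, reqid, on_commit):
        if reqid in self.log:
            # do_op() finds the reqid in the log: already_complete,
            # so reply immediately without re-executing the write
            on_commit(self.log[reqid])
            return
        # journal writes land on the primary and surviving replica, so
        # the op is logged, but the ACK waits on the commit callback
        self.log[reqid] = len(self.log) + 1
        self.in_progress[reqid] = on_commit

    def on_change(self):
        # start_peering_interval(): in-flight ack callbacks are dropped;
        # the client must resend (librados does so on the new map)
        self.in_progress.clear()

acks = []
pg = PG()
pg.submit(("client.1", 42), acks.append)  # first attempt; C dies, no commit
pg.on_change()                            # peering: callback discarded
assert acks == []                         # client never saw the first ACK
pg.submit(("client.1", 42), acks.append)  # client resend, same reqid
print(acks)                               # [1]: answered from the PG log
```

The point of the sketch is the dedup key: because the resend carries the same reqid, the op is acknowledged from the log instead of being applied twice.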
>>
>> On Sat, Dec 10, 2016 at 8:46 AM, Jason Dillaman <jdillama@xxxxxxxxxx>
>> wrote:
>>> On Sat, Dec 10, 2016 at 6:11 AM, zhong-yan.gu <guzy126@xxxxxxx> wrote:
>>>> Hi Jason,
>>>> sorry to bother you. A question about IO consistency in an OSD down
>>>> case:
>>>> 1. a write op arrives at primary OSD A
>>>> 2. OSD A does its local write and sends out replica writes to OSDs B
>>>> and C
>>>> 3. B finishes the write and returns an ACK to A. However, C is down
>>>> and has no chance to send out an ACK.
>>>>
>>>> In this case A will not reply with an ACK to the client. After a
>>>> while the cluster detects C is down and enters peering. After
>>>> peering, how will the previous write op be processed?
>>>
>>> AFAIK, assuming you have a replica size of 3 and a minimum replica
>>> size of 2, losing one OSD within the PG set won't be enough to stall
>>> the write operation, assuming it wasn't the primary PG OSD that went
>>> offline. Quorum was available, so both online OSDs were able to log
>>> the transaction to help recover the offline OSD when it becomes
>>> available again. Once the offline OSD comes back, it can replay the
>>> log received from its peers to get back in sync. There is actually
>>> lots of available documentation on this process [1].
>>>
>>>> Does the client still have a chance to receive the ACK?
>>>
>>> Yup, the client will receive the ACK as soon as <min size> PGs have
>>> safely committed the IO.
>>>
>>>> The next time the client reads the corresponding data, is it updated
>>>> or not?
>>>
>>> If an IO has been ACKed back to the client, future reads of that
>>> extent will return the committed data (we don't want to go backwards
>>> in time).
>>>
>>>> For the case that both B and C are down before the ACK is replied to
>>>> A, is there any difference?
>>>
>>> Assuming you have <min_size> = 2 (the default), your IO would be
>>> blocked until those OSDs come back online or until the mons have
>>> detected those OSDs are dead and have remapped the affected PGs to
>>> new (online) OSDs.
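The min_size gating discussed above can be condensed into a tiny predicate (an assumption-laden toy, not Ceph code): with size=3 and min_size=2 the PG keeps serving IO with one OSD down, but blocks IO with two down until recovery or remapping restores the acting set.

```python
# Toy sketch of min_size gating; osd names and the default of 2 follow
# the thread above, everything else is illustrative.

def pg_accepts_io(acting, min_size=2):
    """A PG blocks IO when its acting set shrinks below min_size; it
    resumes once recovery or remapping brings the set back up."""
    return len(acting) >= min_size

assert pg_accepts_io(["osd.A", "osd.B", "osd.C"])   # healthy: IO flows
assert pg_accepts_io(["osd.A", "osd.B"])            # case 1: C down, active
assert not pg_accepts_io(["osd.A"])                 # case 2: B+C down, blocked
print("min_size gating matches the two cases")
```

This is why the two cases in the thread diverge: one failure leaves the PG at min_size and writable, while two failures drop it below min_size and stall client IO until the mons remap or the OSDs return.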
>>>
>>>> Is there any case in which Ceph finishes a write silently but sends
>>>> no ACK to the client?
>>>
>>> Sure, if your client dies before it receives the ACK from the OSDs.
>>> However, your data is still crash consistent.
>>>
>>>> Zhongyan
>>>
>>> [1] https://github.com/ceph/ceph/blob/master/doc/dev/osd_internals/log_based_pg.rst
>>>
>>> --
>>> Jason
>>
>> --
>> Jason

--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
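The crash-consistency answer near the end of the thread can be illustrated with one more toy (not Ceph code; the object name and payloads are made up): a write can be fully committed cluster-side even though the client died before the ACK arrived, so a later read sees the new data.

```python
# Toy illustration of "write committed, ACK lost": the cluster state is
# still consistent with the write having happened.

store = {"rbd_data.1": b"v1"}   # stands in for the committed object store
acks_seen = []

def replicated_write(obj, data, on_ack, client_alive=True):
    store[obj] = data           # all replicas commit the write...
    if client_alive:            # ...but the ACK only reaches a live client
        on_ack(obj)

# Client crashes mid-flight: no ACK is ever observed...
replicated_write("rbd_data.1", b"v2", acks_seen.append, client_alive=False)
assert acks_seen == []
# ...yet a read after the client restarts sees the committed data
print(store["rbd_data.1"])      # b'v2'
```

In other words, "silent" here only means the acknowledgement was lost, not the durability: the cluster never rolls an acknowledged-or-committed write backwards.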