Re: Understanding Ceph

Bill Hastings <bllhastings@xxxxxxxxx> · Sun, 18 Dec 2011 09:37:41 -0800

These are perhaps very inane questions, but I am trying to wrap my head
around this whole thing. So basically the primary OSD handling a
particular PG makes sure that the writes happen at all replicas. I
am assuming the client would time out if it doesn't get an ack/commit
within some time period. Is this true? If so, the write is deemed a
failure, but a subsequent read could return the value of the failed
write, because there don't seem to be any rollback semantics. Is my
understanding correct?

Are these replicas fixed for a PG? If the replicas for a PG are A, B
and C, and C happens to be down during the write, will the write be
deemed a failure until C is back up, or is another replica chosen in
place of C?

On Sun, Dec 18, 2011 at 9:17 AM, Yehuda Sadeh Weinraub
<yehuda.sadeh@xxxxxxxxxxxxx> wrote:
> On Sun, Dec 18, 2011 at 8:43 AM, Bill Hastings <bllhastings@xxxxxxxxx> wrote:
>> Thanks for the response. What if a write of 16 bytes was successful at
>> nodes A and B and failed at C, perhaps because C was momentarily
>> unreachable over the network? How is the Ceph client prevented from
>> performing the next read at C? Also what if the writes to OSDs were
>> successful but
>
> In that case the client wouldn't have gotten a successful response in
> the first place. The client sends the writes to the primary osd
> handling that pg, and will get the following responses from it:
>  - ack message when the request is in the page/buffer cache on all replicas
>  - commit message when the request is on stable storage on all replicas
>
> (depending on setup, in some cases it'll just get a commit message
> which implies ack anyway)
>
> The primary osd is responsible for making sure the data was written to
> all replicas, and the client won't get the commit response until then.
> For rbd, clients wait for the commit message as the acknowledgment of
> write completion.
>
>> the metadata update fails? How is this managed if at all? How are
>
> What kind of metadata are you referring to? For rbd there is no metadata update.
>
>> writes that straddle chunk boundaries handled from a transactional
>> perspective? I am just in the process of investigation so please
>> forgive me if the questions are very naive.
>
> That depends on which client we're talking about. The short answer is
> that the client will only get a response after all chunks were written
> and acknowledged.
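
As a rough illustration of a write that straddles a chunk boundary, the
Python sketch below splits a byte range into per-chunk pieces and treats
the write as complete only once every piece has been acknowledged. The
4 MiB chunk size and the split_write() and write_succeeded() helpers are
assumptions chosen for the example, not taken from any Ceph client.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size, chosen only for the example


def split_write(offset, length, chunk_size=CHUNK_SIZE):
    """Split a byte range into (chunk_index, offset_in_chunk, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // chunk_size
        within = offset % chunk_size
        # Take bytes up to the end of this chunk or the end of the write.
        n = min(chunk_size - within, end - offset)
        pieces.append((index, within, n))
        offset += n
    return pieces


def write_succeeded(pieces, acked_chunks):
    """The client sees a response only when every piece has been acked."""
    return all(index in acked_chunks for index, _, _ in pieces)
```

A 16-byte write starting 4 bytes before the end of chunk 0 splits into a
4-byte piece in chunk 0 and a 12-byte piece in chunk 1; write_succeeded()
stays False until both chunk indexes are in the acked set.
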
>
> However, there are currently two different implementations: one is in
> the linux kernel and the other is based on librbd. In the linux
> kernel, writes are acknowledged in byte order of the request. That
> is, the second chunk is acked only after the first one was acked, even
> if the osds responded in a different order.
> In librbd there's an additional complication, since we can cache the
> requests and respond with an early ack. Ignoring that, the client will
> only get a response after all chunks were applied and acked.
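
The in-order acking described for the kernel client can be sketched as
follows. This is a simplified Python model, not the kernel
implementation: the InOrderAcker class and its method names are
hypothetical. It records osd replies as they arrive and releases acks
strictly in chunk order.

```python
class InOrderAcker:
    """Hypothetical helper: ack chunks in request order, not arrival order."""

    def __init__(self, num_chunks):
        self.done = [False] * num_chunks  # which chunks an osd has answered
        self.next_to_ack = 0              # lowest chunk index not yet acked
        self.acked = []                   # chunk indexes in the order acked

    def on_osd_reply(self, chunk_index):
        """Record a reply and return the chunks that can be acked now."""
        self.done[chunk_index] = True
        newly = []
        # Release acks only for the contiguous prefix of finished chunks.
        while self.next_to_ack < len(self.done) and self.done[self.next_to_ack]:
            newly.append(self.next_to_ack)
            self.next_to_ack += 1
        self.acked.extend(newly)
        return newly

    def request_complete(self):
        """True once every chunk of the request has been acked."""
        return self.next_to_ack == len(self.done)
```

For a three-chunk request whose replies arrive in the order 2, 0, 1, the
first reply releases nothing, the second releases chunk 0, and the third
releases chunks 1 and 2, after which request_complete() returns True.
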
>
>
> Yehuda