Re: question on rbd mirror feature

Jason Dillaman <dillaman@xxxxxxxxxx> · Wed, 2 Mar 2016 22:02:08 -0500 (EST)

> Hi Jason,

> I’m a software engineer at Intel. We’re trying to do some tests with the new
> rbd mirror feature and have some basic questions:
> 1. Each rbd_write will be appended to the journal and then ACKed to clients. The
> rbd journal will flush the contents back to rados according to some policy.
> During the flush, will the rbd journal read the data back out of the journal
> objects and then do the flush?

Correct, a write will first append the event to the journal before proceeding with the actual write to the image.  If caching is enabled, the write is immediately applied to the in-memory cache but any writeback of the affected extent will be paused until the event and its predecessors are safe in the journal.  Once the journal event append is acked by rados, the actual write to the RBD objects can proceed (i.e. unpause any writeback if the cache is enabled or actually perform the write if cache is disabled).  The default configuration will append the events to the journal without batching.  There are configurable options to batch the journal append for X seconds, or Y bytes, or Z events -- depending on how much data you are comfortable losing in the event of a crash.  Note that any flush request to librbd will flush out any batched journal events.  There is an open tracker ticket [1] to optionally throttle librbd flush requests if, again, a user doesn't mind losing X seconds of data even in the presence of flush requests.
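
To make the ordering concrete, here is a minimal Python sketch of the append-then-write sequence with optional batching. It is a toy model, not librbd code: ToyJournal, journaled_write, and the max_events / max_bytes / max_age knobs are hypothetical names standing in for the Z events / Y bytes / X seconds limits, and the RADOS ack is simulated as instantaneous.

```python
class ToyJournal:
    """Toy model of batched journal appends (not librbd code)."""

    def __init__(self, max_events=1, max_bytes=0, max_age=0.0):
        # Hypothetical limits; 0 disables a limit. max_events=1 means
        # every event is appended on its own, i.e. no batching.
        self.max_events = max_events
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.pending = []        # (event_id, size, on_safe) not yet appended
        self.pending_bytes = 0
        self.oldest = None       # timestamp of the oldest pending event
        self.safe = []           # event ids acked by the simulated store

    def append(self, event_id, size, on_safe, now):
        self.pending.append((event_id, size, on_safe))
        self.pending_bytes += size
        if self.oldest is None:
            self.oldest = now
        # A real implementation would also use a timer; this sketch only
        # checks the age limit when another event arrives.
        if ((self.max_events and len(self.pending) >= self.max_events)
                or (self.max_bytes and self.pending_bytes >= self.max_bytes)
                or (self.max_age and now - self.oldest >= self.max_age)):
            self.flush()

    def flush(self):
        """Append everything pending, then release the waiting writes."""
        batch, self.pending = self.pending, []
        self.pending_bytes, self.oldest = 0, None
        for event_id, _size, on_safe in batch:
            self.safe.append(event_id)   # simulated ack from the object store
            on_safe()                    # the image write may now proceed


def journaled_write(journal, image, event_id, offset, data, now=0.0):
    """Journal first; touch the image only once the event is safe."""
    journal.append(event_id, len(data),
                   lambda: image.__setitem__(offset, data), now)
```

With the defaults, each journaled_write reaches the image dict immediately after its own append. With max_events=3, the first two writes stay pending until a third arrives or flush() is called, which mirrors how a flush request pushes out any batched events.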

> 2. If an rbd_read is accessing contents that just got written (in the rbd
> journal but not flushed back), will it be serviced from the rbd journal?

If the cache is enabled, we service the read from the in-memory cache (since I didn't want to develop what would essentially be another cache layer).  If the cache is disabled, we won't ack the librbd write request until the RBD image has been updated.
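
The two read paths can be sketched as follows. This is a simplified toy model rather than librbd itself: ToyImage and its fields are hypothetical, the journal append is treated as immediately safe, and extents are reduced to whole-offset dictionary keys.

```python
class ToyImage:
    """Toy model of read-after-write with and without the cache."""

    def __init__(self, cache_enabled):
        self.cache_enabled = cache_enabled
        self.journal = []   # appended events, assumed safe right away
        self.cache = {}     # offset -> data held in memory
        self.backing = {}   # offset -> data, stands in for the RBD objects

    def write(self, offset, data):
        self.journal.append((offset, data))   # journal append comes first
        if self.cache_enabled:
            # Data lands in the cache; writeback to the image happens later.
            self.cache[offset] = data
        else:
            # No cache: the image is updated before the write is acked.
            self.backing[offset] = data
        return True                           # models the ack to the caller

    def read(self, offset):
        # Reads never consult the journal: either the cache has the data,
        # or the image was already updated before the write was acked.
        if self.cache_enabled and offset in self.cache:
            return self.cache[offset]
        return self.backing.get(offset)
```

In both configurations a read issued after an acked write returns the new data, and neither path needs to read from the journal objects.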

> 3. Does the rbd journal feature work with the existing rbd cache? If yes, is
> the rbd journal layered underneath the rbd cache?

Yes, the tentacles of journaling reach into the cache for tracking which extents are associated with which journal event.  When librbd adds an extent to the cache, it provides the associated journal event identifier.  The cache layer will provide this identifier back to librbd when requesting writeback so that librbd can "pause" the request until the journal event is safe (if needed).  The cache also tracks when extents are overwritten by future write requests and again informs librbd that it should never expect to receive a writeback for a particular journal event extent.  One possible future optimization, when journal event batching is enabled, is for a complete overwrite of an event to remove the batched journal event from the pending append-to-journal list.
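
The bookkeeping between the cache and the journal can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the actual cache code: ToyCache and its callbacks are hypothetical, and only whole-extent overwrites at the same offset are modelled, whereas a real cache also has to handle partial overlaps.

```python
class ToyCache:
    """Toy model of a cache that tags dirty extents with journal event ids."""

    def __init__(self, on_overwritten):
        self.dirty = {}                       # offset -> (data, event_id)
        self.on_overwritten = on_overwritten  # hypothetical callback to librbd

    def add(self, offset, data, event_id):
        old = self.dirty.get(offset)
        if old is not None:
            # The older extent was replaced before writeback, so no
            # writeback will ever be issued for its journal event.
            self.on_overwritten(old[1])
        self.dirty[offset] = (data, event_id)

    def writeback(self, is_safe, write_to_image):
        """Write back extents whose event is safe; keep the rest paused."""
        paused = {}
        for offset, (data, event_id) in self.dirty.items():
            if is_safe(event_id):
                write_to_image(offset, data)
            else:
                paused[offset] = (data, event_id)
        self.dirty = paused
```

For example, after add(0, b"a", 1) followed by add(0, b"c", 3), the callback is invoked with event 1, and the extent at offset 0 is written back only once is_safe(3) returns true.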

> Thanks for implementing this amazing feature!

> -yuan

[1] http://tracker.ceph.com/issues/13983

-- 

Jason Dillaman 