On Thu, Jun 4, 2015 at 5:36 PM, Jason Dillaman <dillaman@xxxxxxxxxx> wrote:
>> >> ...Actually, doesn't *not* forcing a coordinated move from one object
>> >> set to another mean that you don't actually have an ordering guarantee
>> >> across tags if you replay the journal objects in order?
>> >
>> > The ordering between tags was meant to be a soft ordering guarantee (since
>> > any number of delays could throw off the actual order as delivered from
>> > the OS). In the case of a VM using multiple RBD images sharing the same
>> > journal, this provides an ordering guarantee per device but not between
>> > devices.
>> >
>> > This is no worse than the case of each RBD image using its own journal
>> > instead of sharing a journal, and the behavior doesn't seem too different
>> > from a non-RBD case when submitting requests to two different physical
>> > devices (e.g. an SSD device and a NAS device will commit data at different
>> > latencies).
>>
>> Yes, it's exactly the same. But I thought the point was that if you
>> commingle the journals then you actually have the appropriate ordering
>> across clients/disks (if there's enough ordering and synchronization)
>> that you can stream the journal off-site and know that if there's any
>> kind of disaster you are always at least crash-consistent. If there's
>> arbitrary re-ordering of different volume writes at object boundaries
>> then I don't see what benefit there is to having a commingled journal
>> at all.
>>
>> I think there's a thing called a "consistency group" in various
>> storage platforms that is sort of similar to this, where you can take
>> a snapshot of a related group of volumes at once. I presume the
>> commingled journal is an attempt at basically having an ongoing
>> snapshot of the whole consistency group.
>
> Seems like even with a SAN-type consistency group, you could still have
> temporal ordering issues between volume writes unless it synchronized with
> the client OSes to flush out all volumes at a consistent point so that the
> snapshot could take place.
>
> I suppose you could provide much tighter QEMU inter-volume ordering
> guarantees if you modified the RBD block device so that each individual
> RBD image instance was provided a mechanism to coordinate the allocation
> of the sequence number between the images. Right now, each image is opened
> in its own context w/ no knowledge of one another and no way to coordinate.
> The current proposed tag + sequence number approach could be used to
> provide the soft inter-volume ordering guarantees until QEMU / librbd
> could be modified to support volume groupings.

I must not be making myself clear. Tell me if this scenario is possible:

* Client A writes to file foo many times and it is journaled to object set 1.
* Client B writes to file bar many times and it starts journaling to object
  set 1, but hits the end and moves on to object set 2.
* Client A hits a synchronization point in its higher-level logic.
* Client A fsyncs file foo to object set 1 and then sends data to Client B.
* Client B hits the synchronization point, fsyncs file bar to object set 2,
  and sends data back to Client A.
* Client A fsyncs the receipt of its data stream to object set 1, and only
  then moves on to object set 2.
* The journal copier runs and migrates object set 1 to a remote data center,
  then the data center explodes.
* In the remote data center they fail over, and client A thinks it has
  reached a synchronization point and gotten an acknowledgement that
  client B has never heard of.

Does it make sense why that's a problem?
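To make the hazard concrete, here is a minimal Python sketch of the scenario
above. It is purely illustrative: the journal() helper, the entry names, and
the object-set layout are assumptions made for the example, not the librbd
journaling API.

    # Two clients share one journal but roll over to new object sets
    # independently; the copier replicates whole object sets in order.
    from itertools import count

    seq = count(1)                  # global commit order as issued by the OS
    object_sets = {1: [], 2: []}    # journal object sets

    def journal(client, entry, object_set):
        object_sets[object_set].append((next(seq), client, entry))

    journal("A", "write foo",              object_set=1)
    journal("B", "write bar",              object_set=2)  # B already rolled over
    journal("A", "fsync foo, send to B",   object_set=1)
    journal("B", "fsync bar, reply to A",  object_set=2)
    journal("A", "fsync receipt of reply", object_set=1)

    # Disaster: only object set 1 reached the remote data center.
    surviving = [entry for _, _, entry in sorted(object_sets[1])]
    print(surviving)
    # ['write foo', 'fsync foo, send to B', 'fsync receipt of reply']

Replaying only the surviving object set yields A's record of receiving B's
reply while B's reply itself (in object set 2) is lost, so the replica is not
crash-consistent across the two volumes.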
I don't think handling it is overly complicated, and it's kind of important.

-Greg