> >> ...Actually, doesn't *not* forcing a coordinated move from one object
> >> set to another mean that you don't actually have an ordering guarantee
> >> across tags if you replay the journal objects in order?
> >
> > The ordering between tags was meant to be a soft ordering guarantee
> > (since any number of delays could throw off the actual order as
> > delivered from the OS). In the case of a VM using multiple RBD images
> > sharing the same journal, this provides an ordering guarantee per
> > device but not between devices.
> >
> > This is no worse than the case of each RBD image using its own journal
> > instead of sharing a journal, and the behavior doesn't seem too
> > different from a non-RBD case when submitting requests to two different
> > physical devices (e.g. an SSD device and a NAS device will commit data
> > at different latencies).
>
> Yes, it's exactly the same. But I thought the point was that if you
> commingle the journals then you actually have the appropriate ordering
> across clients/disks (if there's enough ordering and synchronization)
> that you can stream the journal off-site and know that if there's any
> kind of disaster you are always at least crash-consistent. If there's
> arbitrary re-ordering of different volume writes at object boundaries
> then I don't see what benefit there is to having a commingled journal
> at all.
>
> I think there's a thing called a "consistency group" in various
> storage platforms that is sort of similar to this, where you can take
> a snapshot of a related group of volumes at once. I presume the
> commingled journal is an attempt at basically having an ongoing
> snapshot of the whole consistency group.

Seems like even with a SAN-type consistency group, you could still have
temporal ordering issues between volume writes unless it synchronized
with the client OSes to flush out all volumes at a consistent point so
that the snapshot could take place.

I suppose you could provide much tighter QEMU inter-volume ordering
guarantees if you modified the RBD block device so that each individual
RBD image instance was given a mechanism to coordinate the allocation of
sequence numbers between the images. Right now, each image is opened in
its own context, with no knowledge of the others and no way to
coordinate.

The current proposed tag + sequence number approach could be used to
provide soft inter-volume ordering guarantees until QEMU / librbd can be
modified to support volume groupings.
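
To make the soft-ordering semantics concrete, here's a rough sketch of
what replay looks like under the proposed scheme (illustrative C++ only,
not the actual librbd journal format or API):

// Illustrative only -- not the actual librbd journal entry layout.
// Each entry carries a tag (one per image sharing the journal) and a
// sequence number allocated independently per tag.
#include <cstdint>
#include <map>
#include <vector>

struct JournalEntry {
  uint64_t tag_id;               // identifies the image that appended it
  uint64_t seq;                  // monotonic per tag, private to that image
  std::vector<uint8_t> payload;  // write data/metadata
};

// Replaying in journal-commit order gives a hard guarantee within each
// tag (seqs must appear in order), but nothing relates the seqs of
// different tags, so inter-volume ordering is only "soft".
void replay(const std::vector<JournalEntry>& commit_order) {
  std::map<uint64_t, uint64_t> expected;  // tag_id -> next expected seq
  for (const auto& entry : commit_order) {
    auto it = expected.find(entry.tag_id);
    uint64_t want = (it == expected.end()) ? 0 : it->second;
    if (entry.seq != want) {
      // a gap within a single tag would be a real replay error;
      // interleaving across tags is expected and tolerated
    }
    expected[entry.tag_id] = entry.seq + 1;
    // apply(entry) would go here
  }
}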
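
And here's the kind of coordination mechanism I have in mind for the
volume-grouping case, assuming QEMU could hand every image instance in a
group a shared allocator (again, hypothetical names, not an existing
librbd interface):

// Hypothetical coordination layer, assuming QEMU/librbd grew a notion
// of a volume group: every image instance in the group shares one
// atomic counter, so sequence numbers form a single total order across
// volumes instead of one independent order per tag.
#include <atomic>
#include <cstdint>
#include <memory>

class GroupSequencer {
 public:
  uint64_t allocate() {
    return next_.fetch_add(1, std::memory_order_relaxed);
  }
 private:
  std::atomic<uint64_t> next_{0};
};

struct ImageJournalCtx {
  uint64_t tag_id;                            // still one tag per image
  std::shared_ptr<GroupSequencer> group_seq;  // shared across the group

  uint64_t next_entry_seq() {
    // with a group-wide seq, replaying entries sorted by seq would be
    // crash-consistent across every volume in the group
    return group_seq->allocate();
  }
};

With something like that in place, streaming the commingled journal
off-site would give you the crash-consistent-across-volumes property
discussed above, rather than just per-volume consistency.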