> >> ...Actually, doesn't *not* forcing a coordinated move from one object
> >> set to another mean that you don't actually have an ordering guarantee
> >> across tags if you replay the journal objects in order?
> >
> > The ordering between tags was meant to be a soft ordering guarantee
> > (since any number of delays could throw off the actual order as
> > delivered from the OS). In the case of a VM using multiple RBD images
> > sharing the same journal, this provides an ordering guarantee per
> > device but not between devices.
> >
> > This is no worse than the case of each RBD image using its own journal
> > instead of sharing a journal, and the behavior doesn't seem too
> > different from a non-RBD case when submitting requests to two different
> > physical devices (e.g. an SSD device and a NAS device will commit data
> > at different latencies).
>
> Yes, it's exactly the same. But I thought the point was that if you
> commingle the journals then you actually have the appropriate ordering
> across clients/disks (if there's enough ordering and synchronization)
> that you can stream the journal off-site and know that if there's any
> kind of disaster you are always at least crash-consistent. If there's
> arbitrary re-ordering of different volume writes at object boundaries
> then I don't see what benefit there is to having a commingled journal
> at all.
>
> I think there's a thing called a "consistency group" in various
> storage platforms that is sort of similar to this, where you can take
> a snapshot of a related group of volumes at once. I presume the
> commingled journal is an attempt at basically having an ongoing
> snapshot of the whole consistency group.

Seems like even with a SAN-type consistency group, you could still have
temporal ordering issues between volume writes unless it synchronized
with the client OSes to flush out all volumes at a consistent point so
that the snapshot could take place.

I suppose you could provide much tighter QEMU inter-volume ordering
guarantees if you modified the RBD block device so that each individual
RBD image instance was given a mechanism to coordinate the allocation of
sequence numbers between the images. Right now, each image is opened in
its own context, with no knowledge of the others and no way to
coordinate.

The current proposed tag + sequence number approach could be used to
provide soft inter-volume ordering guarantees until QEMU / librbd can be
modified to support volume groupings.
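
To make the soft-ordering semantics concrete, here's a rough sketch of
what replay looks like under the proposed scheme (illustrative C++ only,
not the actual librbd journal format or API):

// Illustrative only -- not the actual librbd journal entry layout.
// Each entry carries a tag (one per image sharing the journal) and a
// sequence number allocated independently per tag.
#include <cstdint>
#include <map>
#include <vector>

struct JournalEntry {
  uint64_t tag_id;               // identifies the image that appended it
  uint64_t seq;                  // monotonic per tag, private to that image
  std::vector<uint8_t> payload;  // write data/metadata
};

// Replaying in journal-commit order gives a hard guarantee within each
// tag (seqs must appear in order), but nothing relates the seqs of
// different tags, so inter-volume ordering is only "soft".
void replay(const std::vector<JournalEntry>& commit_order) {
  std::map<uint64_t, uint64_t> expected;  // tag_id -> next expected seq
  for (const auto& entry : commit_order) {
    auto it = expected.find(entry.tag_id);
    uint64_t want = (it == expected.end()) ? 0 : it->second;
    if (entry.seq != want) {
      // a gap within a single tag would be a real replay error;
      // interleaving across tags is expected and tolerated
    }
    expected[entry.tag_id] = entry.seq + 1;
    // apply(entry) would go here
  }
}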
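
And here's the kind of coordination mechanism I have in mind for the
volume-grouping case, assuming QEMU could hand every image instance in a
group a shared allocator (again, hypothetical names, not an existing
librbd interface):

// Hypothetical coordination layer, assuming QEMU/librbd grew a notion
// of a volume group: every image instance in the group shares one
// atomic counter, so sequence numbers form a single total order across
// volumes instead of one independent order per tag.
#include <atomic>
#include <cstdint>
#include <memory>

class GroupSequencer {
 public:
  uint64_t allocate() {
    return next_.fetch_add(1, std::memory_order_relaxed);
  }
 private:
  std::atomic<uint64_t> next_{0};
};

struct ImageJournalCtx {
  uint64_t tag_id;                            // still one tag per image
  std::shared_ptr<GroupSequencer> group_seq;  // shared across the group

  uint64_t next_entry_seq() {
    // with a group-wide seq, replaying entries sorted by seq would be
    // crash-consistent across every volume in the group
    return group_seq->allocate();
  }
};

With something like that in place, streaming the commingled journal
off-site would give you the crash-consistent-across-volumes property
discussed above, rather than just per-volume consistency.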