On Thu, Jun 4, 2015 at 8:08 AM, Jason Dillaman <dillaman@xxxxxxxxxx> wrote:
>> >> > A successful append will indicate whether or not the journal is now
>> >> > full (larger than the max object size), indicating to the client that
>> >> > a new journal object should be used. If the journal is too large, an
>> >> > error code response would alert the client that it needs to write to
>> >> > the current active journal object. In practice, the only time the
>> >> > journaler should expect to see such a response would be in the case
>> >> > where multiple clients are using the same journal and the active
>> >> > object update notification has yet to be received.
>> >>
>> >> I'm confused. How does this work with the splay count thing you
>> >> mentioned above? Can you define <splay count>?
>> >
>> > Similar to the stripe width.
>>
>> Okay, that sort of makes sense, but I don't see how you could legally be
>> writing to different "sets", so why not just make it an explicit striping
>> thing and move all journal entries for that "set" at once?
>>
>> ...Actually, doesn't *not* forcing a coordinated move from one object set
>> to another mean that you don't actually have an ordering guarantee across
>> tags if you replay the journal objects in order?
>
> The ordering between tags was meant to be a soft ordering guarantee (since
> any number of delays could throw off the actual order as delivered from
> the OS). In the case of a VM using multiple RBD images sharing the same
> journal, this provides an ordering guarantee per device but not between
> devices.
>
> This is no worse than the case of each RBD image using its own journal
> instead of sharing a journal, and the behavior doesn't seem too different
> from a non-RBD case when submitting requests to two different physical
> devices (e.g. an SSD device and a NAS device will commit data at
> different latencies).

Yes, it's exactly the same. But I thought the point was that if you
commingle the journals then you actually have the appropriate ordering
across clients/disks (given enough ordering and synchronization) that you
can stream the journal off-site and know that if there's any kind of
disaster you are always at least crash-consistent. If there's arbitrary
re-ordering of different volume writes at object boundaries, then I don't
see what benefit there is to having a commingled journal at all.

I think there's a thing called a "consistency group" in various storage
platforms that is sort of similar to this, where you can take a snapshot
of a related group of volumes at once. I presume the commingled journal
is an attempt at basically having an ongoing snapshot of the whole
consistency group.
-Greg
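
P.S. For anyone following along, here's a rough, self-contained sketch of
the append handling described at the top of the thread: the client retries
on a "stale active object" error and advances to the next object after a
"full" success. None of these names come from the actual journaler;
append_entry(), fetch_active_object(), AppendResult, and the choice of
-EOVERFLOW as the error code are all stand-ins for illustration.

// Hypothetical sketch only; not the real journaler API.
#include <cerrno>
#include <cstdint>
#include <cstdio>

static const size_t MAX_OBJECT_SIZE = 4096;  // assumed per-object cap

struct AppendResult {
  int r;            // 0 on success, -EOVERFLOW if this object is stale
  bool object_full; // append succeeded but pushed the object past the cap
};

// Toy model of the backend: one "active" object with a byte counter.
// In the real design the active object would advance per splay position
// and be published via the journal header / update notification.
static uint64_t g_active_object = 0;
static size_t   g_object_bytes  = 0;

static AppendResult append_entry(uint64_t object_num, size_t len) {
  if (object_num != g_active_object)
    return {-EOVERFLOW, false};  // caller is behind: object already advanced
  g_object_bytes += len;
  return {0, g_object_bytes >= MAX_OBJECT_SIZE};
}

static uint64_t fetch_active_object() { return g_active_object; }

// Client-side retry loop: on -EOVERFLOW (our update notification hasn't
// arrived yet), refresh the active object and retry; on a "full" success,
// direct subsequent entries at the next object.
static void journaled_append(uint64_t &active_object, size_t len) {
  while (true) {
    AppendResult res = append_entry(active_object, len);
    if (res.r == -EOVERFLOW) {
      active_object = fetch_active_object();
      continue;
    }
    if (res.object_full) {
      ++active_object;
      g_active_object = active_object;  // toy model: publish the new object
      g_object_bytes  = 0;
    }
    return;
  }
}

int main() {
  uint64_t active = fetch_active_object();
  for (int i = 0; i < 5; ++i)
    journaled_append(active, 1500);  // third append fills the first object
  printf("finished on object %llu\n", (unsigned long long)active);
  return 0;
}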