On Thu, Jun 4, 2015 at 8:08 AM, Jason Dillaman <dillaman@xxxxxxxxxx> wrote:
>> >> > A successful append will indicate whether or not the journal is now
>> >> > full (larger than the max object size), indicating to the client that
>> >> > a new journal object should be used. If the journal is too large, an
>> >> > error code response would alert the client that it needs to write to
>> >> > the current active journal object. In practice, the only time the
>> >> > journaler should expect to see such a response would be in the case
>> >> > where multiple clients are using the same journal and the active
>> >> > object update notification has yet to be received.
>> >>
>> >> I'm confused. How does this work with the splay count thing you
>> >> mentioned above? Can you define <splay count>?
>> >
>> > Similar to the stripe width.
>>
>> Okay, that sort of makes sense, but I don't see how you could legally be
>> writing to different "sets", so why not just make it an explicit striping
>> thing and move all journal entries for that "set" at once?
>>
>> ...Actually, doesn't *not* forcing a coordinated move from one object set
>> to another mean that you don't actually have an ordering guarantee across
>> tags if you replay the journal objects in order?
>
> The ordering between tags was meant to be a soft ordering guarantee (since
> any number of delays could throw off the actual order as delivered from
> the OS). In the case of a VM using multiple RBD images sharing the same
> journal, this provides an ordering guarantee per device but not between
> devices.
>
> This is no worse than the case of each RBD image using its own journal
> instead of sharing a journal, and the behavior doesn't seem too different
> from a non-RBD case when submitting requests to two different physical
> devices (e.g. an SSD device and a NAS device will commit data at
> different latencies).

Yes, it's exactly the same. But I thought the point was that if you
commingle the journals then you actually have the appropriate ordering
across clients/disks (given enough ordering and synchronization) that you
can stream the journal off-site and know that if there's any kind of
disaster you are always at least crash-consistent. If there's arbitrary
re-ordering of different volume writes at object boundaries, then I don't
see what benefit there is to having a commingled journal at all.

I think there's a thing called a "consistency group" in various storage
platforms that is sort of similar to this, where you can take a snapshot
of a related group of volumes at once. I presume the commingled journal
is an attempt at basically having an ongoing snapshot of the whole
consistency group.
-Greg
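
P.S. For anyone following along, here's a rough, self-contained sketch of
the append handling described at the top of the thread: the client retries
on a "stale active object" error and advances to the next object after a
"full" success. None of these names come from the actual journaler;
append_entry(), fetch_active_object(), AppendResult, and the choice of
-EOVERFLOW as the error code are all stand-ins for illustration.

// Hypothetical sketch only; not the real journaler API.
#include <cerrno>
#include <cstdint>
#include <cstdio>

static const size_t MAX_OBJECT_SIZE = 4096;  // assumed per-object cap

struct AppendResult {
  int r;            // 0 on success, -EOVERFLOW if this object is stale
  bool object_full; // append succeeded but pushed the object past the cap
};

// Toy model of the backend: one "active" object with a byte counter.
// In the real design the active object would advance per splay position
// and be published via the journal header / update notification.
static uint64_t g_active_object = 0;
static size_t   g_object_bytes  = 0;

static AppendResult append_entry(uint64_t object_num, size_t len) {
  if (object_num != g_active_object)
    return {-EOVERFLOW, false};  // caller is behind: object already advanced
  g_object_bytes += len;
  return {0, g_object_bytes >= MAX_OBJECT_SIZE};
}

static uint64_t fetch_active_object() { return g_active_object; }

// Client-side retry loop: on -EOVERFLOW (our update notification hasn't
// arrived yet), refresh the active object and retry; on a "full" success,
// direct subsequent entries at the next object.
static void journaled_append(uint64_t &active_object, size_t len) {
  while (true) {
    AppendResult res = append_entry(active_object, len);
    if (res.r == -EOVERFLOW) {
      active_object = fetch_active_object();
      continue;
    }
    if (res.object_full) {
      ++active_object;
      g_active_object = active_object;  // toy model: publish the new object
      g_object_bytes  = 0;
    }
    return;
  }
}

int main() {
  uint64_t active = fetch_active_object();
  for (int i = 0; i < 5; ++i)
    journaled_append(active, 1500);  // third append fills the first object
  printf("finished on object %llu\n", (unsigned long long)active);
  return 0;
}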