Re: RBD journal draft design

On Tue, Jun 2, 2015 at 8:11 AM, Jason Dillaman <dillaman@xxxxxxxxxx> wrote:
> I am posting to get wider review/feedback on this draft design.  In support of the RBD mirroring feature [1], a new client-side journaling class will be developed for use by librbd.  The implementation is designed to carry opaque journal entry payloads so that it can be re-used by other applications in the future.  It will also use the librados API for all operations.  At a high level, a single journal will be composed of a journal header to store metadata and multiple journal objects to contain the individual journal entries.
>
> Journal objects will be named "<journal object prefix>.<journal id>.<object number>".  An individual journal object will hold one or more journal entries, appended one after another.  Journal objects will have a configurable soft maximum size.  After the size has been exceeded, a new journal object (numbered current object + number of journal objects) will be created for future journal entries and the header active set will be updated so that other clients know that a new journal object was created.
>
> In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal.

Am I misremembering what actually got done with our journal v2 format?
I think this is done — or at least we made a move in this direction.

> Additionally, a given journal entry will not be striped across multiple journal objects.  Journal entries will be mapped to journal objects using the sequence number: <sequence number> mod <splay count> == <object number> mod <splay count> for active journal objects.

Okay, that's different.
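
If I'm reading the mapping right, it works out to something like the
sketch below (the function and parameter names are just placeholders,
not proposed API):

def journal_object_name(prefix, journal_id, object_number):
    # "<journal object prefix>.<journal id>.<object number>"
    return "%s.%s.%d" % (prefix, journal_id, object_number)

def object_for_sequence(sequence_number, splay_count, active_objects):
    # an entry maps to the active object whose number is congruent to the
    # sequence number modulo the splay count
    for object_number in active_objects:
        if sequence_number % splay_count == object_number % splay_count:
            return object_number
    raise ValueError("no active object matches sequence %d" % sequence_number)

# e.g. splay_count=4, active objects [8, 9, 10, 11]: sequence 42 maps to
# object 10, since 42 % 4 == 10 % 4 == 2.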

>
> The rationale for this difference is to facilitate parallelism for appends, as journal entries will be splayed across a configurable number of journal objects.  The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once committed to disk. The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and as a potential future enhancement to delay appends to the journal in order to satisfy EC-pool alignment requirements).
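
For my own understanding, the caller-side flow with futures would look
roughly like the toy sketch below (the class and method names are made
up for illustration, not the proposed API):

from concurrent.futures import ThreadPoolExecutor
import itertools
import threading

class ToyJournal:
    """Stand-in used only to illustrate the future-based append flow."""
    def __init__(self):
        self._sequence = itertools.count()
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=4)

    def append(self, tag, payload):
        # returns immediately; the sequence number is only known once the
        # entry has actually been committed
        return self._pool.submit(self._commit, tag, payload)

    def _commit(self, tag, payload):
        with self._lock:
            sequence = next(self._sequence)
        # ... write the payload to the splayed journal object here ...
        return sequence

journal = ToyJournal()
future = journal.append("rbd-image-1", b"encoded write event")
print("committed as sequence", future.result())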
>
> Sequence numbers are treated as a monotonically increasing integer for a given value of journal entry tag.  This allows for the possibility for multiple clients to concurrently use the same journal (e.g. all RBD disks within a given VM could use the same journal).  This will provide a loose coupling of operations between different clients using the same journal.
>
> A new journal object class method will be used to submit journal entry append requests.  This will act as a gatekeeper for the concurrent client case.

The object class is going to be a big barrier to using EC pools, unless
you want to gate journal support for EC pools on EC pools gaining
support for object classes. :(

> A successful append will indicate whether or not the journal object is now full (larger than the max object size), indicating to the client that a new journal object should be used.  If the journal object is already too large, an error code response would alert the client that it needs to write to the current active journal object.  In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.

I'm confused. How does this work with the splay count thing you
mentioned above? Can you define <splay count>?

What happens if users submit sequenced entries substantially out of
order? It sounds like if you have multiple writers (or even just a
misbehaving client) it would not be hard for one of them to grab
sequence value N, for another to fill up one of the journal entry
objects with sequences in the range [N+1]...[N+x] and then for the
user of N to get an error response.
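
In other words, my mental model of the gatekeeper is something like the
sketch below (the size check and the error name are my guesses, not
from the draft):

SOFT_MAX_OBJECT_SIZE = 1 << 22  # illustrative soft maximum

def guarded_append(object_data, entry):
    # object_data: bytearray standing in for the journal object contents
    if len(object_data) >= SOFT_MAX_OBJECT_SIZE:
        # object already overflowed (e.g. filled with [N+1]...[N+x] by
        # faster writers), so the slow holder of sequence N gets bounced
        return "EOVERFLOW"
    object_data.extend(entry)
    # a successful append reports whether the object is now full
    return "FULL" if len(object_data) >= SOFT_MAX_OBJECT_SIZE else "OK"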

>
> All the journal objects will be tied together by means of a journal header object, named "<journal header prefix>.<journal id>".  This object will contain the current committed journal entry positions of all registered clients.  In librbd's case, each mirrored copy of an image would be a new registered client.  OSD class methods would be used to create/manipulate the journal header to serialize modifications from multiple clients.
>
> Journal recovery / playback will iterate through each journal entry from the journal, in sequence order.  Journal objects will be prefetched (where possible) to a configurable amount to improve the latency.  Journal entry playback can use an optional client-specified filter to only iterate over entries with a matching journal entry tag.  The API will need to support the use case of an external client periodically testing the journal for new data.
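
If I follow, playback on the consumer side would amount to roughly the
following (purely illustrative, not proposed API):

def replay(entries, handler, tag_filter=None):
    # iterate committed entries in sequence order, optionally skipping
    # entries whose tag does not match the client-specified filter
    for entry in sorted(entries, key=lambda e: e["seq"]):
        if tag_filter is not None and entry["tag"] != tag_filter:
            continue
        handler(entry)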
>
> Journal trimming will be accomplished by removing a whole journal object.  Only after all registered users of the journal have indicated that they have committed all journal entries within the journal object (via an update to the journal header metadata) will the journal object be deleted and the header updated to indicate the new starting object number.
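
So the trim rule, as I read it, boils down to something like this
(illustrative only):

def can_trim(object_max_sequence, client_commit_positions):
    # a journal object may be deleted only once every registered client
    # has committed at least up to the last entry it contains
    return all(committed >= object_max_sequence
               for committed in client_commit_positions.values())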
>
> Since the journal is designed to be append-only, there needs to be support for cases where a journal entry needs to be updated out-of-band (e.g. fixing a corrupt entry, similar to CephFS's current journal recovery tools).  The proposed solution is to just append, to the end of the journal, a new journal entry with the same sequence number as the record to be replaced (i.e. last entry for a given sequence number wins).  This also protects against accidental replays of the original append operation.  An alternative suggestion would be to use a compare-and-swap mechanism to update the full journal object with the updated contents.

I'm confused by this bit. It seems to imply that fetching a single
entry requires checking the entire object to make sure there's no
replacement. Certainly if we were doing replay we couldn't just apply
each entry sequentially any more, because an entry we've already applied
might be superseded by a replacement (same sequence number) that was
appended later, at a later offset or even in a different journal object.
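
i.e. replay would first have to resolve replacements, something like
this rough sketch:

def resolve_replacements(entries_in_append_order):
    # under "last entry for a given sequence number wins", keep only the
    # most recently appended entry per sequence number, then apply in
    # sequence order
    latest = {}
    for entry in entries_in_append_order:
        latest[entry["seq"]] = entry
    return [latest[seq] for seq in sorted(latest)]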

I'd also like it if we could organize a single Journal implementation
within the Ceph project, or at least have a blessed one going forward
that we use for new stuff and might plausibly migrate existing users
to. The big things I see different from osdc/Journaler are:

1) (design) class-based
2) (design) uses librados instead of Objecter (hurray)
3) (need) should allow multiple writers
4) (fallout of other choices?) does not stripe entries across multiple objects

Using librados instead of the Objecter might make this tough to use in
the MDS, but we've already got journaling happening in a separate
thread and it's one of the more isolated bits of code so we might be
able to handle it. I'm not sure if we'd want to stripe across objects
or not, but the possibility does appeal to me.

>
> Journal Header
> ~~~~~~~~~~~~~~
>
> omap
> * soft max object size
> * journal objects splay count
> * min object number
> * most recent active journal objects (could be out-of-date)
> * registered clients
>   * client description (i.e. zone)
>   * journal entry tag
>   * last committed sequence number
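
So the header omap would hold something shaped roughly like this (key
names and values invented for illustration):

journal_header = {
    "soft_max_object_size": 1 << 22,
    "splay_count": 4,
    "min_object_number": 8,
    "active_objects": [8, 9, 10, 11],   # could be out-of-date
    "registered_clients": {
        "zone-a": {"tag": "rbd-image-1", "last_committed_seq": 1041},
        "zone-b": {"tag": "rbd-image-1", "last_committed_seq": 987},
    },
}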

omap definitely doesn't go in EC pools — I'm not sure how blue-sky you
were thinking when you mentioned those. :)

More generally, the naive client implementation would be pretty slow to
commit something (go to the header for a sequence number, then write the
data out). Do you expect to always have a queue of sequence numbers
available in case you need to do an immediate commit of data? What keeps
the single-header sequence assignment from becoming a bottleneck on its
own for multiple clients? It will need to do a write each time...
-Greg