Re: RBD mirroring design draft

On 28/05/2015 06:37, Gregory Farnum wrote:
On Tue, May 12, 2015 at 5:42 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
It will need some metadata regarding positions in the journal. These
could be stored as omap values in a 'journal header' object in a
replicated pool, for rbd perhaps the same pool as the image for
simplicity. The header would contain at least the following fields
(see the sketch after this list):

* pool_id - where journal data is stored
* journal_object_prefix - unique prefix for journal data objects
* positions - (zone, purpose, object num, offset) tuples indexed by zone
* object_size - approximate size of each data object
* object_num_begin - current earliest object in the log
* object_num_end - max potential object in the log
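
To make that concrete, here's a rough sketch of the header metadata as a
C++ struct. Field names and types are illustrative only; in practice each
field would just be an omap value on the journal header object.

  // Hypothetical in-memory view of the journal header metadata.  Each
  // field corresponds to an omap value on the 'journal header' object.

  #include <cstdint>
  #include <map>
  #include <string>

  struct JournalPosition {
    std::string zone;      // zone that recorded this position
    std::string purpose;   // e.g. mirroring vs. trimming
    uint64_t object_num;   // journal data object the position refers to
    uint64_t offset;       // byte offset within that object
  };

  struct JournalHeader {
    int64_t pool_id;                    // where journal data is stored
    std::string journal_object_prefix;  // unique prefix for data objects
    std::map<std::string, JournalPosition> positions;  // indexed by zone
    uint64_t object_size;               // approximate size of each data object
    uint64_t object_num_begin;          // current earliest object in the log
    uint64_t object_num_end;            // max potential object in the log
  };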

Similar to rbd images, journal data would be stored in objects named
after the journal_object_prefix plus their object number. To avoid
padding or splitting journal entries, and to keep the objects
append-only, it's easier to let each object grow to approximately
object_size before moving on to the next object number, rather than
enforcing an exact object size.
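
A minimal sketch of that naming and roll-over policy (the helper names are
made up; real code would issue the appends through librados):

  #include <cstdint>
  #include <string>

  // Journal data objects are named <journal_object_prefix><object_num>.
  std::string journal_object_name(const std::string &prefix,
                                  uint64_t object_num) {
    return prefix + std::to_string(object_num);
  }

  // Entries are never padded or split across objects: keep appending to
  // the current object until it has grown to roughly object_size, then
  // move on to the next object number.
  uint64_t next_append_object(uint64_t current_object_num,
                              uint64_t current_object_size,
                              uint64_t object_size) {
    if (current_object_size >= object_size)
      return current_object_num + 1;  // roll over to a fresh object
    return current_object_num;        // keep appending to this one
  }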

Ideally this underlying structure could be used for both rbd and
cephfs. Variable-sized objects are different from the existing cephfs
journal, which uses fixed-size objects for striping. The default is
still 4MB chunks, though. How important is striping the journal to
cephfs? For rbd it seems unlikely to help much, since updates need to
be batched up by the client cache anyway.
I think the journaling v2 stuff that John did actually made objects
variably-sized as you've described here. We've never done any sort of
striping on the MDS journal, although I think it was possible
previously.

The objects are still fixed-size: we talked about changing it so that journal events would never span an object boundary, but didn't do it -- it still uses Filer.



Parallelism
^^^^^^^^^^^

Mirroring many images is embarrassingly parallel. A simple unit of
work is an image (more specifically a journal, if e.g. a group of
images were to share a journal as part of a consistency group in the
future).

Spreading this work across threads within a single process is
relatively simple. For HA, and to avoid a single NIC becoming a
bottleneck, we'll want to spread out the work across multiple
processes (and probably multiple hosts). rbd-mirror should have no
local state, so we just need a mechanism to coordinate the division of
work across multiple processes.

One way to do this would be to layer it on top of watch/notify. Each
rbd-mirror process in a zone could watch the same object, and shard
the set of images to mirror based on a hash of image ids onto the
current set of rbd-mirror processes, sorted by client gid. The set of
rbd-mirror processes could be determined by listing watchers.
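
A rough sketch of that sharding rule, assuming the image ids and the
watcher gids are already in hand (this is just the hash-and-mod policy
described above, not a worked-out protocol):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <string>
  #include <vector>

  // Returns the client gid of the rbd-mirror process responsible for
  // image_id, or 0 if there are currently no watchers.  Sorting by gid
  // lets every daemon compute the same assignment independently.
  uint64_t assigned_mirror(const std::string &image_id,
                           std::vector<uint64_t> watcher_gids) {
    if (watcher_gids.empty())
      return 0;
    std::sort(watcher_gids.begin(), watcher_gids.end());
    std::size_t idx = std::hash<std::string>{}(image_id) % watcher_gids.size();
    return watcher_gids[idx];
  }

  // Each daemon then mirrors only its own share:
  //   if (assigned_mirror(image_id, gids) == my_gid) { /* mirror it */ }
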
You're going to have some tricky cases here when reassigning authority
as watchers come and go, but I think it should be doable.

I've been fantasizing about something similar to this for CephFS backward scrub/recovery. My current code supports parallelism, but relies on the user to script their population of workers across client nodes.

I had been thinking of more of a master/slave model, where one guy would get to be the master by e.g. taking the lock on an object, and he would then hand out work to everyone else that was a watch/notify subscriber to the magic object. It seems like that could be simpler than having the workers independently work out what their workload should be, and it would have the added bonus of providing a command-like mechanism in addition to continuous operation.
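
For what it's worth, a minimal sketch of the "take a lock to become the
master" part using librados object locks (the pool, object name, lock name
and cookie are all made up, and the notify-based work hand-out is only
hinted at in comments):

  #include <rados/librados.hpp>
  #include <iostream>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");            // client.admin, for illustration only
    cluster.conf_read_file(nullptr);  // default ceph.conf search path
    if (cluster.connect() < 0) {
      std::cerr << "failed to connect to cluster" << std::endl;
      return 1;
    }

    librados::IoCtx ioctx;
    cluster.ioctx_create("rbd", ioctx);  // hypothetical pool choice

    // Whoever takes the exclusive lock on the coordination object is the
    // master; a renewable duration (instead of nullptr) would let another
    // daemon take over if the master dies.
    int r = ioctx.lock_exclusive("rbd_mirror.leader",   // coordination object
                                 "leader",              // lock name
                                 "mirror-daemon-1",     // cookie
                                 "rbd-mirror master",   // description
                                 nullptr,               // no expiry in this sketch
                                 0);
    if (r == 0) {
      std::cout << "became master, handing out work" << std::endl;
      // ... list images, notify the watch/notify subscribers with their
      // assignments ...
      ioctx.unlock("rbd_mirror.leader", "leader", "mirror-daemon-1");
    } else {
      std::cout << "someone else is master, acting as a worker" << std::endl;
      // ... watch the coordination object and wait for assignments ...
    }

    ioctx.close();
    cluster.shutdown();
    return 0;
  }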

Cheers,
John