On Thu, May 28, 2015 at 3:42 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
>
> On 28/05/2015 06:37, Gregory Farnum wrote:
>>
>> On Tue, May 12, 2015 at 5:42 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>> Parallelism
>>> ^^^^^^^^^^^
>>>
>>> Mirroring many images is embarrassingly parallel. A simple unit of
>>> work is an image (more specifically a journal, if e.g. a group of
>>> images shared a journal as part of a consistency group in the future).
>>>
>>> Spreading this work across threads within a single process is
>>> relatively simple. For HA, and to avoid a single NIC becoming a
>>> bottleneck, we'll want to spread the work across multiple
>>> processes (and probably multiple hosts). rbd-mirror should have no
>>> local state, so we just need a mechanism to coordinate the division of
>>> work across multiple processes.
>>>
>>> One way to do this would be layering on top of watch/notify. Each
>>> rbd-mirror process in a zone could watch the same object, and shard
>>> the set of images to mirror based on a hash of image ids onto the
>>> current set of rbd-mirror processes sorted by client gid. The set of
>>> rbd-mirror processes could be determined by listing watchers.
>>
>> You're going to have some tricky cases here when reassigning authority
>> as watchers come and go, but I think it should be doable.
>
> I've been fantasizing about something similar to this for CephFS backward
> scrub/recovery. My current code supports parallelism, but relies on the
> user to script their population of workers across client nodes.
>
> I had been thinking of more of a master/slaves model, where one guy would
> get to be the master by e.g. taking the lock on an object, and he would
> then hand out work to everyone else that was a watch/notify subscriber to
> the magic object. It seems like that could be simpler than having each
> worker work out independently what its workload should be, and it would
> have the added bonus of providing a command-like mechanism in addition to
> continuous operation.

Heh. This could be the method, but I caution people that it's a brand-new
use case for watch/notify and I'm not too sure how it would perform. I
suspect we'd need to keep the chunks of work pretty large in order to avoid
the watch/notify cycle latencies becoming a limiting factor. ;)

Speaking more generally, unless a peer-based model turns out to be
infeasible, I much prefer that approach: such systems are sometimes more
complicated, but they are generally much more resilient to failures and
tend to be better designed for recovery than a model where everything
resides in the master's memory and has to be reconstructed after a failure.
-Greg
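
For illustration, here is a minimal sketch of the sharding scheme Josh
describes above (hash each image id onto the current set of rbd-mirror
processes, sorted by client gid, with the set discovered by listing
watchers). It is a sketch only, not the actual rbd-mirror implementation:
the helper names (shard_images, watcher_gids, my_gid) and the use of a
plain modulo over SHA-1 are assumptions made here for clarity, not real
librados/librbd calls.

    # Hypothetical sketch: every worker computes the same assignment
    # locally from the shared watcher list, so no extra coordination
    # message is needed beyond watch/notify membership.
    import hashlib

    def shard_images(image_ids, watcher_gids, my_gid):
        """Return the subset of image_ids this worker should mirror."""
        peers = sorted(watcher_gids)      # stable order shared by all workers
        my_slot = peers.index(my_gid)     # this worker's position in the list
        n = len(peers)
        mine = []
        for image_id in image_ids:
            # Hash the image id so work spreads roughly uniformly and
            # every worker derives the same answer independently.
            h = int(hashlib.sha1(image_id.encode()).hexdigest(), 16)
            if h % n == my_slot:
                mine.append(image_id)
        return mine

    # Example: three workers with client gids 97, 11, 42; the worker with
    # gid 42 picks out its shard of ten images.
    if __name__ == "__main__":
        images = ["img-%04d" % i for i in range(10)]
        print(shard_images(images, watcher_gids=[97, 11, 42], my_gid=42))

Note that a plain modulo like this reshuffles most images whenever the
watcher set changes size, which is exactly the "reassigning authority as
watchers come and go" concern raised above; a consistent-hashing placement
would limit that churn, at the cost of a slightly less even split.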