On Thu, May 28, 2015 at 3:42 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
>
> On 28/05/2015 06:37, Gregory Farnum wrote:
>>
>> On Tue, May 12, 2015 at 5:42 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>> Parallelism
>>> ^^^^^^^^^^^
>>>
>>> Mirroring many images is embarrassingly parallel. A simple unit of
>>> work is an image (more specifically a journal, if e.g. a group of
>>> images shared a journal as part of a consistency group in the future).
>>>
>>> Spreading this work across threads within a single process is
>>> relatively simple. For HA, and to avoid a single NIC becoming a
>>> bottleneck, we'll want to spread the work across multiple
>>> processes (and probably multiple hosts). rbd-mirror should have no
>>> local state, so we just need a mechanism to coordinate the division of
>>> work across multiple processes.
>>>
>>> One way to do this would be layering on top of watch/notify. Each
>>> rbd-mirror process in a zone could watch the same object, and shard
>>> the set of images to mirror based on a hash of image ids onto the
>>> current set of rbd-mirror processes sorted by client gid. The set of
>>> rbd-mirror processes could be determined by listing watchers.
>>
>> You're going to have some tricky cases here when reassigning authority
>> as watchers come and go, but I think it should be doable.
>
> I've been fantasizing about something similar to this for CephFS backward
> scrub/recovery. My current code supports parallelism, but relies on the
> user to script their population of workers across client nodes.
>
> I had been thinking of more of a master/slaves model, where one guy would
> get to be the master by e.g. taking the lock on an object, and he would
> then hand out work to everyone else that was a watch/notify subscriber to
> the magic object. It seems like that could be simpler than having each
> worker work out independently what its workload should be, and it would
> have the added bonus of providing a command-like mechanism in addition to
> continuous operation.

Heh. This could be the method, but I caution people that it's a brand-new
use case for watch/notify and I'm not too sure how it would perform. I
suspect we'd need to keep the chunks of work pretty large in order to avoid
the watch/notify cycle latencies becoming a limiting factor. ;)

Speaking more generally, unless a peer-based model turns out to be
infeasible, I much prefer that approach: such systems are sometimes more
complicated, but they are generally much more resilient to failures and
tend to be better designed for recovery than a model where everything
resides in the master's memory and has to be reconstructed after a failure.
-Greg
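
For illustration, here is a minimal sketch of the sharding scheme Josh
describes above (hash each image id onto the current set of rbd-mirror
processes, sorted by client gid, with the set discovered by listing
watchers). It is a sketch only, not the actual rbd-mirror implementation:
the helper names (shard_images, watcher_gids, my_gid) and the use of a
plain modulo over SHA-1 are assumptions made here for clarity, not real
librados/librbd calls.

    # Hypothetical sketch: every worker computes the same assignment
    # locally from the shared watcher list, so no extra coordination
    # message is needed beyond watch/notify membership.
    import hashlib

    def shard_images(image_ids, watcher_gids, my_gid):
        """Return the subset of image_ids this worker should mirror."""
        peers = sorted(watcher_gids)      # stable order shared by all workers
        my_slot = peers.index(my_gid)     # this worker's position in the list
        n = len(peers)
        mine = []
        for image_id in image_ids:
            # Hash the image id so work spreads roughly uniformly and
            # every worker derives the same answer independently.
            h = int(hashlib.sha1(image_id.encode()).hexdigest(), 16)
            if h % n == my_slot:
                mine.append(image_id)
        return mine

    # Example: three workers with client gids 97, 11, 42; the worker with
    # gid 42 picks out its shard of ten images.
    if __name__ == "__main__":
        images = ["img-%04d" % i for i in range(10)]
        print(shard_images(images, watcher_gids=[97, 11, 42], my_gid=42))

Note that a plain modulo like this reshuffles most images whenever the
watcher set changes size, which is exactly the "reassigning authority as
watchers come and go" concern raised above; a consistent-hashing placement
would limit that churn, at the cost of a slightly less even split.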