Re: About RADOS level replication

On Fri, 11 May 2018, Xuehan Xu wrote:
> > Given the description here, I think it would be much easier and more
> > efficient to implement sync at the CephFS level. The simplest option
> > is a "smart rsync" that is aware of CephFS' recursive statistics. You
> > then start at the root, look at each file or folder to see if its
> > (recursive) timestamp is newer than your last sync[1], and if it is,
> > you check out the children. Do an rsync on each individual file, and
> > profit!
> >
> > Now you have some choices to make on the consistency model — as Sage
> > says, you may want to take snapshots and do this work on a snapshot so
> > files are all consistent with each other, though that does impose more
> > work to take and clean up snapshots. Or you may just do it on a
> > file-by-file basis, in which case it's even easier to just do this
> > from a normal client. (Which is not to say you have to; you could also
> > build it inside the MDS or in some special privileged daemon or
> > something.)
> >
> > I find an approach like this attractive for a few reasons.
> > First of all, it's less invasive to the core code, which means it
> > interacts with fewer parts of the system, is easier to maintain in
> > itself, and doesn't impose costs on maintaining other parts of Ceph.
> > Second, CephFS is unlike RADOS in that it has a centralized,
> > consistent set of metadata for tracking all its data. We can take
> > advantage of that metadata (or *add* to it!) to make the job of
> > synchronizing easier. In RADOS, we are stuck with running algorithms
> > that parallelize, and being very careful to minimize the central
> > coordination points. That's good for scale, but very bad for ease of
> > understanding and development.
> > -Greg
> > [1]: This is *slightly* trickier than I make it sound to get right, as
> > the rstats are updated *lazily*. So you may run a sync at 16:00:00 and
> > miss a file down the tree that was updated at 15:59:10 because that
> > change hasn't gotten all the way to the root. You'd probably want a
> > database associated with the sync daemon that tracks the timestamps
> > you saw at the previous sync. If you wanted to build this into the MDS
> > system with its own metadata you could look at how forward scrub works
> > for a model.
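
For concreteness, here is a minimal sketch (in Python) of the kind of 
rstat-aware walk Greg describes above. It assumes a local CephFS mount and 
uses the ceph.dir.rctime virtual xattr; SRC_ROOT, DST_ROOT, the state file 
and the rsync invocation are all placeholders, deletions/renames are 
ignored, and a real tool would track per-directory timestamps to cope with 
the lazy rstat propagation mentioned in [1]:

# Minimal sketch of the rstat-aware walk described above.  Assumes a local
# CephFS mount; SRC_ROOT, DST_ROOT and STATE_FILE are placeholders.
import json
import os
import subprocess

SRC_ROOT = "/mnt/cephfs/data"        # local CephFS mount (assumed)
DST_ROOT = "backup:/srv/mirror"      # rsync destination (assumed)
STATE_FILE = "/var/lib/smartsync/state.json"

def rctime_secs(path):
    # ceph.dir.rctime is a virtual xattr CephFS exposes for the recursive
    # change time of a directory; second granularity is enough for pruning.
    raw = os.getxattr(path, "ceph.dir.rctime").decode()
    return int(raw.split(".")[0])

def walk(dirpath, last_sync):
    # Skip the whole subtree if nothing under it changed since the last run.
    if rctime_secs(dirpath) <= last_sync:
        return
    for entry in os.scandir(dirpath):
        if entry.is_dir(follow_symlinks=False):
            walk(entry.path, last_sync)
        elif entry.stat(follow_symlinks=False).st_ctime > last_sync:
            rel = os.path.relpath(entry.path, SRC_ROOT)
            # -R makes rsync recreate the relative path on the destination.
            subprocess.run(["rsync", "-aR", rel, DST_ROOT + "/"],
                           cwd=SRC_ROOT, check=True)

def main():
    try:
        with open(STATE_FILE) as f:
            last_sync = json.load(f)["last_sync"]
    except FileNotFoundError:
        last_sync = 0          # first run: copy everything
    # Because rstats propagate lazily (see [1] above), a production tool
    # would record per-directory timestamps instead of one global cutoff.
    cutoff = rctime_secs(SRC_ROOT)
    walk(SRC_ROOT, last_sync)
    with open(STATE_FILE, "w") as f:
        json.dump({"last_sync": cutoff}, f)

if __name__ == "__main__":
    main()
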
> 
> I think this "smart rsync" should be an appropriate way to meet our
> current need. And I think that maybe we can reuse the snapshot
> mechanism in this "smart sync": when we find that some files have been
> modified, we make snapshots only of those files that are going to be
> copied, and apply the diffs between snapshots to the other clusters.
> In this way, I think we should be able to save the bandwidth (network
> and disk) that would otherwise be spent copying the unmodified areas
> of those files. Is this right?

Yes, in that if you teach rsync about cephfs recursive stats, you can just 
as easily run it on a snapshot as on the live data.  One other comment, 
too: as Greg mentioned, the cephfs rstats are somewhat lazily propagated to 
the client.  However, I can also imagine, if we go down this path, that we 
might work out a way for the client to request the latest, consistent 
rstats from the MDS.  (I think we may already have this for the snapshot 
view, or be quite close to it--Zheng would know best as he's been 
actively working on this code.)

In any case, I think it would be awesome to try to modify normal rsync 
in a way that makes it notice it is on a CephFS mount, and if so, use the 
rctime to avoid checking subdirectories that have not changed... either 
via a special flag, or automatically based on the fs type returned by 
statfs.
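
As a rough illustration of that statfs-based detection, here is a small 
Python sketch (the fuse type string below is an assumption; a C patch to 
rsync itself would simply compare the f_type returned by statfs(2) against 
the CephFS superblock magic):

import os

def is_cephfs_mount(path):
    # Find the longest mount-point prefix of `path` in /proc/self/mounts and
    # check its filesystem type.  Kernel CephFS mounts report "ceph";
    # ceph-fuse mounts show up as a fuse type (string assumed here).
    path = os.path.realpath(path)
    best_mnt, best_type = "", None
    with open("/proc/self/mounts") as mounts:
        for line in mounts:
            _dev, mnt, fstype = line.split()[:3]
            if path == mnt or path.startswith(mnt.rstrip("/") + "/"):
                if len(mnt) >= len(best_mnt):
                    best_mnt, best_type = mnt, fstype
    return best_type == "ceph" or (best_type or "").startswith("fuse.ceph")

An rsync patch could then enable rctime-based pruning only when a check 
like this succeeds, so no new flag would be strictly required.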

> Finally, I think, although we don't have an urgent need for an "ops
> replication" mechanism in CephFS for now, we should plan ahead.
> So basically, I think maybe we can implement the final "top-level"
> cephfs replication in three steps: first, we implement a "smart
> rsync"; then a replication mechanism with point-in-time consistency at
> the file level; and finally, when we have the manpower and the
> resources, a replication mechanism with system-wide point-in-time
> consistency. Does this sound reasonable to you? Thanks.

I'm not immediately sure what steps 2 and 3 would look like from this 
description, but I agree that rsync should be step 1.

I have one other option, though, that may or may not be what you had in 
mind.  Let's call it the "snapmirror" mode, since it's roughly what the 
NetApp feature with that name does:

1- Automatically take snapshot S of the source filesystem.

2- Do an efficient traversal of the new snapshot S against the previous 
one (S-1) and send the diff to the remote site.  Once the whole diff is 
transferred, apply it to the remote fs.  (Or, on the remote site, apply 
the changes directly; if we fail to do the whole thing, roll back to 
snapshot S-1.)

3- Take snapshot S on the remote fs.

4- Remove snapshot S-1 on the local and remote fs.

5- Wait N minutes, then repeat.
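
For illustration, roughly what that loop could look like as a script 
(hostnames, mount points and the ssh/rsync transport are assumptions; the 
roll-back-to-S-1-on-failure part of step 2 is left out, and plain rsync 
recomputes the diff itself rather than walking the two snapshots):

# Hypothetical sketch of the "snapmirror"-style loop described above,
# assuming CephFS snapshots are enabled on both filesystems (a snapshot is
# created by making a directory under .snap).
import os
import subprocess
import time

SRC = "/mnt/cephfs-src"              # local CephFS mount (assumed)
DST = "backup-host:/mnt/cephfs-dst"  # remote fs, reachable over ssh (assumed)
DST_LOCAL = "/mnt/cephfs-dst"        # same fs as seen on backup-host (assumed)
INTERVAL = 15 * 60                   # "wait N minutes"

def remote(cmd):
    subprocess.run(["ssh", "backup-host"] + cmd, check=True)

def sync_cycle(seq):
    snap = "sync-%d" % seq
    prev = "sync-%d" % (seq - 1)

    # 1- take snapshot S of the source filesystem
    os.mkdir(os.path.join(SRC, ".snap", snap))

    # 2- ship the differences between S and S-1 to the remote site; here
    #    rsync recomputes the diff, a smarter tool could walk the two
    #    snapshots and send only what changed
    subprocess.run(["rsync", "-a", "--delete",
                    os.path.join(SRC, ".snap", snap) + "/",
                    DST + "/"], check=True)

    # 3- take snapshot S on the remote fs
    remote(["mkdir", os.path.join(DST_LOCAL, ".snap", snap)])

    # 4- remove snapshot S-1 on the local and remote fs
    if seq > 0:
        os.rmdir(os.path.join(SRC, ".snap", prev))
        remote(["rmdir", os.path.join(DST_LOCAL, ".snap", prev)])

def main():
    seq = 0
    while True:
        sync_cycle(seq)
        seq += 1
        # 5- wait N minutes, then repeat
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
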

I think the key here, in the end, is to make the setup/configuration of 
this sort of thing easy to understand, and to provide an easy, transparent 
view of the current sync status.

sage
