Hi Xuehan,

On Sun, 3 Jun 2018, Xuehan Xu wrote:

> Hi, sage. I think this "snapmirror" way may be better, since it doesn't involve the "rstat" lazy-updating problem and it wouldn't copy non-modified files.
>
> As you said, to go this way, we have to do a recursive snapshot diff calculation down the subtree that we are replicating, and send the diff to the remote site. I think maybe a similar way to rbd's "export-diff/import-diff" mechanism should be taken, so the whole replication process can be separated into two sub-processes: "export-diff" of the subtree on the source filesystem and "import-diff" on the remote filesystem. And I think we can separate the snapshot diff calculation into two parts, a metadata snapshot diff and a data snapshot diff, and correspondingly implement two new API methods at the file system layer: metadata_diff and data_diff.
>
> To implement the metadata_diff, I think we can add a bit of code to the "handle_client_readdir" method to make it capable of selecting those entries that have been modified between two snapshots. If clients want to do a diff calculation down the subtree, all they need to do is recursively issue this diff-capable "readdir" down the subtree. For the data_diff part, I think we can simply model it on the way rbd does it. So the whole export-diff process can go like this: first, the "snapmirror" daemon does the one-directory metadata_diff on the target directory; then it does a data_diff on the files that have been modified between the two snapshots, and meanwhile does the one-directory metadata_diff on the subdirs that have been modified; then, for each of those subdirs, the "snapmirror" daemon repeats the above two steps.

I think that in order to make the metadata_diff efficient, we still need to rely on the rstats. For example, if you modify the file /a/b/c/d/e/f/g, then g's ctime will change, but you'll still have to traverse the entire hierarchy in order to discover that. The rctime-informed search will let us efficiently find those changes.

...and if we have the rctimes, then the only real difference is whether we do a full readdir for the directory and look at each file's ctime and rctime, or whether that filtering is done on the MDS side. It's probably a bit faster with the MDS's help, but it needs a special tool, while simply modifying rsync to use rctime would work almost as well. (A rough client-side sketch of such an rctime-pruned walk is appended after the quoted thread at the end of this message.)

> For the import-diff part, I think we can go this way: first, apply the diffs of files in the target subtree, and then "setattr" all the files and directories that have been modified between the two snapshots to make their metadata exactly the same as their counterparts on the source filesystem.

rsync does this with the appropriate options.

It seems like the weak link in all of this is making sure the rctimes are coherent/correct. We could have a fs-wide sync-like operation that flushes all of the rstats down to the root or something. Am I missing something?

> For the metadata_diff, I've already tried to implement a prototype: https://github.com/xxhdx1985126/ceph/commit/d55ea8a23738268e19e7dd6a43bb1a89929e9d22
> Please take a look if any of you guys have the time:-)

The change looks reasonable, but I think it's an optimization, and much less important than avoiding a traversal of the entire namespace...

sage

> If this approach sounds reasonable to you, can I add a feature to the Ceph issue tracker and go on implementing it?
> Thank you:-)
>
> On 12 May 2018 at 11:34, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > On Fri, 11 May 2018, Xuehan Xu wrote:
> >> > Given the description here, I think it would be much easier and more efficient to implement sync at the CephFS level. The simplest option is a "smart rsync" that is aware of CephFS' recursive statistics. You then start at the root, look at each file or folder to see if its (recursive) timestamp is newer than your last sync[1], and if it is, you check out the children. Do an rsync on each individual file, and profit!
> >> >
> >> > Now you have some choices to make on the consistency model — as Sage says, you may want to take snapshots and do this work on a snapshot so files are all consistent with each other, though that does impose more work to take and clean up snapshots. Or you may just do it on a file-by-file basis, in which case it's even easier to just do this from a normal client. (Which is not to say you have to; you could also build it inside the MDS or in some special privileged daemon or something.)
> >> >
> >> > I find an approach like this attractive for a few reasons. First of all, it's less invasive to the core code, which means it interacts with fewer parts of the system, is easier to maintain in itself, and doesn't impose costs on maintaining other parts of Ceph. Second, CephFS is unlike RADOS in that it has a centralized, consistent set of metadata for tracking all its data. We can take advantage of that metadata (or *add* to it!) to make the job of synchronizing easier. In RADOS, we are stuck with running algorithms that parallelize, and being very careful to minimize the central coordination points. That's good for scale, but very bad for ease of understanding and development.
> >> > -Greg
> >> > [1]: This is *slightly* trickier than I make it sound to get right, as the rstats are updated *lazily*. So you may run a sync at 16:00:00 and miss a file down the tree that was updated at 15:59:10 because that change hasn't gotten all the way to the root. You'd probably want a database associated with the sync daemon that tracks the timestamps you saw at the previous sync. If you wanted to build this into the MDS system with its own metadata, you could look at how forward scrub works for a model.
> >>
> >> I think this "smart rsync" should be an appropriate way to meet our current need. And I think that maybe we can reuse the snapshot mechanism in this "smart sync". When we find that some files have been modified, we make snapshots only for those files that are going to be copied, and apply the diffs between snapshots to the other clusters. In this way, I think we should be able to save the bandwidth (network and disk) used for copying unmodified areas of files. Is this right?
> >
> > Yes, in that if you teach rsync about cephfs recursive stats, you can just as easily run it on a snapshot as on the live data. One other comment, too: as Greg mentioned, the cephfs rstats are somewhat lazily propagated to the client. However, I can also imagine, if we go down this path, that we might work out a way for the client to request the latest, consistent rstats from the MDS.
> > (I think we may already have this for the snapshot view, or be quite close to it--Zheng would know best as he's been actively working on this code.)
> >
> > In any case, I think it would be awesome to try to modify normal rsync in a way that makes it notice it is on a CephFS mount and, if so, use the rctime to avoid checking subdirectories that have not changed... either via a special flag, or automatically based on the fs type returned by statfs.
> >
> >> Finally, I think, although we don't have an urgent need for an "ops replication" mechanism in CephFS for now, we should take precautions. So basically, I think maybe we can implement the final "top-level" cephfs replication in three steps: first, we implement a "smart rsync"; then a replication mechanism with point-in-time consistency at the file level; and finally, when we have the manpower and the resources, a replication mechanism with system-wide point-in-time consistency. Does this sound reasonable to you? Thanks.
> >
> > I'm not immediately sure what steps 2 and 3 would look like from this description, but I agree that rsync should be step 1.
> >
> > I have one other option, though, that may or may not be what you had in mind. Let's call it the "snapmirror" mode, since it's roughly what the netapp feature with that name is doing:
> >
> > 1- Automatically take snapshot S of the source filesystem.
> >
> > 2- Do an efficient traversal of the new snapshot S vs the previous one (S-1) and send the diff to the remote site. Once the whole diff is transferred, apply it to the remote fs. (Or, on the remote site, apply the changes directly; if we fail to do the whole thing, roll back to snapshot S-1.)
> >
> > 3- Take snapshot S on the remote fs.
> >
> > 4- Remove snapshot S-1 on the local and remote fs.
> >
> > 5- Wait N minutes, then repeat.
> >
> > I think the key here, in the end, is to make the setup/configuration of this sort of thing easy to understand, and to provide an easy, transparent view of the current sync status.
> >
> > sage
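To make the rctime-pruned walk concrete, here is a minimal client-side sketch in Python: walk the newer snapshot of a subtree, skip any subdirectory whose recursive change time (the ceph.dir.rctime virtual xattr) predates the older snapshot, and report entries whose ctime is newer. The .snap directories and the ceph.dir.rctime xattr are existing CephFS features; the mount paths, snapshot names, cutoff handling, and the simplified rctime parsing are illustrative assumptions rather than a finished tool.

#!/usr/bin/env python3
# Client-side sketch of the export-side metadata diff: walk the newer snapshot
# and emit paths changed since the older snapshot, pruning subtrees whose
# recursive ctime (rctime) predates it.  Helper names and the rctime parsing
# are illustrative assumptions, not an existing Ceph API.
import os

def rctime(path):
    """Read CephFS's ceph.dir.rctime virtual xattr as (roughly) epoch seconds."""
    raw = os.getxattr(path, "ceph.dir.rctime").decode()
    return float(raw.split(".")[0])            # ignore the nanosecond part

def changed_since(snap_root, cutoff):
    """Yield paths under snap_root whose ctime is newer than cutoff."""
    for entry in os.scandir(snap_root):
        st = entry.stat(follow_symlinks=False)
        if entry.is_dir(follow_symlinks=False):
            if rctime(entry.path) <= cutoff:   # nothing below here changed
                continue
            yield from changed_since(entry.path, cutoff)
        if st.st_ctime > cutoff:
            yield entry.path

if __name__ == "__main__":
    # .snap/<name> is how CephFS exposes snapshots of a directory.  Crude
    # cutoff: the old snapshot dir's ctime; a real tool would record the
    # snapshot creation time explicitly.
    old = os.stat("/mnt/cephfs/project/.snap/S1").st_ctime
    for path in changed_since("/mnt/cephfs/project/.snap/S2", old):
        print(path)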
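Greg's footnote about lazily propagated rstats suggests keeping a small database of the timestamps seen at the previous sync. A sketch of that idea, again only illustrative: rather than comparing ceph.dir.rctime against the wall-clock time of the last run (which can miss updates that had not yet propagated to the root), record the rctime observed for each directory and revisit a subtree whenever the value changes. The state-file location and format are assumptions.

#!/usr/bin/env python3
# Sketch of the per-directory timestamp "database" from the quoted footnote:
# remember the ceph.dir.rctime seen for each directory at the previous sync
# and descend again whenever it changes, instead of comparing against the
# wall-clock time of the last run.  Paths and the JSON store are assumptions.
import json
import os

STATE_FILE = "/var/lib/cephfs-sync/rctimes.json"    # assumed location

def rctime(path):
    return os.getxattr(path, "ceph.dir.rctime").decode()

def dirs_to_revisit(root, seen):
    """Yield directories whose rctime differs from the value recorded last run."""
    stack = [root]
    while stack:
        d = stack.pop()
        cur = rctime(d)
        if seen.get(d) == cur:
            continue                             # nothing below here changed
        yield d                                  # needs a file-level sync pass
        seen[d] = cur
        stack.extend(e.path for e in os.scandir(d)
                     if e.is_dir(follow_symlinks=False))

if __name__ == "__main__":
    try:
        with open(STATE_FILE) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}
    for d in dirs_to_revisit("/mnt/cephfs/project", seen):
        print("changed directory:", d)
    with open(STATE_FILE, "w") as f:
        json.dump(seen, f)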
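Finally, a rough sketch of the "snapmirror" cycle from the quoted message, under the assumption that both filesystems are mounted locally and that a plain rsync of the new snapshot stands in for the real diff transfer. Creating and removing snapshots via mkdir/rmdir under .snap is standard CephFS behaviour; the mount points, snapshot naming, and interval are illustrative.

#!/usr/bin/env python3
# Rough sketch of the "snapmirror" cycle (steps 1-5 in the quoted message).
# Snapshots are taken/removed via mkdir/rmdir under CephFS's .snap directory;
# a plain rsync of the new snapshot stands in for the real diff transfer.
# Mount points, snapshot names and the interval are illustrative assumptions.
import os
import subprocess
import time

SRC = "/mnt/src-fs/project"        # subtree being mirrored (assumed)
DST = "/mnt/dst-fs/project"        # remote fs, assumed mounted locally
INTERVAL = 10 * 60                 # step 5: "wait N minutes"

def take_snap(root, name):
    os.mkdir(os.path.join(root, ".snap", name))

def drop_snap(root, name):
    os.rmdir(os.path.join(root, ".snap", name))

def mirror_forever():
    gen = 0
    take_snap(SRC, f"S{gen}")                      # initial baseline snapshot
    subprocess.run(["rsync", "-a", "--delete",
                    f"{SRC}/.snap/S{gen}/", f"{DST}/"], check=True)
    take_snap(DST, f"S{gen}")
    while True:
        time.sleep(INTERVAL)                       # step 5
        gen += 1
        take_snap(SRC, f"S{gen}")                  # step 1
        # Step 2: a real tool would only transfer the differences found by an
        # rctime-pruned walk of S{gen} vs S{gen-1}; rsync is the stand-in here.
        subprocess.run(["rsync", "-a", "--delete",
                        f"{SRC}/.snap/S{gen}/", f"{DST}/"], check=True)
        take_snap(DST, f"S{gen}")                  # step 3
        drop_snap(SRC, f"S{gen - 1}")              # step 4
        drop_snap(DST, f"S{gen - 1}")

if __name__ == "__main__":
    mirror_forever()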