Hi Xuehan,

On Sun, 3 Jun 2018, Xuehan Xu wrote:

> Hi, sage. I think this "snapmirror" way may be better, since it doesn't involve the "rstat" lazy-updating problem and it wouldn't copy non-modified files.
>
> As you said, to go this way, we have to do a recursive snapshot diff calculation down the subtree that we are replicating, and send the diff to the remote site. I think maybe a similar way to rbd's "export-diff/import-diff" mechanism should be taken, so the whole replication process can be separated into two sub-processes: "export-diff" of the subtree on the source filesystem and "import-diff" on the remote filesystem. And I think we can separate the snapshot diff calculation into two parts, a metadata snapshot diff and a data snapshot diff, and correspondingly implement two new API methods at the file system layer: metadata_diff and data_diff.
>
> To implement the metadata_diff, I think we can add a bit of code to the "handle_client_readdir" method to make it capable of selecting those entries that have been modified between two snapshots. If clients want to do a diff calculation down the subtree, all they need to do is recursively issue this diff-capable "readdir" down the subtree. For the data_diff part, I think we can simply model it on the way rbd does it. So the whole export-diff process can go like this: first, the "snapmirror" daemon does the one-directory metadata_diff on the target directory; then it does a data_diff on the files that have been modified between the two snapshots, and meanwhile does the one-directory metadata_diff on the subdirs that have been modified; then, for each of those subdirs, the "snapmirror" daemon repeats the above two steps.

I think that in order to make the metadata_diff efficient, we still need to rely on the rstats. For example, if you modify the file /a/b/c/d/e/f/g, then g's ctime will change, but you'll still have to traverse the entire hierarchy in order to discover that. The rctime-informed search will let us efficiently find those changes.

...and if we have the rctimes, then the only real difference is whether we do a full readdir for the directory and look at each file's ctime and rctime, or whether that filtering is done on the MDS side. It's probably a bit faster with the MDS's help, but it needs a special tool, while simply modifying rsync to use rctime would work almost as well. (A rough client-side sketch of such an rctime-pruned walk is appended after the quoted thread at the end of this message.)

> For the import-diff part, I think we can go this way: first, apply the diffs of files in the target subtree, and then "setattr" all the files and directories that have been modified between the two snapshots to make their metadata exactly the same as their counterparts on the source filesystem.

rsync does this with the appropriate options.

It seems like the weak link in all of this is making sure the rctimes are coherent/correct. We could have a fs-wide sync-like operation that flushes all of the rstats down to the root or something. Am I missing something?

> For the metadata_diff, I've already tried to implement a prototype: https://github.com/xxhdx1985126/ceph/commit/d55ea8a23738268e19e7dd6a43bb1a89929e9d22
> Please take a look if any of you guys have the time:-)

The change looks reasonable, but I think it's an optimization, and much less important than avoiding a traversal of the entire namespace...

sage

> If this approach sounds reasonable to you, can I add a feature to the Ceph issue tracker and go on implementing it?
> Thank you:-)
>
> On 12 May 2018 at 11:34, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > On Fri, 11 May 2018, Xuehan Xu wrote:
> >> > Given the description here, I think it would be much easier and more efficient to implement sync at the CephFS level. The simplest option is a "smart rsync" that is aware of CephFS' recursive statistics. You then start at the root, look at each file or folder to see if its (recursive) timestamp is newer than your last sync[1], and if it is, you check out the children. Do an rsync on each individual file, and profit!
> >> >
> >> > Now you have some choices to make on the consistency model — as Sage says, you may want to take snapshots and do this work on a snapshot so files are all consistent with each other, though that does impose more work to take and clean up snapshots. Or you may just do it on a file-by-file basis, in which case it's even easier to just do this from a normal client. (Which is not to say you have to; you could also build it inside the MDS or in some special privileged daemon or something.)
> >> >
> >> > I find an approach like this attractive for a few reasons. First of all, it's less invasive to the core code, which means it interacts with fewer parts of the system, is easier to maintain in itself, and doesn't impose costs on maintaining other parts of Ceph. Second, CephFS is unlike RADOS in that it has a centralized, consistent set of metadata for tracking all its data. We can take advantage of that metadata (or *add* to it!) to make the job of synchronizing easier. In RADOS, we are stuck with running algorithms that parallelize, and being very careful to minimize the central coordination points. That's good for scale, but very bad for ease of understanding and development.
> >> > -Greg
> >> > [1]: This is *slightly* trickier than I make it sound to get right, as the rstats are updated *lazily*. So you may run a sync at 16:00:00 and miss a file down the tree that was updated at 15:59:10 because that change hasn't gotten all the way to the root. You'd probably want a database associated with the sync daemon that tracks the timestamps you saw at the previous sync. If you wanted to build this into the MDS system with its own metadata, you could look at how forward scrub works for a model.
> >>
> >> I think this "smart rsync" should be an appropriate way to meet our current need. And I think that maybe we can reuse the snapshot mechanism in this "smart sync". When we find that some files have been modified, we make snapshots only for those files that are going to be copied, and apply the diffs between snapshots to the other clusters. In this way, I think we should be able to save the bandwidth (network and disk) used for copying unmodified areas of files. Is this right?
> >
> > Yes, in that if you teach rsync about cephfs recursive stats, you can just as easily run it on a snapshot as on the live data. One other comment, too: as Greg mentioned, the cephfs rstats are somewhat lazily propagated to the client. However, I can also imagine, if we go down this path, that we might work out a way for the client to request the latest, consistent rstats from the MDS.
> > (I think we may already have this for the snapshot view, or be quite close to it--Zheng would know best as he's been actively working on this code.)
> >
> > In any case, I think it would be awesome to try to modify normal rsync in a way that makes it notice it is on a CephFS mount and, if so, use the rctime to avoid checking subdirectories that have not changed... either via a special flag, or automatically based on the fs type returned by statfs.
> >
> >> Finally, I think, although we don't have an urgent need for an "ops replication" mechanism in CephFS for now, we should take precautions. So basically, I think maybe we can implement the final "top-level" cephfs replication in three steps: first, we implement a "smart rsync"; then a replication mechanism with point-in-time consistency at the file level; and finally, when we have the manpower and the resources, a replication mechanism with system-wide point-in-time consistency. Does this sound reasonable to you? Thanks.
> >
> > I'm not immediately sure what steps 2 and 3 would look like from this description, but I agree that rsync should be step 1.
> >
> > I have one other option, though, that may or may not be what you had in mind. Let's call it the "snapmirror" mode, since it's roughly what the netapp feature with that name is doing:
> >
> > 1- Automatically take snapshot S of the source filesystem.
> >
> > 2- Do an efficient traversal of the new snapshot S vs the previous one (S-1) and send the diff to the remote site. Once the whole diff is transferred, apply it to the remote fs. (Or, on the remote site, apply the changes directly; if we fail to do the whole thing, roll back to snapshot S-1.)
> >
> > 3- Take snapshot S on the remote fs.
> >
> > 4- Remove snapshot S-1 on the local and remote fs.
> >
> > 5- Wait N minutes, then repeat.
> >
> > I think the key here, in the end, is to make the setup/configuration of this sort of thing easy to understand, and to provide an easy, transparent view of the current sync status.
> >
> > sage
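To make the rctime-pruned walk concrete, here is a minimal client-side sketch in Python: walk the newer snapshot of a subtree, skip any subdirectory whose recursive change time (the ceph.dir.rctime virtual xattr) predates the older snapshot, and report entries whose ctime is newer. The .snap directories and the ceph.dir.rctime xattr are existing CephFS features; the mount paths, snapshot names, cutoff handling, and the simplified rctime parsing are illustrative assumptions rather than a finished tool.

#!/usr/bin/env python3
# Client-side sketch of the export-side metadata diff: walk the newer snapshot
# and emit paths changed since the older snapshot, pruning subtrees whose
# recursive ctime (rctime) predates it.  Helper names and the rctime parsing
# are illustrative assumptions, not an existing Ceph API.
import os

def rctime(path):
    """Read CephFS's ceph.dir.rctime virtual xattr as (roughly) epoch seconds."""
    raw = os.getxattr(path, "ceph.dir.rctime").decode()
    return float(raw.split(".")[0])            # ignore the nanosecond part

def changed_since(snap_root, cutoff):
    """Yield paths under snap_root whose ctime is newer than cutoff."""
    for entry in os.scandir(snap_root):
        st = entry.stat(follow_symlinks=False)
        if entry.is_dir(follow_symlinks=False):
            if rctime(entry.path) <= cutoff:   # nothing below here changed
                continue
            yield from changed_since(entry.path, cutoff)
        if st.st_ctime > cutoff:
            yield entry.path

if __name__ == "__main__":
    # .snap/<name> is how CephFS exposes snapshots of a directory.  Crude
    # cutoff: the old snapshot dir's ctime; a real tool would record the
    # snapshot creation time explicitly.
    old = os.stat("/mnt/cephfs/project/.snap/S1").st_ctime
    for path in changed_since("/mnt/cephfs/project/.snap/S2", old):
        print(path)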
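Greg's footnote about lazily propagated rstats suggests keeping a small database of the timestamps seen at the previous sync. A sketch of that idea, again only illustrative: rather than comparing ceph.dir.rctime against the wall-clock time of the last run (which can miss updates that had not yet propagated to the root), record the rctime observed for each directory and revisit a subtree whenever the value changes. The state-file location and format are assumptions.

#!/usr/bin/env python3
# Sketch of the per-directory timestamp "database" from the quoted footnote:
# remember the ceph.dir.rctime seen for each directory at the previous sync
# and descend again whenever it changes, instead of comparing against the
# wall-clock time of the last run.  Paths and the JSON store are assumptions.
import json
import os

STATE_FILE = "/var/lib/cephfs-sync/rctimes.json"    # assumed location

def rctime(path):
    return os.getxattr(path, "ceph.dir.rctime").decode()

def dirs_to_revisit(root, seen):
    """Yield directories whose rctime differs from the value recorded last run."""
    stack = [root]
    while stack:
        d = stack.pop()
        cur = rctime(d)
        if seen.get(d) == cur:
            continue                             # nothing below here changed
        yield d                                  # needs a file-level sync pass
        seen[d] = cur
        stack.extend(e.path for e in os.scandir(d)
                     if e.is_dir(follow_symlinks=False))

if __name__ == "__main__":
    try:
        with open(STATE_FILE) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}
    for d in dirs_to_revisit("/mnt/cephfs/project", seen):
        print("changed directory:", d)
    with open(STATE_FILE, "w") as f:
        json.dump(seen, f)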
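Finally, a rough sketch of the "snapmirror" cycle from the quoted message, under the assumption that both filesystems are mounted locally and that a plain rsync of the new snapshot stands in for the real diff transfer. Creating and removing snapshots via mkdir/rmdir under .snap is standard CephFS behaviour; the mount points, snapshot naming, and interval are illustrative.

#!/usr/bin/env python3
# Rough sketch of the "snapmirror" cycle (steps 1-5 in the quoted message).
# Snapshots are taken/removed via mkdir/rmdir under CephFS's .snap directory;
# a plain rsync of the new snapshot stands in for the real diff transfer.
# Mount points, snapshot names and the interval are illustrative assumptions.
import os
import subprocess
import time

SRC = "/mnt/src-fs/project"        # subtree being mirrored (assumed)
DST = "/mnt/dst-fs/project"        # remote fs, assumed mounted locally
INTERVAL = 10 * 60                 # step 5: "wait N minutes"

def take_snap(root, name):
    os.mkdir(os.path.join(root, ".snap", name))

def drop_snap(root, name):
    os.rmdir(os.path.join(root, ".snap", name))

def mirror_forever():
    gen = 0
    take_snap(SRC, f"S{gen}")                      # initial baseline snapshot
    subprocess.run(["rsync", "-a", "--delete",
                    f"{SRC}/.snap/S{gen}/", f"{DST}/"], check=True)
    take_snap(DST, f"S{gen}")
    while True:
        time.sleep(INTERVAL)                       # step 5
        gen += 1
        take_snap(SRC, f"S{gen}")                  # step 1
        # Step 2: a real tool would only transfer the differences found by an
        # rctime-pruned walk of S{gen} vs S{gen-1}; rsync is the stand-in here.
        subprocess.run(["rsync", "-a", "--delete",
                        f"{SRC}/.snap/S{gen}/", f"{DST}/"], check=True)
        take_snap(DST, f"S{gen}")                  # step 3
        drop_snap(SRC, f"S{gen - 1}")              # step 4
        drop_snap(DST, f"S{gen - 1}")

if __name__ == "__main__":
    mirror_forever()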