Re: About RADOS level replication

On Fri, May 4, 2018 at 4:16 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>> Hi Xuehan,
>>
>> On Fri, 4 May 2018, Xuehan Xu wrote:
>>> Hi, sage.
>>>
>>> I'm really sorry for bothering you. But we really want to know your
>>> opinion about our motives for implementing a RADOS level replication,
>>> so I think I should fully illustrate them to you.
>>
>> It is no bother at all--thanks for reaching out!  Do you mind if I add
>> ceph-devel to the thread?
>
> Of course not, +devel:-)
>
>>
>>> The main reason is that we have a need to replicate our data in CephFS
>>> between different data centers, because some of our clients want their
>>> data produced in one cluster to be replicated to other clusters in other
>>> data centers quickly. We did consider implementing such a mechanism
>>
>> I think the key question here is what kind of consistency we are
>> looking for in the replica/backup cluster.  At the block layer, nothing
>> except point-in-time consistency really works, because there is a local
>> file system sitting on top with specific expectations about consistency.
>> For a file system, it really depends on what the user/application is
>> storing there.  If it's something like a database then the same
>> consistency requirements apply, but if it's home directories, or a
>> document archive, or a bunch of rendering assets, or probably most other
>> random things, then probably something like point-in-time consistency for
>> individual files is sufficient.
>>
>> The other question is about the latency.  A simple strategy might be to
>> periodically snapshot the entire cephfs file system (say, every hour) and
>> efficiently sync things across.  Or even every 10 minutes.  If we pushed
>> this hard enough, we could probably get down to ~1 minute intervals?
>> Might be worth a look.  On the receiving end we'd need to sort out how to
>> stage the complete set of changes before applying them.  (I'm not certain
>> but I think this is roughly how NetApp's SnapMirror feature works.. I
>> think the lag is usually measured in minutes.)
>>
>> I've seen enterprise systems that go to extreme lengths to have very
>> strong volume-wide consistency for their async replication; other systems
>> are pretty ad hoc and opportunistic (think rsync in a loop).  Both models
>> seem to have happy users.
>>
>> Implementing this at the RADOS pool level means that we are stuck with
>> providing point-in-time consistency, because things on top (cephfs in this
>> case) expect it.  If we go up a layer we have more flexibility.
>>
>> Also, doing it a layer up means we can consider other models: what about
>> active/active writes?  If we have a last-writer-wins and/or other conflict
>> resolution approach you could have active users on both cephfs replicas.
>> Or what about 1:N replication?
>>
>>> at the CephFS layer, but as you and other people said, it would double
>>> all the I/O, which could lead to performance degradation. So we want
>>> to find a way to accomplish the replication without severely damaging
>>> the performance of the main cluster

Can you expand more on this "double all the I/O"? Were you thinking
about doing full data journaling the way rbd-mirror works?

Given the description here, I think it would be much easier and more
efficient to implement sync at the CephFS level. The simplest option
is a "smart rsync" that is aware of CephFS' recursive statistics. You
then start at the root, look at each file or folder to see if its
(recursive) timestamp is newer than your last sync[1], and if it is,
you check out the children. Do an rsync on each individual file, and
profit!
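
To make that concrete, here is a minimal sketch of such a pruning walk in
Python, assuming a mounted CephFS, the ceph.dir.rctime virtual xattr for the
recursive ctime, and plain rsync to the remote site. The mount point,
destination, state file and helper names are all illustrative, not part of
any existing tool:

import os
import subprocess

SRC = "/mnt/cephfs"           # assumed CephFS mount point
DST = "backup-site:/backup"   # assumed rsync destination
STATE = "/var/lib/cephfs-sync/last_sync"  # where we remember the last pass

def rctime(path):
    # ceph.dir.rctime is reported as "<seconds>.<nanoseconds>"
    raw = os.getxattr(path, "ceph.dir.rctime").decode()
    sec, _, nsec = raw.partition(".")
    return int(sec) + int(nsec or 0) / 1e9

def sync_tree(path, last_sync):
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            # prune whole subtrees whose recursive ctime predates the last pass
            if rctime(entry.path) > last_sync:
                sync_tree(entry.path, last_sync)
        elif entry.is_file(follow_symlinks=False):
            if entry.stat(follow_symlinks=False).st_ctime > last_sync:
                rel = os.path.relpath(entry.path, SRC)
                # rsync's "/./" marker preserves the relative path on the far side
                subprocess.run(["rsync", "-a", "--relative",
                                os.path.join(SRC, ".", rel), DST], check=True)

def last_sync_time():
    try:
        with open(STATE) as f:
            return float(f.read())
    except FileNotFoundError:
        return 0.0

sync_tree(SRC, last_sync_time())
# ...and record the new sync time in STATE once the pass has succeeded.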

Now you have some choices to make on the consistency model — as Sage
says, you may want to take snapshots and do this work on a snapshot so
files are all consistent with each other, though that does impose more
work to take and clean up snapshots. Or you may just do it on a
file-by-file basis, in which case it's even easier to just do this
from a normal client. (Which is not to say you have to; you could also
build it inside the MDS or in some special privileged daemon or
something.)
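
If you go the snapshot route, a rough sketch of the loop run from an ordinary
client could look like the following; the mount point, destination and
interval are assumptions, and the snapshots are just mkdir/rmdir inside the
.snap directory:

import os
import subprocess
import time

SRC = "/mnt/cephfs"           # assumed CephFS mount
DST = "backup-site:/backup"   # assumed rsync target
INTERVAL = 600                # e.g. every 10 minutes

def take_snapshot(name):
    # CephFS snapshots are created by mkdir inside the magic .snap directory
    os.mkdir(os.path.join(SRC, ".snap", name))

def drop_snapshot(name):
    os.rmdir(os.path.join(SRC, ".snap", name))

def sync_snapshot(name):
    # sync from the snapshot, so every file reflects the same point in time
    src = os.path.join(SRC, ".snap", name) + "/"
    subprocess.run(["rsync", "-a", "--delete", src, DST], check=True)

prev = None
while True:
    cur = "sync-%d" % int(time.time())
    take_snapshot(cur)
    sync_snapshot(cur)
    if prev:
        drop_snapshot(prev)   # keep only the last successfully synced snapshot
    prev = cur
    time.sleep(INTERVAL)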

I find an approach like this attractive for a few reasons.
First of all, it's less invasive to the core code, which means it
interacts with fewer parts of the system, is easier to maintain in
itself, and doesn't impose costs on maintaining other parts of Ceph.
Second, CephFS is unlike RADOS in that it has a centralized,
consistent set of metadata for tracking all its data. We can take
advantage of that metadata (or *add* to it!) to make the job of
synchronizing easier. In RADOS, we are stuck with running algorithms
that parallelize, and being very careful to minimize the central
coordination points. That's good for scale, but very bad for ease of
understanding and development.
-Greg
[1]: This is *slightly* trickier than I make it sound to get right, as
the rstats are updated *lazily*. So you may run a sync at 16:00:00 and
miss a file down the tree that was updated at 15:59:10 because that
change hasn't gotten all the way to the root. You'd probably want a
database associated with the sync daemon that tracks the timestamps
you saw at the previous sync. If you wanted to build this into the MDS
system with its own metadata you could look at how forward scrub works
for a model.
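
As an illustration of that footnote, one possible (purely hypothetical) shape
for the state tracking is to remember the rctime you last observed for each
directory and descend whenever it has changed, so a lazily propagated update
is at worst picked up on a later pass rather than lost:

import os
import sqlite3

SRC = "/mnt/cephfs"                      # assumed CephFS mount
DB = "/var/lib/cephfs-sync/rctimes.db"   # assumed state database

def rctime(path):
    return os.getxattr(path, "ceph.dir.rctime").decode()

def changed_dirs(conn, root):
    # Compare against the rctime we stored last time, not against wall-clock
    # "time of last sync", so lazy rstat propagation can't hide a change forever.
    conn.execute("CREATE TABLE IF NOT EXISTS seen"
                 " (path TEXT PRIMARY KEY, rctime TEXT)")
    stack = [root]
    while stack:
        d = stack.pop()
        cur = rctime(d)
        row = conn.execute("SELECT rctime FROM seen WHERE path = ?",
                           (d,)).fetchone()
        if row is None or row[0] != cur:
            yield d    # something at or below this directory changed
            stack.extend(e.path for e in os.scandir(d)
                         if e.is_dir(follow_symlinks=False))
            # A real daemon would record the new rctime only after the
            # corresponding files have actually been synced.
            conn.execute("INSERT OR REPLACE INTO seen VALUES (?, ?)", (d, cur))

with sqlite3.connect(DB) as conn:
    for d in changed_dirs(conn, SRC):
        print("needs sync:", d)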

>>> , and we found that if we can reuse
>>> the OSD journal as a cache for the replication, we can probably
>>> achieve that goal. And also, we noticed that the OSD journal only
>>> exists in FileStore, but we thought that, if it's worth it, we can
>>> implement a similar journal in other ObjectStores like BlueStore.
>>> Although implementing such a journal at the ObjectStore level also
>>> double the I/O, but we thought it should be better than implementing
>>> it at a higher level like CephFS, since it avoids doubling I/O
>>> processing steps between ObjectStore level and CephFS level, and we
>>> thought those processing steps could be expensive when the underlying
>>> disk is a fast SSD or NVMe device.
>>
>> This is a good point.  Another approach we've thought about a bit is
>> exposing the CLONE operation and using that to construct an explicit
>> journal object.  E.g., a write operation would write to object A and then
>> clone that same data range to the end of a log object as well.  There's
>> still some additional writes for the description of the IO in the log
>> file, but in the bulk/streaming write case, we mostly don't write the
>> data twice.
>>
>>> Also,  I found that, about our design, there were something that I
>>> didn't make myself clear during the CDM yesterday.
>>>
>>> The first is that, in our design, the main cluster doesn't have to
>>> wait infinitely if the backup cluster goes down or something like that
>>> happens. Only the replication needs to be suspended. During this
>>> suspension, objects that are modified are marked as
>>> need-full-replication, and when the replication resumes, these objects
>>> will be fully copied to the backup cluster. It's only when an object
>>> is undergoing a full copy that ops targeting it need to wait;
>>> this is like how ops wait when their target objects are being
>>> recovered or backfilled.
>>
>> I see... This makes sense, although I worry that it would be a bit hard
>> to reason about and debug because it's operating/coordinating at a couple different
>> layers of the stack.
>>
>>> The second is that, in our online clusters, we configured our OSD
>>> journal space to 20GB. So we thought if we can use 50% of it for the
>>> replication, it should be able to provide some tolerance for the
>>> jitter of the bandwidth of the replication.
>>>
>>> And with regard to flexibility of the replication, we also considered
>>> some of that. For example, in our design, the replication of an object
>>> is triggered by an op with its "need_replicate" field set to true, so
>>> that the user can decide whether their data needs the replication.
>>>
>>
>> Thanks for being open to a discussion, but also don't assume I have the
>> right answer either :).  For something like this there are lots of
>> answers, and which is best is a balance of a bunch of different things.
>>
>> One way or another, I *am* very interested in getting a multi-site CephFS
>> replication implementation of some sort in place.  Exactly how we do it
>> and what the properties are is still a bit of an open question, though, so
>> this is perfect timing--and I'm eager to hear what requirements you see to
>> help choose an approach!
>>
>> Thanks,
>> sage
>>


