> Hi Xuehan,
>
> On Fri, 4 May 2018, Xuehan Xu wrote:
>> Hi, sage.
>>
>> I'm really sorry for bothering you. But we really want to know your
>> opinion about our motives for implementing a RADOS-level replication,
>> so I think I should fully illustrate them to you.
>
> It is no bother at all--thanks for reaching out! Do you mind if I add
> ceph-devel to the thread?

Of course not, +devel :-)

>
>> The main reason is that we need to replicate our data in CephFS
>> between different data centers, because some of our clients want the
>> data produced in one cluster to be replicated to clusters in other
>> data centers quickly. We did consider implementing such a mechanism
>
> I think the key question here is what kind of consistency we are
> looking for in the replica/backup cluster. At the block layer, nothing
> except point-in-time consistency really works, because there is a local
> file system sitting on top with specific expectations about consistency.
> For a file system, it really depends on what the user/application is
> storing there. If it's something like a database then the same
> consistency requirements apply, but if it's home directories, or a
> document archive, or a bunch of rendering assets, or probably most
> other random things, then something like point-in-time consistency for
> individual files is probably sufficient.
>
> The other question is about the latency. A simple strategy might be to
> periodically snapshot the entire cephfs file system (say, every hour)
> and efficiently sync things across. Or even every 10 minutes. If we
> pushed this hard enough, we could probably get down to ~1 minute
> intervals? Might be worth a look. On the receiving end we'd need to
> sort out how to stage the complete set of changes before applying them.
> (I'm not certain, but I think this is roughly how NetApp's SnapMirror
> feature works.. I think the lag is usually measured in minutes.)
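
For concreteness, the snapshot-and-sync idea above could be prototyped
with nothing fancier than the loop below (the mount point, destination
and interval are made up for illustration, and snapshots have to be
enabled on the file system):

#!/usr/bin/env python
# Rough sketch: take a CephFS snapshot every INTERVAL seconds and rsync
# the frozen snapshot (not the live tree) to the backup site, so the
# receiver always sees a point-in-time image of the namespace.
import os
import subprocess
import time

SRC = "/mnt/cephfs"                  # primary cluster mount (assumed)
DST = "backup-site:/mnt/cephfs"      # rsync destination (assumed)
INTERVAL = 600                       # sync period in seconds

def take_snapshot(name):
    # A CephFS snapshot is just a mkdir inside the special .snap directory.
    os.mkdir(os.path.join(SRC, ".snap", name))

def drop_snapshot(name):
    os.rmdir(os.path.join(SRC, ".snap", name))

while True:
    snap = "repl-%d" % int(time.time())
    take_snapshot(snap)
    subprocess.check_call([
        "rsync", "-a", "--delete",
        os.path.join(SRC, ".snap", snap) + "/",
        DST + "/",
    ])
    drop_snapshot(snap)
    time.sleep(INTERVAL)

A smarter sync step would diff consecutive snapshots rather than letting
rsync re-walk the whole tree, but the overall structure is the same.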
> I've seen enterprise systems that go to extreme lengths to have very
> strong volume-wide consistency for their async replication; other
> systems are pretty ad hoc and opportunistic (think rsync in a loop).
> Both models seem to have happy users.
>
> Implementing this at the RADOS pool level means that we are stuck with
> providing point-in-time consistency, because things on top (cephfs in
> this case) expect it. If we go up a layer we have more flexibility.
>
> Also, doing it a layer up means we can consider other models: what
> about active/active writes? If we have a last-writer-wins and/or other
> conflict resolution approach, you could have active users on both
> cephfs replicas. Or what about 1:N replication?
>
>> at the CephFS layer, but as you and other people said, it would double
>> all the I/O, which could lead to performance degradation. So we want
>> to find a way to accomplish the replication without severely damaging
>> the performance of the main cluster, and we found that if we can reuse
>> the OSD journal as a cache for the replication, we can probably
>> achieve that goal. We also noticed that the OSD journal only exists in
>> FileStore, but we thought that, if it's worth it, we could implement a
>> similar journal in other ObjectStores like BlueStore. Although
>> implementing such a journal at the ObjectStore level also doubles the
>> I/O, we thought it should be better than implementing it at a higher
>> level like CephFS, since it avoids doubling the I/O processing steps
>> between the ObjectStore level and the CephFS level, and we thought
>> those processing steps could be expensive when the underlying disk is
>> a fast SSD or NVMe device.
>
> This is a good point. Another approach we've thought about a bit is
> exposing the CLONE operation and using that to construct an explicit
> journal object. E.g., a write operation would write to object A and
> then clone that same data range to the end of a log object as well.
> There are still some additional writes for the description of the IO in
> the log object, but in the bulk/streaming write case we mostly don't
> write the data twice.
>
>> Also, I found that there were some things about our design that I
>> didn't make clear during the CDM yesterday.
>>
>> The first is that, in our design, the main cluster doesn't have to
>> wait indefinitely if the backup cluster goes down or something like
>> that happens. Only the replication needs to be suspended. During this
>> suspension, objects that are modified are marked as
>> need-full-replication, and when the replication resumes, these objects
>> will be fully copied to the backup cluster. It's only while an object
>> is undergoing such a full copy that ops targeting it need to wait,
>> much like ops wait when their target objects are being recovered or
>> backfilled.
>
> I see... This makes sense, although I worry that it would be a bit hard
> to reason about and debug because it's operating/coordinating at a
> couple different layers of the stack.
>
>> The second is that, in our online clusters, we configured our OSD
>> journal space to 20GB. So we thought that if we can use 50% of it for
>> the replication, it should provide some tolerance for jitter in the
>> bandwidth of the replication.
>>
>> And with regard to flexibility of the replication, we also considered
>> some of that. For example, in our design, the replication of an object
>> is triggered by an op with its "need_replicate" field set to true, so
>> that the user can decide whether their data needs the replication.
>
> Thanks for being open to a discussion, but also don't assume I have the
> right answer either :). For something like this I think there are lots
> of answers, and which is best is a balance of a bunch of different
> things.
>
> One way or another, I *am* very interested in getting a multi-site
> CephFS replication implementation of some sort in place. Exactly how we
> do it and what the properties are is still a bit of an open question,
> though, so this is perfect timing--and I'm eager to hear what
> requirements you see to help choose an approach!
>
> Thanks,
> sage
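
For concreteness, below is a very rough sketch of what the clone-based
log object idea could look like from the client side, written against
the Python librados bindings. The pool, object and log names are made
up, and the clone step itself is shown as a plain append, because no
clone-range style op is exposed to clients today -- replacing that
append with a server-side clone is exactly the piece that would need
new plumbing:

#!/usr/bin/env python
# Sketch only: write to the target object, then record the IO in a log
# object that a replication agent could later stream to the backup site.
import json
import struct
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("data")      # example pool name

LOG_OBJ = "repl_log.0"                  # made-up log object name

def replicated_write(oid, data, offset=0):
    # 1. The normal write to the target object.
    ioctx.write(oid, data, offset)
    # 2. A small description of the IO, appended to the log object.
    desc = json.dumps({"oid": oid, "off": offset, "len": len(data)}).encode()
    ioctx.append(LOG_OBJ, struct.pack(">I", len(desc)) + desc)
    # 3. The data itself, appended to the log. With an exposed CLONE op,
    #    this is the append that would instead become a clone of the range
    #    just written to `oid`, done inside the OSD, so the bulk data is
    #    not written (or shipped over the wire) a second time.
    ioctx.append(LOG_OBJ, data)

replicated_write("myobject", b"hello world")
ioctx.close()
cluster.shutdown()

Note that the two updates above are not atomic; keeping the target
object and the log consistent (and co-locating them so a clone is even
possible) is part of what an OSD-side implementation would have to
handle.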