Um... Sorry, I just read my Algorithm 17 again, and it seems that it
doesn't include condition 2. I think I just got things confused; it was
04:00 AM and I was really sleepy then. Please forgive me. I'll add this
to that algorithm. Really sorry.

On 3 August 2017 at 04:05, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
> Hi, Sage and Joao
>
> I think there was something I didn't make clear just now, sorry.
>
> About the second issue, by "ensuring consistency of cross-object
> operations", I mean:
>
> Say an rbd write X involves two RADOS objects A and B, and turns
> them into A1 and B1 respectively, and a following rbd write Y turns
> them into A2 and B2. When these two operations are replicated, the
> result on the backup cluster can't be A1, B2 or A2, B1.
>
> So, just like Joao said, this is all about ORDERING.
>
> My approach is like this: since RADOS can guarantee the order of OPs
> coming from the same client and targeting the same object, we have all
> "repop"s within the same rbd operation forwarded to the same
> intermediate node, and the intermediate node forwards these "repop"s
> to the backup cluster on two conditions: 1) all "repop"s within the
> same rbd operation have arrived at the intermediate node; 2) all rbd
> operations whose ids are less than that of the current rbd operation
> (the id is a monotonically increasing integer that uniquely identifies
> an rbd operation, and the order of the ids indicates the order in
> which the operations are created by the librbd client) have been sent
> to the backup cluster (sent only; they need not yet be replicated or
> "ondisk" on the backup cluster).
>
> With these two constraints, I think we can ensure the order of rbd
> operations. The first condition makes sure that either all "repop"s
> are sent to the backup cluster or none of them are, which can ensure
> the consistency of the resulting rbd image if the master cluster
> crashes when only part of an rbd operation has been forwarded to the
> intermediate node.
> The second condition makes sure that rbd operations are replicated to
> the backup cluster in the order in which they are created by librbd
> clients. And since RADOS can guarantee the order of OPs coming from
> the same client and targeting the same object, we don't have to wait
> until an rbd operation's predecessor has been replicated before
> sending it; it is enough that the predecessor has been issued to the
> backup cluster. I think this should be able to preserve the throughput
> of the inter-cluster replication procedure.
>
> The details of this approach are shown in Algorithm 17.
>
> And since we are implementing this at the RADOS level, we shouldn't
> directly process "rbd" operations here. So, I think we should
> introduce the concept of an "object set" to map to concepts of the
> upper-level system, such as "rbd image".
>
> I don't know if I'm considering this in the right way, and I'm looking
> forward to your opinion. Thanks very much :-)
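
To make the two forwarding conditions above concrete, here is a rough
Python sketch of the buffering an intermediate node might do. This is
not Algorithm 17 and not Ceph code; the names Repop, IntermediateNode,
send and total are made up for illustration. It also assumes that
operation ids are consecutive integers starting at 1 and that every
"repop" carries the number of "repop"s in its operation, neither of
which is stated in the mail.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Repop:
    """Hypothetical per-object update belonging to one rbd operation."""
    op_id: int   # id of the enclosing rbd operation (assumed consecutive)
    total: int   # assumed: number of repops that make up the operation
    obj: str     # target RADOS object name
    data: str    # new object content


class IntermediateNode:
    """Buffers repops and forwards whole operations in id order."""

    def __init__(self, send, first_id=1):
        self.send = send                  # callable that issues one repop
        self.next_id = first_id           # smallest op id not yet sent
        self.pending = defaultdict(list)  # op_id -> repops received so far

    def receive(self, repop):
        self.pending[repop.op_id].append(repop)
        self._flush()

    def _flush(self):
        while True:
            batch = self.pending.get(self.next_id)
            # Condition 1: every repop of the operation has arrived.
            if batch is None or len(batch) < batch[0].total:
                return
            # Condition 2 holds by construction: next_id only advances
            # after all operations with smaller ids have been issued.
            for r in batch:
                self.send(r)
            del self.pending[self.next_id]
            self.next_id += 1
```

With writes X (id 1) and Y (id 2) from the example, if both "repop"s
of Y arrive first, nothing is issued until X is complete; after that
the node issues A1, B1, A2, B2 in that order, so the backup cluster
cannot end up with A1, B2 or A2, B1. Note that the node only waits for
earlier operations to be issued, not acknowledged.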