Hi, Sage and Joao

I think there was something I didn't make clear just now, sorry.

About the second issue: by "ensuring consistency of cross-object operations", I mean the following. Say an rbd write X involves two RADOS objects A and B and turns them into A1 and B1 respectively, and a following rbd write Y turns them into A2 and B2. When these two operations are replicated, the result on the backup cluster must not be (A1, B2) or (A2, B1). So, just like Joao said, this is all about ORDERING.

My approach is like this: since RADOS can guarantee the order of OPs coming from the same client and targeting the same object, we make all "repop"s within the same rbd operation go to the same intermediate node, and the intermediate node forwards these "repop"s to the backup cluster on two conditions:

1) all "repop"s within the same rbd operation have arrived at the intermediate node;
2) all rbd operations whose ids are less than that of the current rbd operation have been sent to the backup cluster (only sent; they need not yet be replicated or "ondisk" on the backup cluster). The id is a monotonically increasing integer that uniquely identifies an rbd operation, and the order of the ids reflects the order in which the operations were created by the librbd client.

With these two constraints, I think we can ensure the order of rbd operations. The first condition makes sure that either all "repop"s of an operation are sent to the backup cluster or none of them are, which ensures the consistency of the resulting rbd image if the master cluster crashes when only part of an rbd operation has been forwarded to the intermediate node. The second condition makes sure that rbd operations are replicated to the backup cluster in the order in which they were created by librbd clients. And since RADOS can guarantee the order of OPs coming from the same client and targeting the same object, we don't have to wait until an operation's predecessors are fully replicated before forwarding it; it is enough that they have been issued to the backup cluster.
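To make the two forwarding conditions concrete, here is a minimal Python sketch of what the intermediate node could do. It is only an illustration, not Ceph code: the names Repop and IntermediateNode, the send callback, and the "total" field (the number of repops in an operation) are all hypothetical. It also assumes that operation ids are contiguous integers starting at a known value, so the node can tell when every smaller id has been sent; the proposal itself only requires the ids to be increasing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Repop:
    """One per-object modification belonging to an rbd operation (hypothetical type)."""
    op_id: int    # id of the rbd operation this repop belongs to
    obj: str      # name of the RADOS object it modifies
    payload: str  # stand-in for the actual modification
    total: int    # number of repops in this operation (assumed to be known)


class IntermediateNode:
    """Buffers repops and forwards them only when both conditions hold."""

    def __init__(self, send: Callable[[Repop], None], first_id: int = 1):
        self._send = send          # issues a repop to the backup cluster
        self._next_id = first_id   # smallest op id not yet sent (assumes contiguous ids)
        self._pending: Dict[int, List[Repop]] = defaultdict(list)

    def receive(self, repop: Repop) -> None:
        self._pending[repop.op_id].append(repop)
        self._flush()

    def _flush(self) -> None:
        # Forward operations strictly in id order. An operation is forwarded
        # only once all of its repops have arrived (condition 1) and every
        # operation with a smaller id has already been sent (condition 2).
        while True:
            batch = self._pending.get(self._next_id)
            if not batch or len(batch) < batch[0].total:
                return
            for repop in batch:
                self._send(repop)  # sent only; we do not wait for "ondisk"
            del self._pending[self._next_id]
            self._next_id += 1


def demo() -> List[str]:
    """Replay writes X (id 1) and Y (id 2) with their repops arriving out of order."""
    sent: List[Repop] = []
    node = IntermediateNode(sent.append)
    node.receive(Repop(2, "A", "A2", 2))
    node.receive(Repop(1, "A", "A1", 2))
    node.receive(Repop(2, "B", "B2", 2))
    node.receive(Repop(1, "B", "B1", 2))
    return [r.payload for r in sent]
```

In demo(), nothing is forwarded until the last repop of operation 1 arrives. At that point both operations are sent in id order, so the backup cluster never sees the mixed states (A1, B2) or (A2, B1).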
I think this should be able to preserve the throughput of the inter-cluster replication procedure. The details of this approach are shown in Algorithm 17.

And since we are implementing this at the RADOS level, we shouldn't directly process "rbd" operations here. So I think we should introduce the concept of an "object set" to map onto upper-level concepts such as an "rbd image".

I don't know if I'm thinking about this in the right way, and I'm looking forward to your opinions.

Thanks very much :-)