On Wed, Jan 8, 2020 at 7:35 PM Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>
> > I added a few comments; my high level perspective is that it looks
> > like an approach for dealing with multiversioned extents, which might
> > be a component of rados pool level, point-in-time, globally consistent
> > snapshots for purposes like rados pool level cross-cluster
> > replication. However, that sort of thing would require a great deal
> > of higher level support, so I'd consider the disk layout portion to be
> > out of scope for now. Is there another use case you are hoping to
> > address with this?
>
> Hi, Sam. Thanks for reviewing the doc :-)
>
> The main focus of this initiative is doing efficient
> replication/backup. Specifically, we intend to provide higher level
> modules, especially rbd and cephfs, with the ability to take snapshots
> at a very high rate (one snapshot every 5 seconds, or even multiple
> snapshots within a second) and to do efficient snapshot diff and
> export-diff.

I didn't mention it explicitly, but the refcounts in the seastore doc's
lba tree are intended to permit extent sharing to support the existing
snapshot machinery via clone(). (A rough sketch of that extent-sharing
model follows at the end of this message.)

>
> We thought that, with this ability, upper level applications could
> achieve near real-time replication comparable to common op-by-op
> replication, but with less overhead, because it doesn't involve any
> extra replication-dedicated journal operations. Also, since the extents
> targeted by multiple write operations may overlap with each other,
> op-by-op replication -- even though it can also avoid extra journal
> operations -- inevitably replicates the overlapping extents multiple
> times, whereas in snapshot diff export only the latest version of the
> overlapping extents needs to be replicated.

It's not clear to me how this versioning scheme changes journaling or
pg logging. For recovery, we already track overlapping extents between
versions and use cloning appropriately. Can you expand on this portion?

>
> We thought maybe we could let upper layer applications choose whether
> to replicate their data instead of forcing replication at the whole
> rados pool scale.

The existing self-managed snapshot scheme already gives rbd image
granularity snapshots and cephfs recursive, subtree granularity
snapshots. The difference is that the versioning lives in the hobject_t
tuple -- each version is a different object with shared extents.

>
> Whether this approach can really achieve that goal, and whether to do
> it at all, is to be discussed, as we also realised that it may not be
> cost-effective with respect to the amount of development work :-)

I guess I'm not sure what this approach gets us that the existing
cloning scheme does not. The main problem with high snapshot rates
currently isn't that the ondisk structure doesn't support them, but
rather that snapshot stamps are mediated through the monitor. There are
reasons for doing it that way -- an rbd client need only issue a single
monitor command to get rid of a snapshot, and all involved osds will
remove the now unnecessary clones asynchronously without requiring the
client to track them down. Similarly, the mds needn't find every clone
within a subtree -- a potentially expensive operation.

I think what I'm missing is how this structure fits into some higher
level snapshot scheme you are proposing.
-Sam

>
> Thanks.
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
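
The extent-sharing model referred to above can be sketched roughly as
follows. This is a minimal, self-contained illustration, not the actual
Ceph or seastore data structures: the names ObjectKey, Extent and
ExtentMap are invented for the example, and shared_ptr use counts stand
in for the refcounts kept in the lba tree. It shows the head object and
a snapshot clone as two (name, snap) keys -- in the spirit of the
hobject_t tuple -- whose mappings share extents until the head is
overwritten.

// Sketch only -- invented types, not the real Ceph/seastore headers.
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <tuple>

// Stand-in for the (name, snap) part of the hobject_t tuple: the head
// and each clone are distinct objects differing only in the snap id.
struct ObjectKey {
  std::string name;
  uint64_t snap;  // large sentinel used for the head in this sketch;
                  // the real code uses a dedicated constant
  bool operator<(const ObjectKey& o) const {
    return std::tie(name, snap) < std::tie(o.name, o.snap);
  }
};

// A physical extent; shared_ptr's use_count plays the role of the
// per-extent refcount kept in the lba tree.
struct Extent {
  uint64_t paddr;   // pretend physical address
  uint32_t length;
};

// logical offset -> shared extent
using ExtentMap = std::map<uint64_t, std::shared_ptr<Extent>>;

int main() {
  std::map<ObjectKey, ExtentMap> objects;

  // Head object with two 4K extents.
  ObjectKey head{"rbd_data.1234", UINT64_MAX};
  objects[head][0]    = std::make_shared<Extent>(Extent{0x1000, 4096});
  objects[head][4096] = std::make_shared<Extent>(Extent{0x2000, 4096});

  // clone(): the clone for snap 5 gets the same mapping; copying the
  // shared_ptrs bumps the per-extent refcounts, no data is copied.
  ObjectKey clone{"rbd_data.1234", 5};
  objects[clone] = objects[head];

  // Overwrite offset 0 on the head after the snapshot: the head gets a
  // freshly allocated extent while the clone keeps the old one.
  objects[head][0] = std::make_shared<Extent>(Extent{0x9000, 4096});

  for (const auto& [key, extmap] : objects) {
    std::cout << key.name << " snap=" << key.snap << ":\n";
    for (const auto& [off, ext] : extmap)
      std::cout << "  off " << off << " -> paddr 0x" << std::hex
                << ext->paddr << std::dec
                << " (refs=" << ext.use_count() << ")\n";
  }
  return 0;
}

Built with any C++17 compiler, this prints both objects' mappings:
after the overwrite, the clone still maps offset 0 to the old extent
(refcount 1), while offset 4096 stays shared between head and clone
(refcount 2). Removing a snapshot in this model just drops the clone's
map, decrementing the refcounts so unshared extents can be reclaimed.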