On Fri, 28 Jun 2019, Gregory Farnum wrote:
> On Fri, Jun 28, 2019 at 7:50 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > Hi Myoungwon,
> >
> > I was thinking about how a refcounted cas pool would interact with
> > snapshots and it occurred to me that dropping refs when an object is
> > deleted may break snapshotted versions of that object. If object A has
> > a ref to chunk X, is snapshotted, then A is deleted, we'll (currently)
> > drop the ref to X and remove it. That means that A can't be read.
> >
> > One way to get around that would be to mirror snaps from the source pool
> > to the chunk pool--this is how cache tiering works. The problem I see
> > there is that I'd hoped to allow multiple pools to share/consume the same
> > chunk pool, but each pool has its own snapid namespace.
> >
> > Another would be to bake the refs more deeply into the source rados pool
> > so that the refs are only dropped after all clones also drop the ref.
> > That is harder to track, though, since I think you'd need to examine all
> > of the clones to know whether the ref is truly gone. Unless we embed
> > even more metadata in the SnapSet--something analogous to clone_overlap
> > for identifying the chunks. That seems like it will bloat that structure,
> > though.
> >
> > Other ideas?
>
> Is there much design work around refcounting and snapshots yet?
>
> I haven't thought it through much but one possibility is that each
> on-disk clone counts as its own reference, and on a write to the
> manifest object you increment the reference to all the chunks in
> common. When snaptrimming finally removes a clone, it has to decrement
> all the chunk references contained in the manifest.
>
> I don't love this for the extra trimming work and remote reference
> updates, but it's one way to keep the complexity of the data
> structures down.
>
> Other options:
> * Force 1:1 mapping. Not sure how good or bad this is since I haven't
> seen a lot of CAS pool discussion.

This is implemented by https://github.com/ceph/ceph/pull/29283.

The concern I have with this approach is that any write that triggers a
clone creation may need to block while all of the ref counts for the
clone are incremented. This is slow, and also introduces one more window
for an OSD failure to lead to leaked references (not critical but not
great either).

Here's a new idea:

Currently all of the write operations populate the OpContext
modified_ranges map, which is then subtracted from the most recent
clone's clone_overlap in the SnapSet. We could use that to take one of
two paths:

1) If a newly dereferenced (by head) chunk overlaps with the most recent
   clone, do nothing--that clone still has a reference to it.
2) If a newly dereferenced (by head) chunk does NOT overlap with the most
   recent clone, then head was the only referent, and we can decrement
   the ref after we apply the update (like we do today).

Then, trim_object() needs to be smart. When a clone is removed, it needs
to compare the clone's chunks to the adjacent clones or head, and make a
similar determination of whether the chunk reference is unique to the
clone or shared by one of its neighbors.

I think this is possible by inspecting *only* the clone_overlap, which is
in the SnapSet, and already always present in memory.

What do you think?
sage

> * no longer giving each pool its own snapshot namespace. Not sure this
> was a great design decision to begin with; would require updating
> CephFS snap allocation but I don't think anything else outside the
> monitors.
> * Disallowing snapshots on manifest-based objects/pools. What are the
> target workloads for these?
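
To make the clone_overlap idea above concrete, here is a rough Python
sketch of the two decisions (the write path and the trim path). It is
not Ceph code: the function names, the (offset, length) range tuples,
and the chunk-id dicts are all hypothetical, and it assumes chunk
boundaries line up with the overlap ranges, so that "overlaps" and "is
fully shared" mean the same thing. It also assumes that a clone's
clone_overlap entry describes what it shares with its next-newer
neighbor (the next clone, or head).

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (offset, length); hypothetical stand-in for an interval_set entry


def intersects(r: Range, ranges: List[Range]) -> bool:
    """True if r shares at least one byte with any range in ranges."""
    off, length = r
    return any(off < o + l and o < off + length for o, l in ranges)


def refs_to_drop_on_write(dereferenced: Dict[str, Range],
                          newest_clone_overlap: List[Range]) -> List[str]:
    """Chunks head just stopped referencing whose ref can be decremented.

    newest_clone_overlap is the most recent clone's overlap with head as
    it was *before* modified_ranges is subtracted from it.
    """
    return [cid for cid, r in dereferenced.items()
            if not intersects(r, newest_clone_overlap)]


def refs_to_drop_on_trim(clone_chunks: Dict[str, Range],
                         overlap_with_older: List[Range],
                         overlap_with_newer: List[Range]) -> List[str]:
    """Chunks referenced only by the clone being trimmed.

    overlap_with_older: what the next-older clone shares with this clone
    (empty if there is none). overlap_with_newer: what this clone shares
    with the next-newer clone or head.
    """
    return [cid for cid, r in clone_chunks.items()
            if not intersects(r, overlap_with_older)
            and not intersects(r, overlap_with_newer)]


# Write path: head overwrites chunks X and Y; the newest clone still shares X's range.
print(refs_to_drop_on_write({"X": (0, 4096), "Y": (4096, 4096)}, [(0, 4096)]))

# Trim path: the clone shares X with its newer neighbor; Z is unique to it.
print(refs_to_drop_on_trim({"X": (0, 4096), "Z": (8192, 4096)}, [], [(0, 4096)]))
```

In both cases only in-memory range comparisons are needed; no other
clone's manifest has to be read to decide whether a ref can go.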
> -Greg
>
> >
> > sage
> > _______________________________________________
> > Dev mailing list -- dev@xxxxxxx
> > To unsubscribe send an email to dev-leave@xxxxxxx