On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> Just had another thought last night: the mon can preserve the full history
> of deletions, by epoch. When the objecter encounters a map gap it can
> request the removed_snaps over the gap period from the mon at the same
> time it's getting the next map (i.e., the oldest full map stored by
> the mon). Since this is a pretty rare/exceptional thing, I don't
> worry much about the extra work for the mon, and it avoids the ugly
> client-must-crash behavior... a laggy client will always be able to catch
> up.

That seems useful and will probably work in a practical sense, but I'm
still a bit concerned. There's an in-built assumption here that the
OSDMap epoch of a client is a useful proxy for "has the correct set of
snapids". And... well, it's a *reasonable* proxy, especially if the
Objecter starts trimming snapids. But CephFS certainly doesn't have any
explicit relationship (or even much of an implicit one) between OSDMap
epochs and the set of live snapshots. I don't think RBD does either,
although since it passes around snapids via header objects and
watch-notify it might come closer?

I'm tossing around in my head whether there's some good way to attach a
"valid as of this epoch" tag to snapcontexts generated by external
systems. All snapshot users *do* already maintain a snapid_t for
versioning; maybe we can tie into or extend that somehow? (A trivial
but presumably too-slow implementation for CephFS could, on every load
of a SnapRealm in the MDS, validate the snap ids against the monitor's
full list and attach the current OSD epoch to it.)

Moving on to the stuff actually written down:

How comfortable are we with the size of the currently-deleting snapshot
sets, for computation purposes? I don't have a good way of quantifying
that cost, but I'm definitely tempted to split the sets into:

  newly_deleted_snaps (for *this* epoch)
  deleting_snaps (which are kept around until removed_snaps_lb_epoch)
  newly_purged_snaps (also for this epoch, which I think is how you
  have it written?)

There are also two questions down at the bottom. For (1), I think it's
good to keep the deleted-snaps set for all time (always good for
debugging!), but we need to be careful: if there is a divergence
between RADOS' metadata and that of RBD or CephFS, we need a mechanism
for re-deleting snaps even if they were already zapped.

For (2), yes, removed_snaps_lb_epoch should be per-pool, not global. We
don't have any other global snap data; why would we introduce a
linkage? :)
-Greg
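
P.S. To make the three-way split concrete, here's a rough per-pool
sketch. The struct and member function names are mine, purely for
illustration; nothing here is actual Ceph code, and a real
implementation would more likely use interval_set<snapid_t> than
std::set. Treat it as a sketch of the bookkeeping, not a proposed
patch:

  #include <cstdint>
  #include <set>

  typedef uint64_t snapid_t;
  typedef uint32_t epoch_t;

  // Hypothetical per-pool snap-deletion state, split as suggested
  // above.
  struct pool_snap_deletion_state_t {
    // deletions requested in *this* epoch's incremental only
    std::set<snapid_t> newly_deleted_snaps;

    // everything still being purged by the OSDs; retained until the
    // pool's removed_snaps_lb_epoch advances past it
    std::set<snapid_t> deleting_snaps;

    // snaps the OSDs finished purging in *this* epoch's incremental
    std::set<snapid_t> newly_purged_snaps;

    // per-pool lower bound (question 2): clients at or beyond this
    // epoch are assumed to have seen all deletions recorded so far
    epoch_t removed_snaps_lb_epoch = 0;

    // fold one epoch's incremental into the cumulative state: the
    // per-epoch sets are replaced, while deleting_snaps accumulates
    // new deletions and drops anything now fully purged
    void apply_incremental(const std::set<snapid_t>& deleted,
                           const std::set<snapid_t>& purged) {
      newly_deleted_snaps = deleted;
      newly_purged_snaps = purged;
      deleting_snaps.insert(deleted.begin(), deleted.end());
      for (snapid_t s : purged)
        deleting_snaps.erase(s);
    }
  };

The point of keeping the two per-epoch sets separate from the
cumulative deleting_snaps is that the incremental-map work stays
proportional to this epoch's changes, not to the total number of
in-flight deletions.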