On Fri, 13 Oct 2017, Gregory Farnum wrote:
> On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > Just had another thought last night: the mon can preserve the full
> > history of deletions, by epoch.  When the objecter encounters a map
> > gap it can request the removed_snaps over the gap period from the mon
> > at the same time it's getting the next map (i.e., the oldest full map
> > stored by the mon).  Since this is a pretty rare/exceptional thing, I
> > don't worry much about the extra work for the mon, and it avoids the
> > ugly client-must-crash behavior... a laggy client will always be able
> > to catch up.
>
> That seems useful and will probably work in a practical sense, but I'm
> still a bit concerned.  There's an in-built assumption here that the
> OSD map epoch of a client is a useful proxy for "has the correct set
> of snapids".  And... well, it's a *reasonable* proxy, especially if the
> Objecter starts trimming snapids.  But CephFS certainly doesn't have
> any explicit relationship (or even much of an implicit one) between
> the OSDMap epochs and the set of live snapshots.  I don't think RBD
> does either, although since it passes around snapids via header
> objects and watch-notify it might come closer?
>
> I'm tossing around in my head if there's some good way to attach a
> "valid as of this epoch" tag to snapcontexts generated by external
> systems.  All snapshot users *do* already maintain a snapid_t for
> versioning that they use; maybe we can tie into or extend it somehow?
> (A trivial but presumably too-slow implementation for CephFS could do
> something like, on every load of a SnapRealm in the MDS, validate the
> snap ids against the monitor's full list and attach the current osd
> epoch to it.)

Yeah, the "snapc valid as of this epoch" is exactly it.  Unfortunately
I'm not sure attaching anything to the snapc's snap_seq helps, since
that's the point in time when the snapc was last modified, but what we
actually want is a lower bound on when we know the snapc to be accurate
(i.e., to reflect any completed snap creates/deletes).

For librbd, maybe this is the epoch at which we load the head object...
*if* we also modify it to do a 2PC on snap removal.  Although this
problem probably isn't so bad since we have exclusive locking and the
exclusive lock holder is the one doing the snap deletions?

For CephFS... yeah.  IIRC the client only hears about SnapRealm changes
when they happen, but we actually want to roll the "valid as of epoch"
value forward even when no snaps are created or deleted.  Maybe it can
be baked into the client/mds heartbeat?

Anyway, in order to actually use the value once we have it, the
librados snapc methods would need to get an additional
valid_as_of_epoch argument, and that value would need to be refreshed
periodically.

...

In any case, in order to use the value properly we need to apply any
removed_snaps between the valid_as_of_epoch and the current epoch at
the time the op is submitted.  That means the OSDMap needs to keep a
window of recent removed_snaps that's big enough to do that, and/or the
librados user has to have a defined way of refreshing itself if its
snapc is too stale.

The bad news is that my branch removes removed_snaps entirely from the
OSDMap--we only have deltas.  I can add it back without much trouble.
The better news is that this means we no longer have to keep a wide
window between advertising removed vs purged snaps--we can declare them
purged as soon as they are purged.
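Roughly what I have in mind for the submit path, as a hypothetical,
self-contained sketch--the types are simplified to plain C++ and the
names (SnapCtx, apply_removed_since, RemovedDeltas) are illustrative,
not the real Objecter/SnapContext code:

// Hypothetical sketch: given a snapc the caller says is valid as of
// epoch E, drop any snaps that the per-epoch removed_snaps deltas
// record as removed in (E, current].  A map gap would be filled by
// asking the mon for the missing deltas, per the idea above.
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using snapid = uint64_t;
using epoch_t = uint32_t;

struct SnapCtx {
  snapid seq = 0;
  std::vector<snapid> snaps;        // newest first
  epoch_t valid_as_of_epoch = 0;    // lower bound on accuracy
};

// per-epoch removed_snaps deltas the client has seen
using RemovedDeltas = std::map<epoch_t, std::set<snapid>>;

void apply_removed_since(SnapCtx& snapc, epoch_t cur_epoch,
                         const RemovedDeltas& deltas) {
  std::vector<snapid> out;
  for (snapid s : snapc.snaps) {
    bool removed = false;
    // only scan the deltas newer than the snapc's lower bound
    for (auto it = deltas.upper_bound(snapc.valid_as_of_epoch);
         it != deltas.end() && it->first <= cur_epoch; ++it) {
      if (it->second.count(s)) { removed = true; break; }
    }
    if (!removed)
      out.push_back(s);
  }
  snapc.snaps.swap(out);
  snapc.valid_as_of_epoch = cur_epoch;
}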
Which has the added benefit of letting us explicitly track/monitor the
'removing' set of snaps on the mon, which seems like something that
will be of interest to ops folks.

So... yeah, this is an open issue.  I think what I have now will mostly
work, but it may leak clones occasionally if clients are super stale.
Maybe that's not a big deal...

> Moving on to the stuff actually written down:
> How comfortable are we with the size of the currently-deleting
> snapshot maps, for computation purposes?  I don't have a good way of
> quantifying that cost but I'm definitely tempted to split the sets
> into
>   newly_deleted_snaps (for *this epoch*)
>   deleting_snaps (which are kept around until removed_snaps_lb_epoch)
>   newly_purged_snaps (also for this epoch, which I think is how you
>   have it written?)

The code I have now has 2 sets: removed_snaps and purged_snaps.
purged_snaps was already there before; removed_snaps is the superset
that includes stuff not yet purged.  (I made it a superset thinking
that the filter_snapc() thing on every op will be faster if it only has
to filter against a single set instead of 2 of them, and that that
matters more than the memory.  See the rough sketch at the end of this
mail.)

> There are also two questions down at the bottom.  For (1) I think
> it's good to keep the deleted snaps set for all time (always good for
> debugging!), but we need to be careful: if there is a divergence
> between RADOS' metadata and those of RBD or CephFS, we need a
> mechanism for re-deleting snaps even if they were already zapped.

I'm preserving it for all time, but yeah, there's no mechanism to force
a re-removal.  Hrm.

> For (2), yes the removed_snaps_lb_epoch should be per-pool, not
> global.  We don't have any other global snap data, why would we
> introduce a linkage?  :)

Yep!

Thanks-
sage
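P.S.  The rough sketch mentioned above of the "filter against a single
set" idea.  It's standalone and hypothetical: a plain std::set stands
in for the interval set, and the signature isn't the real
filter_snapc()--it's just meant to show that with one superset the
per-op filter is a single lookup per snap instead of two.

// Hypothetical sketch: removed_snaps here is the superset of
// everything deleted, purged or not, so each snap needs one lookup.
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

using snapid = uint64_t;

void filter_snapc(std::vector<snapid>& snaps,
                  const std::set<snapid>& removed_snaps) {
  snaps.erase(std::remove_if(snaps.begin(), snaps.end(),
                             [&](snapid s) {
                               return removed_snaps.count(s) > 0;
                             }),
              snaps.end());
}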