On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> Just had another thought last night: the mon can preserve the full history
> of deletions, by epoch. When the objecter encounters a map gap it can
> request the removed_snaps over the gap period from the mon at the same
> time it's getting the next map (i.e., the oldest full map stored by
> the mon). Since this is a pretty rare/exceptional thing, I don't
> worry much about the extra work for the mon, and it avoids the ugly
> client-must-crash behavior... a laggy client will always be able to catch
> up.

That seems useful and will probably work in a practical sense, but I'm
still a bit concerned. There's an in-built assumption here that the
OSDMap epoch of a client is a useful proxy for "has the correct set of
snapids". And... well, it's a *reasonable* proxy, especially if the
Objecter starts trimming snapids. But CephFS certainly doesn't have any
explicit relationship (or even much of an implicit one) between OSDMap
epochs and the set of live snapshots. I don't think RBD does either,
although since it passes around snapids via header objects and
watch-notify it might come closer?

I'm tossing around in my head whether there's some good way to attach a
"valid as of this epoch" tag to snapcontexts generated by external
systems. All snapshot users *do* already maintain a snapid_t for
versioning; maybe we can tie into or extend that somehow? (A trivial
but presumably too-slow implementation for CephFS could, on every load
of a SnapRealm in the MDS, validate the snap ids against the monitor's
full list and attach the current OSD epoch to it.)

Moving on to the stuff actually written down:

How comfortable are we with the size of the currently-deleting snapshot
sets, for computation purposes? I don't have a good way of quantifying
that cost, but I'm definitely tempted to split the sets into:

  newly_deleted_snaps (for *this* epoch)
  deleting_snaps (which are kept around until removed_snaps_lb_epoch)
  newly_purged_snaps (also for this epoch, which I think is how you
  have it written?)

There are also two questions down at the bottom. For (1), I think it's
good to keep the deleted-snaps set for all time (always good for
debugging!), but we need to be careful: if there is a divergence
between RADOS' metadata and that of RBD or CephFS, we need a mechanism
for re-deleting snaps even if they were already zapped.

For (2), yes, removed_snaps_lb_epoch should be per-pool, not global. We
don't have any other global snap data; why would we introduce a
linkage? :)
-Greg
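
P.S. To make the three-way split concrete, here's a rough per-pool
sketch. The struct and member function names are mine, purely for
illustration; nothing here is actual Ceph code, and a real
implementation would more likely use interval_set<snapid_t> than
std::set. Treat it as a sketch of the bookkeeping, not a proposed
patch:

  #include <cstdint>
  #include <set>

  typedef uint64_t snapid_t;
  typedef uint32_t epoch_t;

  // Hypothetical per-pool snap-deletion state, split as suggested
  // above.
  struct pool_snap_deletion_state_t {
    // deletions requested in *this* epoch's incremental only
    std::set<snapid_t> newly_deleted_snaps;

    // everything still being purged by the OSDs; retained until the
    // pool's removed_snaps_lb_epoch advances past it
    std::set<snapid_t> deleting_snaps;

    // snaps the OSDs finished purging in *this* epoch's incremental
    std::set<snapid_t> newly_purged_snaps;

    // per-pool lower bound (question 2): clients at or beyond this
    // epoch are assumed to have seen all deletions recorded so far
    epoch_t removed_snaps_lb_epoch = 0;

    // fold one epoch's incremental into the cumulative state: the
    // per-epoch sets are replaced, while deleting_snaps accumulates
    // new deletions and drops anything now fully purged
    void apply_incremental(const std::set<snapid_t>& deleted,
                           const std::set<snapid_t>& purged) {
      newly_deleted_snaps = deleted;
      newly_purged_snaps = purged;
      deleting_snaps.insert(deleted.begin(), deleted.end());
      for (snapid_t s : purged)
        deleting_snaps.erase(s);
    }
  };

The point of keeping the two per-epoch sets separate from the
cumulative deleting_snaps is that the incremental-map work stays
proportional to this epoch's changes, not to the total number of
in-flight deletions.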