On Fri, 13 Oct 2017, Gregory Farnum wrote:
> On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > Just had another thought last night: the mon can preserve the full
> > history of deletions, by epoch.  When the objecter encounters a map
> > gap it can request the removed_snaps over the gap period from the mon
> > at the same time it's getting the next map (i.e., the oldest full map
> > stored by the mon).  Since this is a pretty rare/exceptional thing, I
> > don't worry much about the extra work for the mon, and it avoids the
> > ugly client-must-crash behavior... a laggy client will always be able
> > to catch up.
>
> That seems useful and will probably work in a practical sense, but I'm
> still a bit concerned.  There's an in-built assumption here that the
> OSD map epoch of a client is a useful proxy for "has the correct set
> of snapids".  And... well, it's a *reasonable* proxy, especially if the
> Objecter starts trimming snapids.  But CephFS certainly doesn't have
> any explicit relationship (or even much of an implicit one) between
> the OSDMap epochs and the set of live snapshots.  I don't think RBD
> does either, although since it passes around snapids via header
> objects and watch-notify it might come closer?
>
> I'm tossing around in my head if there's some good way to attach a
> "valid as of this epoch" tag to snapcontexts generated by external
> systems.  All snapshot users *do* already maintain a snapid_t for
> versioning that they use; maybe we can tie into or extend it somehow?
> (A trivial but presumably too-slow implementation for CephFS could do
> something like, on every load of a SnapRealm in the MDS, validate the
> snap ids against the monitor's full list and attach the current osd
> epoch to it.)

Yeah, the "snapc valid as of this epoch" is exactly it.  Unfortunately
I'm not sure attaching anything to the snapc's snap_seq helps, since
that's the point in time when the snapc was last modified, but what we
actually want is a lower bound on when we know the snapc to be accurate
(i.e., to reflect any completed snap creates/deletes).

For librbd, maybe this is the epoch at which we load the head object...
*if* we also modify it to do a 2PC on snap removal.  Although this
problem probably isn't so bad since we have exclusive locking and the
exclusive lock holder is the one doing the snap deletions?

For CephFS... yeah.  IIRC the client only hears about SnapRealm changes
when they happen, but we actually want to roll the "valid as of epoch"
value forward even when no snaps are created or deleted.  Maybe it can
be baked into the client/mds heartbeat?

Anyway, in order to actually use the value once we have it, the
librados snapc methods would need to get an additional
valid_as_of_epoch argument, and that value would need to be refreshed
periodically.

...

In any case, in order to use the value properly we need to apply any
removed_snaps between the valid_as_of_epoch and the current epoch at
the time the op is submitted.  That means the OSDMap needs to keep a
window of recent removed_snaps that's big enough to do that, and/or the
librados user has to have a defined way of refreshing itself if its
snapc is too stale.

The bad news is that my branch removes removed_snaps entirely from the
OSDMap--we only have deltas.  I can add it back without much trouble.
The better news is that this means we no longer have to keep a wide
window between advertising removed vs purged snaps--we can declare them
purged as soon as they are purged.
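Roughly what I have in mind for the submit path, as a hypothetical,
self-contained sketch--the types are simplified to plain C++ and the
names (SnapCtx, apply_removed_since, RemovedDeltas) are illustrative,
not the real Objecter/SnapContext code:

// Hypothetical sketch: given a snapc the caller says is valid as of
// epoch E, drop any snaps that the per-epoch removed_snaps deltas
// record as removed in (E, current].  A map gap would be filled by
// asking the mon for the missing deltas, per the idea above.
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using snapid = uint64_t;
using epoch_t = uint32_t;

struct SnapCtx {
  snapid seq = 0;
  std::vector<snapid> snaps;        // newest first
  epoch_t valid_as_of_epoch = 0;    // lower bound on accuracy
};

// per-epoch removed_snaps deltas the client has seen
using RemovedDeltas = std::map<epoch_t, std::set<snapid>>;

void apply_removed_since(SnapCtx& snapc, epoch_t cur_epoch,
                         const RemovedDeltas& deltas) {
  std::vector<snapid> out;
  for (snapid s : snapc.snaps) {
    bool removed = false;
    // only scan the deltas newer than the snapc's lower bound
    for (auto it = deltas.upper_bound(snapc.valid_as_of_epoch);
         it != deltas.end() && it->first <= cur_epoch; ++it) {
      if (it->second.count(s)) { removed = true; break; }
    }
    if (!removed)
      out.push_back(s);
  }
  snapc.snaps.swap(out);
  snapc.valid_as_of_epoch = cur_epoch;
}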
Which has the added benefit of letting us explicitly track/monitor the
'removing' set of snaps on the mon, which seems like something that
will be of interest to ops folks.

So... yeah, this is an open issue.  I think what I have now will mostly
work, but it may leak clones occasionally if clients are super stale.
Maybe that's not a big deal...

> Moving on to the stuff actually written down:
> How comfortable are we with the size of the currently-deleting
> snapshot maps, for computation purposes?  I don't have a good way of
> quantifying that cost but I'm definitely tempted to split the sets
> into
>   newly_deleted_snaps (for *this epoch*)
>   deleting_snaps (which are kept around until removed_snaps_lb_epoch)
>   newly_purged_snaps (also for this epoch, which I think is how you
>   have it written?)

The code I have now has 2 sets: removed_snaps and purged_snaps.
purged_snaps was already there before; removed_snaps is the superset
that includes stuff not yet purged.  (I made it a superset thinking
that the filter_snapc() thing on every op will be faster if it only has
to filter against a single set instead of 2 of them, and that that
matters more than the memory.  See the rough sketch at the end of this
mail.)

> There are also two questions down at the bottom.  For (1) I think
> it's good to keep the deleted snaps set for all time (always good for
> debugging!), but we need to be careful: if there is a divergence
> between RADOS' metadata and those of RBD or CephFS, we need a
> mechanism for re-deleting snaps even if they were already zapped.

I'm preserving it for all time, but yeah, there's no mechanism to force
a re-removal.  Hrm.

> For (2), yes the removed_snaps_lb_epoch should be per-pool, not
> global.  We don't have any other global snap data, why would we
> introduce a linkage?  :)

Yep!

Thanks-
sage
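P.S.  The rough sketch mentioned above of the "filter against a single
set" idea.  It's standalone and hypothetical: a plain std::set stands
in for the interval set, and the signature isn't the real
filter_snapc()--it's just meant to show that with one superset the
per-op filter is a single lookup per snap instead of two.

// Hypothetical sketch: removed_snaps here is the superset of
// everything deleted, purged or not, so each snap needs one lookup.
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

using snapid = uint64_t;

void filter_snapc(std::vector<snapid>& snaps,
                  const std::set<snapid>& removed_snaps) {
  snaps.erase(std::remove_if(snaps.begin(), snaps.end(),
                             [&](snapid s) {
                               return removed_snaps.count(s) > 0;
                             }),
              snaps.end());
}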