Based on this discussion on the delayed deletes, has there already been any previous discussion/effort on expired TTL'd deletes? This would align Ceph with other distributed (or SSTable-based) systems, like Cassandra, which mark data as deleted (aka tombstones) and then consolidate them during scheduled overnight/off-peak timeframes (usually at the compaction stage).

Ernesto

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
> Sent: Wednesday, May 24, 2017 22:47
> To: Sage Weil <sweil@xxxxxxxxxx>; John Spray <jspray@xxxxxxxxxx>
> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: RADOS: Deleting all objects in a namespace
>
> On Wed, May 24, 2017 at 6:32 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > On Wed, 24 May 2017, John Spray wrote:
> > > On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > On Tue, May 23, 2017 at 6:10 AM John Spray <jspray@xxxxxxxxxx> wrote:
> > > >>
> > > >> Soon, we'll probably be letting multiple CephFS filesystems use the same data and metadata pools, where the filesystems are separated by rados namespace.
> > > >>
> > > >> When removing filesystems, in the interests of robustness and speed, I'd like to be able to delete all objects in a namespace -- otherwise we would have to rely on a "rm -rf" and then some new code to explicitly enumerate and delete all metadata objects for that filesystem.
> > > >>
> > > >> I'm pondering whether this should just be a process that happens via normal client interfaces (where mon/mgr would be the client), or whether it would be feasible/desirable to implement something inside the OSD. Obviously the OSD ultimately has to do the same underlying enumeration, but at least it doesn't have to thrash through the whole request/response cycle for deleting each object individually -- which might enable it to throttle internally in the OSD based on how busy it knows itself to be, rather than having the client apply some arbitrary "only issue N deletions at once" type limit that might make the deletion process unnecessarily slow.
> > > >>
> > > >> I have a feeling we must have talked about this at some point but my memory is failing me...
> > > >
> > > > This is an interesting thought and I don't think it's been discussed before. Some random things:
> > > > 1) We tend to suck at throttling stuff in the OSD (although we're getting close to having the right pattern)
> > > > 2) Deleting objects is a *lot* more expensive than listing them; David's right that it would probably look like pgnls, but it's definitely not analogous. For the listing we just go through omap keys, but here we have to shove deletes into XFS (...for now. I suppose they're actually pretty similar in BlueStore, though I don't really know the cost of a delete)
> > > > 3) If this happens inside the OSD it will be much harder to bill it against client IO. Not sure if that's relevant given it's removal, but...
> > > > 4) I'm actually not sure how we could best do this internally if we wanted to. Deletes, as noted, would have to take a while, which means it would probably be a very long-lived operation -- much more than e.g. 30 seconds.
> > > > Or else we'd need a whole internal queue of stuff to delete, where the client "op" is just queueing up the namespace and the actual deletes get spaced out much later... and then that would imply more new checks around whether something is logically deleted even though it still exists in our data store. (And what happens if a user starts putting new objects in the namespace while the delete is still going on? Is that blocked somehow? Or does the OSD need to handle it with a way of distinguishing between namespace "epoch" 1 and 2?)
> > >
> > > Yeah, all those things are why this is a "hmmm" mailing list thread rather than a PR :-)
> > >
> > > > Overall I definitely see the appeal of wanting to do gross deletes like that, but it'll be at least a little trickier than I suspect people are considering. Given the semantic confusion and implied versus real costs, I'm generally not really a fan of allowing stuff like this on units other than pools/PGs, and I'm not sure when the bandwidth to implement it might come up. How important do you think it really is compared to doing "rm -rf"?
> > >
> > > The trouble with the literal rm -rf is that it requires e.g. the MDS to be up and healthy, requires that the deletion code definitely has no bugs, that there were definitely no stray objects in the filesystem, etc. Orchestrating it in a way that ensured it was done before we permitted the filesystem to be removed would probably mean adding a new "destroying" state to MDSs and having them do the work, rather than accepting a client's promise that it had done it.
> > >
> > > The intermediate option where we're doing a pgnls and then sending a delete for each object (probably from the mgr) is less evil; my concerns with that one are just that we're sending a huge number of ops to do one logical thing, and that it might be slower than necessary from the user's point of view. I suppose we could hide the latency from the user by having a "trash" list in the FSMap, but that will cause confusion when they look at their df stats (which people seem to do quite a lot).
> > >
> > > Doing the pgnls + O(N) deletes from the mgr is, I suppose, going to at least have some commonality with what we would be doing in a nicely orchestrated backward scrub, so it's not completely crazy. People would still appreciate how much faster it was than doing a rm -rf.
> >
> > I think the most reasonable way to do something like this would be to have a "trash ns queue" list broadcast to all OSDs, and have them do the cleanup during scrub (when they're already iterating over the namespace). OSDs would track a lower bound on a trash sequence number that has been scrubbed and applied (or similar) so that delete queue items get retired. (We need to do something similar with the snap trimming so that the delete list isn't published for all time in the OSDMap.)
> >
> > Scrub is already initiating updates for a couple of different reasons (to update digests, and to convert legacy snapsets). It's worked well so far, although deletion will be a bit more expensive than the current updates, which are all xattr-only.
>
> We discussed this a bit more in the RADOS standup.
> Sage remains interested in the problem, but it sounded to me like the consensus was that we shouldn't move forward on this:
> 1) Doing the delete *outside* of scrub is basically making the RADOS cluster look at {M} data to do operations on {N} (where sizeof(N) <<<< sizeof(M)). Scrub at least merges it into work we're already doing, but...
> 2) Doing the delete during scrub makes the throttling/scheduling dramatically harder.
> 3) Doing asynchronous operations of any kind is a giant foot-gun for our API users; obviously our in-tree systems are going to know they can't write to a namespace after deleting it, but you just *know* that somebody is going to use it as a shortcut for "delete all my data" and then start writing to it immediately after the acknowledgement commit returns. So we probably need to introduce epoch versioning of namespaces (nearly impossible?) and certainly need async operation reporting. Lots of complexity.
> 4) There don't seem to be any other users, and it seems like a lot more work to do this in the OSD than to have the MDS or manager orchestrate it somehow.
>
> Of course, now it's written down so people can tell me my understanding was wrong... ;)
> -Greg
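
For concreteness, here is a minimal sketch of what John's "intermediate option" -- enumerate the namespace and send a delete per object from a client such as the mgr -- could look like using the librados Python bindings. The pool name, namespace name, concurrency limit, and function name below are hypothetical and invented for illustration; this is not code from the Ceph tree.

# Sketch: purge one rados namespace from the client side by listing its
# objects and issuing per-object deletes, with a fixed cap on how many
# deletes are in flight at once (the arbitrary client-side "only issue N
# deletions at once" throttle discussed above).

import threading
import rados

POOL = "cephfs_metadata"     # hypothetical pool holding the filesystem's objects
NAMESPACE = "fs_to_remove"   # hypothetical namespace of the filesystem being removed
MAX_IN_FLIGHT = 128          # arbitrary cap on concurrent deletes

def purge_namespace(conffile="/etc/ceph/ceph.conf"):
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            # Restrict both the listing and the deletes to the one namespace.
            ioctx.set_namespace(NAMESPACE)
            slots = threading.Semaphore(MAX_IN_FLIGHT)

            def on_complete(completion):
                # Called by librados when one delete finishes; free a slot.
                slots.release()

            for obj in ioctx.list_objects():
                slots.acquire()  # block while MAX_IN_FLIGHT deletes are pending
                ioctx.aio_remove(obj.key, oncomplete=on_complete)

            # Drain: once we can take back every slot, all deletes have completed.
            for _ in range(MAX_IN_FLIGHT):
                slots.acquire()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    purge_namespace()

Even throttled like this, the client still sends one op per object, which is exactly the "huge number of ops to do one logical thing" cost John raises and which the OSD-internal options (scrub-time cleanup or a trash-namespace queue) were meant to avoid.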