Based on this discussion on the delayed deletes, has there already been any previous discussion/effort on expired TTL'd deletes? This would align Ceph with other distributed (or SSTable-based) systems, like Cassandra, which mark data as deleted (aka tombstones) and then consolidate them during scheduled overnight/off-peak timeframes (usually at the compaction stage).

Ernesto

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
> Sent: Wednesday, May 24, 2017 22:47
> To: Sage Weil <sweil@xxxxxxxxxx>; John Spray <jspray@xxxxxxxxxx>
> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: RADOS: Deleting all objects in a namespace
>
> On Wed, May 24, 2017 at 6:32 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > On Wed, 24 May 2017, John Spray wrote:
> > > On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > On Tue, May 23, 2017 at 6:10 AM John Spray <jspray@xxxxxxxxxx> wrote:
> > > >>
> > > >> Soon, we'll probably be letting multiple CephFS filesystems use the same data and metadata pools, where the filesystems are separated by rados namespace.
> > > >>
> > > >> When removing filesystems, in the interests of robustness and speed, I'd like to be able to delete all objects in a namespace -- otherwise we would have to rely on a "rm -rf" and then some new code to explicitly enumerate and delete all metadata objects for that filesystem.
> > > >>
> > > >> I'm pondering whether this should just be a process that happens via normal client interfaces (where mon/mgr would be the client), or whether it would be feasible/desirable to implement something inside the OSD. Obviously the OSD ultimately has to do the same underlying enumeration, but at least it doesn't have to thrash through the whole request/response cycle for deleting each object individually -- which might enable it to throttle internally in the OSD based on how busy it knows itself to be, rather than having the client apply some arbitrary "only issue N deletions at once" type limit that might make the deletion process unnecessarily slow.
> > > >>
> > > >> I have a feeling we must have talked about this at some point but my memory is failing me...
> > > >
> > > > This is an interesting thought and I don't think it's been discussed before. Some random things:
> > > > 1) We tend to suck at throttling stuff in the OSD (although we're getting close to having the right pattern)
> > > > 2) Deleting objects is a *lot* more expensive than listing them; David's right that it would probably look like pgnls, but it's definitely not analogous. For the listing we just go through omap keys, but here we have to shove deletes into XFS (...for now. I suppose they're actually pretty similar in BlueStore, though I don't really know the cost of a delete)
> > > > 3) If this happens inside the OSD it will be much harder to bill it against client IO. Not sure if that's relevant given it's removal, but...
> > > > 4) I'm actually not sure how we could best do this internally if we wanted to. Deletes, as noted, would have to take a while, which means it would probably be a very long-lived operation -- much more than e.g. 30 seconds.
> > > > Or else we'd need a whole internal queue of stuff to delete, where the client "op" is just queueing up the namespace and the actual deletes get spaced out much later... and then that would imply more new checks around whether something is logically deleted even though it still exists in our data store. (And what happens if a user starts putting new objects in the namespace while the delete is still going on? Is that blocked somehow? Or does the OSD need to handle it with a way of distinguishing between namespace "epoch" 1 and 2?)
> > >
> > > Yeah, all those things are why this is a "hmmm" mailing list thread rather than a PR :-)
> > >
> > > > Overall I definitely see the appeal of wanting to do gross deletes like that, but it'll be at least a little trickier than I suspect people are considering. Given the semantic confusion and implied versus real costs, I'm generally not really a fan of allowing stuff like this on units other than pools/PGs, and I'm not sure when the bandwidth to implement it might come up. How important do you think it really is compared to doing "rm -rf"?
> > >
> > > The trouble with the literal rm -rf is that it requires e.g. the MDS to be up and healthy, requires that the deletion code definitely has no bugs, that there were definitely no stray objects in the filesystem, etc. Orchestrating it in a way that ensured it was done before we permitted the filesystem to be removed would probably mean adding a new "destroying" state to MDSs and having them do the work, rather than accepting a client's promise that it had done it.
> > >
> > > The intermediate option where we're doing a pgnls and then sending a delete for each object (probably from the mgr) is less evil; my concerns with that one are just that we're sending a huge number of ops to do one logical thing, and that it might be slower than necessary from the user's point of view. I suppose we could hide the latency from the user by having a "trash" list in the FSMap, but that will cause confusion when they look at their df stats (which people seem to do quite a lot).
> > >
> > > Doing the pgnls + O(N) deletes from the mgr is, I suppose, going to at least have some commonality with what we would be doing in a nicely orchestrated backward scrub, so it's not completely crazy. People would still appreciate how much faster it was than doing a rm -rf.
> >
> > I think the most reasonable way to do something like this would be to have a "trash ns queue" list broadcast to all OSDs, and have them do the cleanup during scrub (when they're already iterating over the namespace). OSDs would track a lower bound on a trash sequence number that has been scrubbed and applied (or similar) so that delete queue items get retired. (We need to do something similar with the snap trimming so that the delete list isn't published for all time in the OSDMap.)
> >
> > Scrub is already initiating updates for a couple of different reasons (to update digests, and to convert legacy snapsets). It's worked well so far, although deletion will be a bit more expensive than the current updates, which are all xattr-only.
>
> We discussed this a bit more in the RADOS standup.
> Sage remains interested in the problem, but it sounded to me like the consensus was that we shouldn't move forward on this:
> 1) Doing the delete *outside* of scrub is basically making the RADOS cluster look at {M} data to do operations on {N} (where sizeof(N) <<<< sizeof(M)). Scrub at least merges it into work we're already doing, but...
> 2) Doing the delete during scrub makes the throttling/scheduling dramatically harder.
> 3) Doing asynchronous operations of any kind is a giant foot-gun for our API users; obviously our in-tree systems are going to know they can't write to a namespace after deleting it, but you just *know* that somebody is going to use it as a shortcut for "delete all my data" and then start writing to it immediately after the acknowledgement commit returns. So we probably need to introduce epoch versioning of namespaces (nearly impossible?) and certainly need async operation reporting. Lots of complexity.
> 4) There don't seem to be any other users, and it seems like a lot more work to do this in the OSD than to have the MDS or manager orchestrate it somehow.
>
> Of course, now it's written down so people can tell me my understanding was wrong... ;)
> -Greg
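
For concreteness, here is a minimal sketch of what John's "intermediate option" -- enumerate the namespace and send a delete per object from a client such as the mgr -- could look like using the librados Python bindings. The pool name, namespace name, concurrency limit, and function name below are hypothetical and invented for illustration; this is not code from the Ceph tree.

# Sketch: purge one rados namespace from the client side by listing its
# objects and issuing per-object deletes, with a fixed cap on how many
# deletes are in flight at once (the arbitrary client-side "only issue N
# deletions at once" throttle discussed above).

import threading
import rados

POOL = "cephfs_metadata"     # hypothetical pool holding the filesystem's objects
NAMESPACE = "fs_to_remove"   # hypothetical namespace of the filesystem being removed
MAX_IN_FLIGHT = 128          # arbitrary cap on concurrent deletes

def purge_namespace(conffile="/etc/ceph/ceph.conf"):
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            # Restrict both the listing and the deletes to the one namespace.
            ioctx.set_namespace(NAMESPACE)
            slots = threading.Semaphore(MAX_IN_FLIGHT)

            def on_complete(completion):
                # Called by librados when one delete finishes; free a slot.
                slots.release()

            for obj in ioctx.list_objects():
                slots.acquire()  # block while MAX_IN_FLIGHT deletes are pending
                ioctx.aio_remove(obj.key, oncomplete=on_complete)

            # Drain: once we can take back every slot, all deletes have completed.
            for _ in range(MAX_IN_FLIGHT):
                slots.acquire()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    purge_namespace()

Even throttled like this, the client still sends one op per object, which is exactly the "huge number of ops to do one logical thing" cost John raises and which the OSD-internal options (scrub-time cleanup or a trash-namespace queue) were meant to avoid.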