On Tue, May 23, 2017 at 6:10 AM John Spray <jspray@xxxxxxxxxx> wrote:
>
> Soon, we'll probably be letting multiple CephFS filesystems use the
> same data and metadata pools, where the filesystems are separated by
> rados namespace.
>
> When removing filesystems, in the interests of robustness and speed,
> I'd like to be able to delete all objects in a namespace -- otherwise
> we would have to rely on a "rm -rf" and then some new code to
> explicitly enumerate and delete all metadata objects for that
> filesystem.
>
> I'm pondering whether this should just be a process that happens via
> normal client interfaces (where mon/mgr would be the client), or
> whether it would be feasible/desirable to implement something inside
> the OSD. Obviously the OSD ultimately has to do the same underlying
> enumeration, but at least it doesn't have to thrash through the whole
> request/response cycle for deleting each object individually -- might
> enable it to throttle internally in the OSD based on how busy it knows
> itself to be, rather than having the client apply some arbitrary "only
> issue N deletions at once" type limit that might make the deletion
> process unnecessarily slow.
>
> I have a feeling we must have talked about this at some point but my
> memory is failing me...

This is an interesting thought and I don't think it's been discussed
before. Some random things:
1) We tend to suck at throttling stuff in the OSD (although we're
getting close to having the right pattern).
2) Deleting objects is a *lot* more expensive than listing them; David's
right that it would probably look like pgnls, but the costs are
definitely not analogous. For the listing we just go through omap keys,
but here we have to shove deletes into XFS (...for now. I suppose
they're actually pretty similar in BlueStore, though I don't really know
the cost of a delete).
3) If this happens inside the OSD it will be much harder to bill it
against client IO. Not sure if that's relevant given it's removal,
but...
4) I'm actually not sure how we could best do this internally if we
wanted to. Deletes, as noted, would have to take a while, which means it
would probably be a very long-lived operation, much longer than e.g. 30
seconds. Or else we'd need a whole internal queue of stuff to delete,
where the client "op" just queues up the namespace and the actual
deletes get spaced out much later... and then that would imply more new
checks around whether something is logically deleted even though it
still exists in our data store. (And what happens if a user starts
putting new objects in the namespace while the delete is still going on?
Is that blocked somehow? Or does the OSD need to handle it with a way of
distinguishing between namespace "epoch" 1 and 2?)

Overall I definitely see the appeal of wanting to do gross deletes like
that, but it'll be at least a little trickier than I suspect people are
considering. Given the semantic confusion and the implied versus real
costs, I'm generally not really a fan of allowing stuff like this on
units other than pools/PGs, and I'm not sure when the bandwidth to
implement it might come up. How important do you think it really is
compared to doing "rm -rf"?
-Greg
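
To make the queued-delete idea in point 4 concrete, here is a minimal
sketch in Python of one way the namespace "epoch" distinction could
work. It is a toy in-memory model, not Ceph code: the class
NamespaceStore, its methods, and the reap budget are all hypothetical
names invented for illustration, and nothing here reflects how the OSD
is actually structured. The namespace delete only bumps an epoch and
queues the old objects; reads treat older-epoch objects as logically
deleted; a separate reaper removes them a few at a time.

```python
from collections import deque


class NamespaceStore:
    """Toy stand-in for one object store (hypothetical, not Ceph code)."""

    def __init__(self):
        self.objects = {}       # (namespace, name) -> (epoch, data)
        self.epoch = {}         # namespace -> current epoch (default 1)
        self.pending = deque()  # (namespace, name, epoch) awaiting removal

    def _current(self, ns):
        return self.epoch.get(ns, 1)

    def write(self, ns, name, data):
        # New writes are tagged with the namespace's current epoch, so
        # they are never confused with objects from a deleted epoch.
        self.objects[(ns, name)] = (self._current(ns), data)

    def read(self, ns, name):
        entry = self.objects.get((ns, name))
        # An object from an older epoch is logically deleted even
        # though it may still physically exist in the store.
        if entry is None or entry[0] < self._current(ns):
            return None
        return entry[1]

    def delete_namespace(self, ns):
        # The client-visible "op": bump the epoch and queue the old
        # objects. No object is physically removed here.
        old = self._current(ns)
        self.epoch[ns] = old + 1
        for (obj_ns, name), (ep, _) in self.objects.items():
            if obj_ns == ns and ep <= old:
                self.pending.append((ns, name, ep))

    def reap(self, budget):
        # Background step: physically remove at most `budget` stale
        # objects per call, so deletes can be spaced out over time.
        removed = 0
        while self.pending and removed < budget:
            ns, name, ep = self.pending.popleft()
            entry = self.objects.get((ns, name))
            # Skip entries that were overwritten in a newer epoch or
            # that are already gone.
            if entry is not None and entry[0] == ep:
                del self.objects[(ns, name)]
                removed += 1
        return removed
```

In this model a write that arrives while the delete is still draining
simply lands in the new epoch and survives reaping, which answers the
"is that blocked somehow?" question one particular way. The extra epoch
check on every read is an example of the "more new checks" cost
mentioned above, and the sketch does nothing about the billing or
throttling concerns in points 1 and 3.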