I've seen a couple of cases in the last two weeks where users have removed OSDs from their clusters too soon and either lost data or had to scramble to re-add OSDs. I'm wondering how we can improve the CLI/tools to make this situation harder to get into.

In luminous, we added a few new commands:

- ceph osd destroy: zap info about an OSD but keep its ID in place (with a 'destroyed' flag) so that it can be recreated with a replacement device.
- ceph osd purge: zap everything about an OSD, including the ID.

Once these commands are run, it is hard to re-add the OSD device back into the cluster if the operator realizes there are PGs stuck peering or unfound objects.

There are two other new commands:

- ceph osd ok-to-stop: checks whether it looks like PGs will remain available even if the specified OSD(s) are stopped.
- ceph osd safe-to-destroy: checks whether it is safe to destroy an OSD. This does various checks to ensure there is no data on the OSD(s), no unfound objects, no stuck peering, and so forth.

One could argue that the users who got into trouble should have run 'ceph osd safe-to-destroy' before removing the devices. That should have avoided their problem, but we should do what we can to make it hard for them to *not* see the safety check.

So, two ideas:

First, let's get rid of 'ceph osd rm'. Users should use destroy or purge; this command only does part of the job and has an unclear purpose now that the others exist.

Second, change destroy and purge to do the safe-to-destroy check. If the check passes, do the removal. If the check fails, issue a warning (with some detail in the error string) and require the --yes-i-really-mean-it option. (Currently, these commands *always* require the force flag but don't perform the safety check.)

The main downside to this is that current scripts (well, post-luminous scripts) may include the force flag (since it used to be unconditionally required) and miss the safety check.
On the other hand, they are already ignoring the safety check, so they are no worse off... they just see no benefit.

Thoughts?
sage
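
P.S. To make the second idea concrete, here is a rough sketch of the proposed behavior written as a client-side Python wrapper. It is only an illustration, not the monitor implementation. The names guarded_remove, ceph_safe_to_destroy, and run_ceph are hypothetical, and the sketch assumes that 'ceph osd safe-to-destroy' exits non-zero when the OSD is not safe to destroy. Because the current commands always require the force flag, the wrapper always passes it once its own check has decided to proceed.

```python
import subprocess
from typing import Callable, List, Tuple

Check = Callable[[int], Tuple[bool, str]]
Runner = Callable[[List[str]], None]


def ceph_safe_to_destroy(osd_id: int) -> Tuple[bool, str]:
    """Ask the cluster whether the OSD can be destroyed.

    Assumption: a non-zero exit status means "not safe", and the
    explanation is printed on stderr (falling back to stdout).
    """
    proc = subprocess.run(
        ["ceph", "osd", "safe-to-destroy", str(osd_id)],
        capture_output=True,
        text=True,
    )
    detail = (proc.stderr or proc.stdout).strip()
    return proc.returncode == 0, detail


def run_ceph(argv: List[str]) -> None:
    """Run a ceph command and raise if it fails."""
    subprocess.run(argv, check=True)


def guarded_remove(
    osd_id: int,
    action: str,
    force: bool = False,
    check: Check = ceph_safe_to_destroy,
    run: Runner = run_ceph,
) -> bool:
    """Destroy or purge an OSD only if the safety check passes or force is set.

    Returns True if the removal command was issued, False if it was refused.
    """
    if action not in ("destroy", "purge"):
        raise ValueError("action must be 'destroy' or 'purge'")

    safe, detail = check(osd_id)
    if not safe and not force:
        # Proposed behavior: warn with some detail and require the override.
        print(f"refusing to {action} osd.{osd_id}: {detail}")
        print("re-run with force=True (--yes-i-really-mean-it) to override")
        return False

    # Today's destroy/purge always require the flag, so pass it unconditionally.
    run(["ceph", "osd", action, str(osd_id), "--yes-i-really-mean-it"])
    return True
```

The check and run parameters are injectable so the decision logic can be exercised without a cluster. A script that already passes the force flag unconditionally would map to force=True here, which is exactly the "no worse off, but no benefit" case described above.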