On Tue, Jun 19, 2018 at 12:02 PM, Sage Weil <sage@xxxxxxxxxx> wrote:
>
> I've seen a couple of cases in the last two weeks where users have removed
> OSDs from their clusters too soon and either lost data or had to scramble
> to re-add OSDs. I'm wondering how we can improve the CLI/tools to make
> this situation harder to get into.
>
> In luminous, we added a few new commands:
>
> - ceph osd destroy: zap info about an OSD but keep its ID in place (with
>   a 'destroyed' flag) so that it can be recreated with a replacement device.
> - ceph osd purge: zap everything about an OSD, including the ID
>
> Once these commands are run it is hard to re-add the OSD device back into
> the cluster if the operator realizes there are PGs stuck peering or
> unfound objects.
>
> There are two other new commands:
>
> - ceph osd ok-to-stop: Checks whether it looks like PGs will remain
>   available even if the specified OSD(s) are stopped.
> - ceph osd safe-to-destroy: Checks whether it is safe to destroy an OSD.
>   This does various checks to ensure there is no data on the OSD(s), no
>   unfound objects, no stuck peering, and so forth.
>
> One could argue that the users who got into trouble should have run
> 'ceph osd safe-to-destroy' before removing the devices. That should have
> avoided their problem, but we should do what we can to make it hard
> for them to *not* see the safety check.
>
> So, two ideas:
>
> First, let's get rid of 'ceph osd rm'. Users should use destroy or purge;
> this command only does part of the job and has an unclear purpose
> now that the others exist.
>
> Second, change destroy and purge to do the safe-to-destroy check. If the
> check passes, do the removal. If the check fails, issue a warning (with
> some detail in the error string) and require the --yes-i-really-mean-it
> option. (Currently, these commands *always* require the force flag
> but don't perform the safety check.)
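
For reference, the check-then-remove flow in the second idea above could look roughly like the sketch below. This is not Ceph code: `guarded_remove`, `Result`, and the two callables are hypothetical names, and the safety check is passed in as a plain function so the example runs without a cluster.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Result:
    removed: bool   # True if the OSD was actually removed
    message: str    # text that would be shown to the operator


def guarded_remove(
    osd_id: int,
    safe_to_destroy: Callable[[int], Tuple[bool, str]],  # hypothetical check: (ok, detail)
    do_remove: Callable[[int], None],                    # hypothetical destroy/purge action
    force: bool = False,                                 # stands in for --yes-i-really-mean-it
) -> Result:
    """Run the safety check first; only require the force flag if it fails."""
    ok, detail = safe_to_destroy(osd_id)
    if ok:
        # Check passed: remove without asking for the force flag.
        do_remove(osd_id)
        return Result(True, f"osd.{osd_id} removed")
    if force:
        # Check failed but the operator explicitly overrode it.
        do_remove(osd_id)
        return Result(
            True, f"osd.{osd_id} removed despite failed safety check: {detail}"
        )
    # Check failed and no override: refuse, and say why.
    return Result(
        False,
        f"osd.{osd_id} is not safe to destroy: {detail}; "
        "pass --yes-i-really-mean-it to override",
    )
```

With this shape, a script that always passes the force flag behaves as it does today, while an interactive user who omits it gets the safety check for free.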

+1 on doing the verification internally. On a different note, renaming
'osd destroy' to 'osd replace' would make its purpose clearer to end users.

>
> The main downside to this is that current scripts (well, post-luminous
> scripts) may include the force flag (since it used to be unconditionally
> required) and miss the safety check. On the other hand, they already
> ignore the safety check, so they are no worse off... they just see no
> benefit.
>
> Thoughts?
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html