Re: making osd removal safer

Vasu Kulkarni <vakulkar@xxxxxxxxxx> · Tue, 19 Jun 2018 17:59:03 -0700



On Tue, Jun 19, 2018 at 12:02 PM, Sage Weil <sage@xxxxxxxxxx> wrote:
>
> I've seen a couple of cases in the last two weeks where users have removed
> OSDs from their clusters too soon and either lost data or had to scramble
> to re-add OSDs.  I'm wondering how we can improve the CLI/tools to make
> this situation harder.
>
> In luminous, we added a few new commands:
>
> - ceph osd destroy: zap info about an OSD but keep its ID in place (with
> a 'destroyed' flag) so that it can be recreated with a replacement device.
> - ceph osd purge: zap everything about an OSD, including the ID
>
> Once these commands are run it is hard to re-add the OSD device back into
> the cluster if the operator realizes there are PGs stuck peering or
> unfound objects.
>
> There are two other new commands:
>
> - ceph osd ok-to-stop: Checks whether it looks like PGs will remain
> available even if the specified OSD(s) are stopped.
> - ceph osd safe-to-destroy: Checks whether it is safe to destroy an OSD.
> This does various checks to ensure there is no data on the OSD(s), no
> unfound objects, stuck peering, and so forth.
>
> One could argue that the users who got into trouble should have run
> 'ceph osd safe-to-destroy' before removing the devices.  That should have
> avoided their problem, but we should do what we can to make it hard
> for them to *not* see the safety check.
>
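
For operators scripting this today, a minimal sketch of "check first, then
destroy" might look like the following. It assumes 'ceph osd safe-to-destroy'
exits with status 0 only when the OSD is safe to destroy. The function name
wait_then_destroy and the injectable run parameter are hypothetical helpers
for illustration, not part of any Ceph tooling.

```python
import subprocess
import time


def wait_then_destroy(osd_id, run=subprocess.call, poll_seconds=60, max_polls=None):
    """Poll 'ceph osd safe-to-destroy' and destroy the OSD only once it passes.

    `run` is any callable that takes an argv list and returns an exit code.
    It defaults to subprocess.call and is injectable so the flow can be
    exercised without a cluster.  Returns True if the destroy command ran
    and succeeded, False if the check never passed within `max_polls` tries.
    """
    polls = 0
    # Assumption: exit status 0 means "safe", anything else means "not yet".
    while run(["ceph", "osd", "safe-to-destroy", str(osd_id)]) != 0:
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(poll_seconds)
    # The force flag is still required by destroy as it exists today.
    return run(["ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it"]) == 0
```

Calling wait_then_destroy(7) would block until the check passes for osd 7;
passing max_polls bounds the wait so a stuck cluster does not hang the script.
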
> So, two ideas:
>
> First, let's get rid of 'ceph osd rm'.  Users should use destroy or purge;
> this command only does part of the job and has an unclear purpose
> now that the others exist.
>
> Second, change destroy and purge to do the safe-to-destroy check.  If the
> check passes, do the removal.  If the check fails, issue a warning (with
> some detail in the error string) and require the --yes-i-really-mean-it
> option.  (Currently, these commands *always* require the force flag
> but don't perform the safety check.)

+1 on doing the verification internally. On a different note, if we can
rename osd destroy to osd replace, that would be clearer to the end user.
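
To make the second idea concrete, here is a rough Python model of the
proposed control flow: run the safety check, remove on success, and only
honor the force flag when the check fails. This is a sketch of the proposal
as described, not the monitor implementation. remove_osd and the
safe_to_destroy callable (returning an (ok, detail) pair) are hypothetical
names.

```python
def remove_osd(osd_id, safe_to_destroy, force=False):
    """Model of the proposed destroy/purge behaviour.

    `safe_to_destroy` is a hypothetical callable taking an OSD id and
    returning (ok, detail).  `force` stands in for --yes-i-really-mean-it.
    Returns (removed, message).
    """
    ok, detail = safe_to_destroy(osd_id)
    if ok:
        # Check passed: no force flag needed.
        return True, "osd.%d removed" % osd_id
    if force:
        # Check failed but the operator explicitly overrode it.
        return True, "osd.%d removed despite failed check: %s" % (osd_id, detail)
    # Check failed and no override: refuse, and say why.
    return False, (
        "osd.%d not safe to destroy: %s; "
        "pass --yes-i-really-mean-it to override" % (osd_id, detail)
    )
```

Note that a script which always passes the force flag takes the second
branch whenever the check fails, so it behaves exactly as it does today.
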

>
>
> The main downside to this is that current scripts (well, post-luminous
> scripts) may include the force flag (since it used to be unconditionally
> required) and miss the safety check.  On the other hand, they already
> ignore the safety check, so they are no worse off... they just see no
> benefit.
>
> Thoughts?
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html