Re: making osd removal safer

John Spray <jspray@xxxxxxxxxx> · Wed, 20 Jun 2018 11:37:53 +0100

On Tue, Jun 19, 2018 at 8:02 PM Sage Weil <sage@xxxxxxxxxx> wrote:
>
> I've seen a couple of cases in the last two weeks where users have removed
> OSDs from their clusters too soon and either lost data or had to scramble
> to re-add OSDs.  I'm wondering how we can improve the CLI/tools to make
> this situation harder.
>
> In luminous, we added a few new commands:
>
> - ceph osd destroy: zap info about an OSD but keep its ID in place (with
> a 'destroyed' flag) so that it can be recreated with a replacement device.
> - ceph osd purge: zap everything about an OSD, including the ID
>
> Once these commands are run it is hard to re-add the OSD device back into
> the cluster if the operator realizes there are PGs stuck peering or
> unfound objects.
>
> There are two other new commands:
>
> - ceph osd ok-to-stop: Checks whether it looks like PGs will remain
> available even if the specified OSD(s) are stopped.
> - ceph osd safe-to-destroy: Checks whether it is safe to destroy an OSD.
> This does various checks to ensure there is no data on the OSD(s), no
> unfound objects, stuck peering, and so forth.
>
> One could argue that the users who got into trouble should have run
> 'ceph osd safe-to-destroy' before removing the devices.  That should have
> avoided their problem, but we should do what we can to make it hard
> for them to *not* see the safety check.
>
> So, two ideas:
>
> First, let's get rid of 'ceph osd rm'.  Users should use destroy or purge;
> this command only does part of the job and has an unclear purpose
> now that the others exist.

+1

> Second, change destroy and purge to do the safe-to-destroy check.  If the
> check passes, do the removal.  If the check fails, issue a warning (with
> some detail in the error string) and require the --yes-i-really-mean-it
> option.  (Currently, these commands *always* require the force flag
> but don't perform the safety check.)

Yes, and in the case where the check fails, we can be helpful by
perhaps marking the OSD out to prompt data to migrate away, and/or
giving a message explaining "there's X MB of data on here, but we're
migrating it away (at Y objects per second)".
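
As a rough sketch of the flow being proposed (this is not actual Ceph
code; SafetyCheck and decide_removal are hypothetical names used only
for illustration), the decision could look something like:

```python
from dataclasses import dataclass


@dataclass
class SafetyCheck:
    """Hypothetical result of a safe-to-destroy style check."""
    safe: bool
    detail: str = ""  # e.g. "3 pgs still mapped to this OSD"


def decide_removal(check: SafetyCheck, force: bool):
    """Return (proceed, message) for a destroy/purge request.

    Mirrors the proposal: a passing check needs no flag, a failing
    check is only overridden by the force flag.
    """
    if check.safe:
        return True, "safe-to-destroy check passed; removing OSD"
    if force:
        return True, f"WARNING: removing OSD despite failed check: {check.detail}"
    return False, (
        f"refusing to remove OSD: {check.detail}; "
        "pass --yes-i-really-mean-it to override"
    )
```

The refusal message is where the extra detail (data remaining, migration
progress) would go.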

This is a bit OT, but I'm a bit dubious about continuing to use
"--yes-i-really-mean-it" flags instead of a simpler "--force" flag.
The yes-i-really-mean-it sometimes feels like a Ceph in-joke, when
most other software calls it --force or similar.  Force is maybe a bit
too easy to type, but a lot of our users seem to have
yes-i-really-mean-it in their muscle memory too :-/

> The main downside to this is that current scripts (well, post-luminous
> scripts) may include the force flag (since it used to be unconditionally
> required) and miss the safety check.  On the other hand, they already
> ignore the safety check, so they are no worse off... they just see no
> benefit.

I agree -- any script which is trying to do a purge unconditionally
will get what it asked for!
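
For scripts that do want the check today, a minimal sketch of a wrapper
follows.  It assumes that 'ceph osd safe-to-destroy' exits non-zero when
the OSD is not safe to destroy; purge_if_safe and its injectable run
parameter are hypothetical and exist only to keep the example testable.

```python
import subprocess


def purge_if_safe(osd_id, run=subprocess.run):
    """Purge an OSD only if the safe-to-destroy check succeeds.

    Returns True if the purge was issued, False if the check failed.
    """
    # Assumption: a non-zero exit status means "not safe to destroy".
    check = run(["ceph", "osd", "safe-to-destroy", str(osd_id)])
    if check.returncode != 0:
        return False
    # The force flag is currently always required by purge.
    run(
        ["ceph", "osd", "purge", str(osd_id), "--yes-i-really-mean-it"],
        check=True,
    )
    return True
```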

John

>
> Thoughts?
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html