On Thu, Aug 3, 2017 at 11:42 AM, Peter Maloney
<peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 08/03/17 11:05, Dan van der Ster wrote:
>
> On Fri, Jul 28, 2017 at 9:42 PM, Peter Maloney
> <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hello Dan,
>
> Based on what I know and what people told me on IRC, this basically means
> the condition that the OSD is neither acting nor up for any PG. And one
> person on IRC (fusl) said he hit an unfound-objects bug when he had
> size = 1; he also said that if reweight (and I assume crush weight) is 0
> it will surely be safe, but possibly it won't be otherwise.
>
> So I took my bc-ceph-reweight-by-utilization.py script, which already
> parses ceph pg dump --format=json (for up, acting, bytes, and PG counts)
> and ceph osd df --format=json (for weight and reweight), gutted the
> unneeded parts, and changed the report to show the condition I described
> as True or False per OSD. The ceph auth therefore needs to allow
> ceph pg dump and ceph osd df. The script is attached.
>
> The script doesn't assume you're OK with acting lower than size, or that
> you care about min_size; it just assumes you want the OSD completely empty.
>
> Thanks for this script. In fact, I am trying to use the min_size/size-based
> removal heuristics. If we were able to wait until an OSD is completely
> empty, then I suppose we could just set the crush weight to 0 and wait for
> HEALTH_OK. For our procedures I'm trying to shortcut this with an earlier
> device removal.
>
> Cheers, Dan
>
> Well, what this is intended for is that you can set some OSDs to weight 0,
> then later set others to weight 0, etc., and before all of them are done
> you can remove the ones the script identifies (no PGs are on that disk,
> even if other PGs are still being moved on other disks). So it's a
> shortcut, but only by gaining knowledge, not by sacrificing redundancy.
>
> And I wasn't sure what you preferred... I definitely prefer to have my full
> size achieved, not just min_size, if I'm going to remove something. It's
> just like how you don't run RAID5 on large disks, instead using RAID6 and
> replacing only one disk at a time so you still have redundancy.
>
> What do you use such that keeping redundancy isn't important?

In general yes, of course we want up/acting = size, but there are some
exceptions, for example:

- A disk shows a SMART error but the OSD is still running. Can we stop this
  OSD now and swap it out? It would be slightly unsafe to let this drive
  continue (it could read out bad data). There can be more than one drive in
  this state in the cluster -- such a tool could let the repair service know
  whether they can remove more than one drive at once.

- A host has a faulty memory module and needs a power-off intervention. Is
  it safe right now to shut down this host? (In general they should only
  power off a host when we have HEALTH_OK, but there may be occasions when a
  reboot is OK despite HEALTH_WARN.)

Cheers, Dan
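
[Editor's note: a minimal sketch of the check Peter describes above -- an OSD
counts as "empty" when it appears in no PG's up or acting set. This is not the
attached bc-ceph-reweight-by-utilization.py; the JSON field names used here
(pg_map/pg_stats, up, acting, nodes, id, crush_weight, reweight) are
assumptions about the --format=json output and may differ between Ceph
releases.]

#!/usr/bin/env python3
# Sketch: report, per OSD, whether no PG lists it in up or acting.
# Field names are assumptions and may vary by Ceph release.
import json
import subprocess
from collections import defaultdict


def ceph_json(*args):
    """Run a ceph subcommand with --format=json and parse its output."""
    out = subprocess.check_output(["ceph"] + list(args) + ["--format=json"])
    return json.loads(out)


def pg_count_per_osd():
    """Count how many PGs list each OSD in their up or acting set."""
    dump = ceph_json("pg", "dump")
    # Some releases nest the stats under "pg_map", others keep them top level.
    pg_stats = dump.get("pg_map", dump).get("pg_stats", [])
    counts = defaultdict(int)
    for pg in pg_stats:
        for osd in set(pg.get("up", [])) | set(pg.get("acting", [])):
            counts[osd] += 1
    return counts


def main():
    counts = pg_count_per_osd()
    osd_df = ceph_json("osd", "df")
    print("%-6s %-13s %-9s %s" % ("osd", "crush_weight", "reweight", "no_pgs"))
    for node in osd_df.get("nodes", []):
        osd_id = node["id"]
        empty = counts.get(osd_id, 0) == 0
        print("%-6s %-13s %-9s %s"
              % (osd_id, node.get("crush_weight"), node.get("reweight"), empty))


if __name__ == "__main__":
    main()

[As Peter notes, the key used must be allowed to run ceph pg dump and
ceph osd df; an OSD whose no_pgs column shows True matches the "not acting
nor up for any PG" condition discussed in the thread.]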