Re: Removing an OSD node the right way

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Fri, 3 Dec 2021 13:13:25 +0100

Hi,

This is indeed the expected behaviour.

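The in/out state is used as a second weight factor in the OSD placement algorithm, on top of the crush weight.
So crush weight 1 with weight 0 (out) is not equivalent to crush weight 0: an out OSD still contributes its crush weight to its host bucket, so removing it from the CRUSH map later changes the host's weight and moves data a second time.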

The correct way to decommission OSDs or hosts is to decrease their crush weight to 0 and let the data drain before removing them.
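
As a rough sketch of that approach (not a tested procedure for your cluster), something like the following prints the commands I would expect to run. The helper names drain_cmds and purge_cmds and the OSD ids 12 and 13 are made up for illustration; "ceph osd purge" assumes Luminous or later. Review the output before running any of it.

```shell
#!/bin/sh
# Sketch only: prints commands, it does not run them.
# drain_cmds and purge_cmds are hypothetical helper names, not part of Ceph.

# Step 1: set the CRUSH weight to 0 so data moves off each OSD exactly once.
drain_cmds() {
    for id in "$@"; do
        echo "ceph osd crush reweight osd.${id} 0"
    done
}

# Step 2: only after backfilling has finished and the OSD daemons are stopped.
# Removing an OSD whose CRUSH weight is already 0 should not change placement.
purge_cmds() {
    for id in "$@"; do
        echo "ceph osd out osd.${id}"
        echo "ceph osd purge osd.${id} --yes-i-really-mean-it"
    done
}

# Example with placeholder OSD ids 12 and 13.
drain_cmds 12 13
```

Between the two steps, "ceph -s" and "ceph osd tree" (the WEIGHT and REWEIGHT columns) show whether the drain is complete and which of the two weights each OSD currently has.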

Cheers, Dan

On Fri, Dec 3, 2021 at 1:08 PM huxiaoyu@xxxxxxxxxxxx
<huxiaoyu@xxxxxxxxxxxx> wrote:
>
> Dear Cephers,
>
> I had to remove a failed OSD server node, and this is what I did:
> 1) First, I marked all OSDs on the server to be removed down and out.
> 2) Then I let Ceph backfill and rebalance, and waited for it to complete.
> 3) With full redundancy restored, I deleted the removed OSDs from the cluster, e.g. ceph osd crush remove osd.${OSD_NUM}
> 4) To my surprise, after removing those already-out OSDs from the cluster, I saw a large number of PGs remapped and BACKFILLING/REBALANCING once again.
>
> What is the major problem with the above procedure that caused the double BACKFILLING/REBALANCING? Could the root cause be the OSDs that were already out but not yet removed from CRUSH? I previously thought that "out" OSDs would not affect CRUSH, but it seems I was wrong.
>
> Any suggestions, comments, explanations are highly appreciated,
>
> Best regards,
>
> Samuel
>
>
>
> huxiaoyu@xxxxxxxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx