Dear Cephers,

I had to remove a failed OSD server node, and what I did is the following:

1) First, marked all OSDs on that (to-be-removed) server down and out.
2) Secondly, let Ceph do the backfilling and rebalancing, and waited for it to complete.
3) Now that I had full redundancy again, I deleted those removed OSDs from the cluster, e.g. ceph osd crush remove osd.${OSD_NUM}
4) To my surprise, after removing those already-out OSDs from the cluster, I saw a ton of PGs remapped and, once again, BACKFILLING/REBALANCING.

What is the major problem with the above procedure that caused the double BACKFILLING/REBALANCING? Could the root cause be those "already-out" OSDs that had not yet been removed from CRUSH? I previously thought that "out" OSDs would not impact CRUSH, but it seems I was wrong.

Any suggestions, comments, or explanations are highly appreciated.

Best regards,

Samuel

huxiaoyu@xxxxxxxxxxxx
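
P.S. In case it helps, here is roughly the command sequence behind steps 1)-3), written from memory (osd.${OSD_NUM} stands for each OSD on the failed host; the status commands are simply what I used to watch progress):

  # for each OSD that lived on the failed host
  ceph osd down osd.${OSD_NUM}
  ceph osd out osd.${OSD_NUM}

  # watch backfill/rebalance until the cluster is healthy again
  ceph -s
  ceph health detail

  # only after full redundancy was restored, remove the OSD from the CRUSH map
  ceph osd crush remove osd.${OSD_NUM}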