On Fri, 3 Dec 2021 at 13:08, huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:
>
> Dear Cephers,
>
> I had to remove a failed OSD server node, and what I did was the following:
> 1) First marked all OSDs on that (to-be-removed) server down and out
> 2) Let Ceph do the backfilling and rebalancing, and waited for it to complete
> 3) Now that I had full redundancy again, deleted those removed OSDs from the cluster, e.g. ceph osd crush remove osd.${OSD_NUM}
> 4) To my surprise, after removing those already-out OSDs from the cluster, I saw tons of PGs remapped and once again BACKFILLING/REBALANCING
>
> What is the major problem with the above procedure that caused the double BACKFILLING/REBALANCING? Could the root cause be those "already-out" OSDs that were not yet removed from CRUSH? I previously thought "out" OSDs would not affect CRUSH, but it seems I was wrong.

If an OSD is still in the CRUSH map with a CRUSH weight, it still "claims" to contribute capacity to its host, even if it carries no active PGs. That also means the first round of movement more or less placed the PGs in temporary locations (as if waiting for the host to come back so they could return there later). When the host finally goes away for real, the CRUSH map is recalculated and PGs move again, not only the previously moved PGs but also others, because there are now fewer hosts and PGs should relocate anyhow.

Consider having 4 hosts and 12 PGs:

H1    H2    H3    H4
PG1   PG2   PG3   PG4
PG5   PG6   PG7   PG8
PG9   PG10  PG11  PG12

Then you mark the OSDs on Host4 out in order to remove it, i.e. the ones holding PG4, PG8 and PG12. In this first step, PG4, PG8 and PG12 end up on H1, H2 and H3 in some order, just to "recover" them back to full redundancy as a repair operation. H4 is still thought to exist, but carries no active PGs:

H1    H2    H3    H4
PG1   PG2   PG3   -
PG5   PG6   PG7   -
PG9   PG10  PG11  -
PG4   PG8   PG12  -

When you finally make H4 go away, the new "normal" placement would probably be:

H1    H2    H3
PG1   PG2   PG3
PG4   PG5   PG6
PG7   PG8   PG9
PG10  PG11  PG12

which means that PG5 now needs to move from H1 to H2, PG6 from H2 to H3, PG7 from H3 to H1, and so on.

This is of course a simplified view of the algorithm, but in general this is how I experience it, and it shows why PG5 moves after you completely remove H4, even though PG5 was never on H4 and was not involved in the first round of movement at all.

So the CRUSH result while the host is still in the map differs from the result once it is completely gone, which is why you see two rounds of movement, and why the second round can be much larger. In my example, PG4, PG8 and PG12 happened to land on their final hosts, but they could just as well have ended up as 12, 8, 4 instead of 4, 8, 12, in which case they would move again along with PG5, PG6 and so on.

On top of this, you have not just "a PG" but Replication=X copies, or K+M EC shards, of each PG, and not 12 PGs but hundreds or thousands. You also often have anywhere from two to twenty OSDs on every OSD host, so in the final CRUSH recalculation, when Host4 is gone, you may also see PGs move between OSDs within a host, simply because a PG is now meant to sit on another drive on the same host.

--
May the most significant bit of your life be positive.
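A practical corollary of the above: if the host is being retired anyway, the data only has to move once if CRUSH sees the final layout up front, i.e. if the OSDs' CRUSH weight is dropped before anything is marked out. Below is a minimal sketch of that idea using standard ceph CLI commands; osd.<N> is a placeholder for each OSD on the host being retired, and the reweight-first ordering is a suggestion on my part, not something from the original post:

    # Take the OSD's capacity out of CRUSH first, so placement is
    # recalculated once against the final layout:
    ceph osd crush reweight osd.<N> 0
    ceph osd out osd.<N>

    # Wait for backfill/rebalance to finish, then remove the OSD for good;
    # with its CRUSH weight already 0, this should not move data again:
    ceph osd crush remove osd.<N>
    ceph auth del osd.<N>
    ceph osd rm osd.<N>

Repeat for every OSD on the host, and once the host bucket is empty it can be removed from the CRUSH map with "ceph osd crush remove <hostname>".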