On Fri, 3 Dec 2021 at 13:08, huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:
>
> Dear Cephers,
>
> I had to remove a failed OSD server node, and what I did was the following:
> 1) First marked all OSDs on that (to-be-removed) server down and out
> 2) Let Ceph do the backfilling and rebalancing, and waited for it to complete
> 3) Now that I had full redundancy again, deleted those removed OSDs from the cluster, e.g. ceph osd crush remove osd.${OSD_NUM}
> 4) To my surprise, after removing those already-out OSDs from the cluster, I saw tons of PGs remapped and once again BACKFILLING/REBALANCING
>
> What is the major problem with the above procedure that caused the double BACKFILLING/REBALANCING? Could the root cause be those "already-out" OSDs that were not yet removed from CRUSH? I previously thought "out" OSDs would not affect CRUSH, but it seems I was wrong.

If an OSD is still in the CRUSH map with a CRUSH weight, it still "claims" to contribute capacity to its host, even if it carries no active PGs. That also means the first round of movement more or less placed the PGs in temporary locations (as if waiting for the host to come back so they could return there later). When the host finally goes away for real, the CRUSH map is recalculated and PGs move again, not only the previously moved PGs but also others, because there are now fewer hosts and PGs should relocate anyhow.

Consider having 4 hosts and 12 PGs:

H1    H2    H3    H4
PG1   PG2   PG3   PG4
PG5   PG6   PG7   PG8
PG9   PG10  PG11  PG12

Then you mark the OSDs on Host4 out in order to remove it, i.e. the ones holding PG4, PG8 and PG12. In this first step, PG4, PG8 and PG12 end up on H1, H2 and H3 in some order, just to "recover" them back to full redundancy as a repair operation. H4 is still thought to exist, but carries no active PGs:

H1    H2    H3    H4
PG1   PG2   PG3   -
PG5   PG6   PG7   -
PG9   PG10  PG11  -
PG4   PG8   PG12  -

When you finally make H4 go away, the new "normal" placement would probably be:

H1    H2    H3
PG1   PG2   PG3
PG4   PG5   PG6
PG7   PG8   PG9
PG10  PG11  PG12

which means that PG5 now needs to move from H1 to H2, PG6 from H2 to H3, PG7 from H3 to H1, and so on.

This is of course a simplified view of the algorithm, but in general this is how I experience it, and it shows why PG5 moves after you completely remove H4, even though PG5 was never on H4 and was not involved in the first round of movement at all.

So the CRUSH result while the host is still in the map differs from the result once it is completely gone, which is why you see two rounds of movement, and why the second round can be much larger. In my example, PG4, PG8 and PG12 happened to land on their final hosts, but they could just as well have ended up as 12, 8, 4 instead of 4, 8, 12, in which case they would move again along with PG5, PG6 and so on.

On top of this, you have not just "a PG" but Replication=X copies, or K+M EC shards, of each PG, and not 12 PGs but hundreds or thousands. You also often have anywhere from two to twenty OSDs on every OSD host, so in the final CRUSH recalculation, when Host4 is gone, you may also see PGs move between OSDs within a host, simply because a PG is now meant to sit on another drive on the same host.

--
May the most significant bit of your life be positive.
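A practical corollary of the above: if the host is being retired anyway, the data only has to move once if CRUSH sees the final layout up front, i.e. if the OSDs' CRUSH weight is dropped before anything is marked out. Below is a minimal sketch of that idea using standard ceph CLI commands; osd.<N> is a placeholder for each OSD on the host being retired, and the reweight-first ordering is a suggestion on my part, not something from the original post:

    # Take the OSD's capacity out of CRUSH first, so placement is
    # recalculated once against the final layout:
    ceph osd crush reweight osd.<N> 0
    ceph osd out osd.<N>

    # Wait for backfill/rebalance to finish, then remove the OSD for good;
    # with its CRUSH weight already 0, this should not move data again:
    ceph osd crush remove osd.<N>
    ceph auth del osd.<N>
    ceph osd rm osd.<N>

Repeat for every OSD on the host, and once the host bucket is empty it can be removed from the CRUSH map with "ceph osd crush remove <hostname>".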