Janne explained the reason for the double movement. There are two scenarios for phasing out OSDs, and they have different procedures:

1) Replacing a broken disk/host with a new one

If you just want to replace a disk or host, the idea is to keep the OSD IDs in the CRUSH map so that data movement is reduced to a minimum. There is even a special command for that, "ceph osd destroy <osdname>", which keeps the ID intact. You can then re-create the OSD with the same ID on the same host and rebuild the disk/host without much data movement. There is not really a need to wait for full recovery before replacing the disk/host.

2) Phasing out OSDs altogether

I'm actually not sure this ever really happens. I haven't heard of clusters shrinking in size; hardware is usually replaced. If you really need to remove OSDs and will never add OSDs again, you can purge them from the cluster. However, this leaves holes in the sequence of OSD IDs, and a bug related to that was discovered and fixed only recently, so it is something worth avoiding. If you need to re-purpose OSD IDs (the new hosts have a different disk count or are otherwise different), a better way is to create a special CRUSH root as a parking lot and move the OSD IDs to this root until you re-deploy new OSDs. This way you avoid holes in the OSD ID sequence while also having only one data movement, because the CRUSH map for the original root is re-computed only once.

On our cluster I never delete OSD IDs; there is really never a reason to. I either replace disks and assign the same IDs in the same CRUSH location again, or move the IDs to a parking space (a second root) and re-use them later in new CRUSH locations. All of this is handled by ceph.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 05 December 2021 00:32
To: Janne Johansson
Cc: ceph-users
Subject: Re: Removing an OSD node the right way

Maybe if you want to reprovision the OSDs on that node, it's even faster to set noout/norebalance, purge all the OSDs, recreate them, and then unset noout/norebalance, so the rebalance happens only once. Not sure if there is any issue with this.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

On 2021. Dec 3., at 13:46, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:

On Fri, 3 Dec 2021 at 13:08, huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:

Dear Cephers,

I had to remove a failed OSD server node, and what I did was the following:

1) First, I marked all OSDs on that (to-be-removed) server down and out.
2) Second, I let Ceph do the backfilling and rebalancing and waited for it to complete.
3) Now I had full redundancy, so I deleted those removed OSDs from the cluster, e.g. "ceph osd crush remove osd.${OSD_NUM}".
4) To my surprise, after removing those already-out OSDs from the cluster, I saw tons of PGs remapped and, once again, BACKFILLING/REBALANCING.

What is the major problem with the above procedure that caused the double BACKFILLING/REBALANCING? Could the root cause be those "already-out" OSDs that were not yet removed from CRUSH? I previously thought "out" OSDs would not impact CRUSH, but it seems I was wrong.
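For concreteness, a minimal sketch of the low-movement procedures described above (the keep-the-ID disk replacement, the parking-lot root, and the noout/norebalance bracket Istvan suggests). The OSD id 17, the device /dev/sdX, the bucket name "parking" and the CRUSH weight are placeholders only, so treat this as an illustration rather than a complete runbook:

  # 1) Replace a failed disk while keeping the OSD ID (minimal data movement)
  ceph osd destroy 17 --yes-i-really-mean-it         # osd.17 is marked destroyed; its ID and CRUSH entry remain
  # swap the physical disk, then re-create the OSD with the same ID, e.g. via ceph-volume:
  ceph-volume lvm create --osd-id 17 --data /dev/sdX

  # 2) Park an OSD ID in a second CRUSH root instead of purging it
  ceph osd crush add-bucket parking root             # one-time: create the parking-lot root
  ceph osd crush create-or-move osd.17 1.0 root=parking   # the weight 1.0 is illustrative

  # 3) Re-provision a whole node with a single rebalance
  ceph osd set noout
  ceph osd set norebalance
  # ... purge and re-create the OSDs on that node ...
  ceph osd unset norebalance
  ceph osd unset noout

The flags in step 3 only postpone the data movement; the one large rebalance still happens after they are unset, which is the point of doing it in a single pass.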
If it is still in the CRUSH map with a CRUSH weight, it will "claim" that it brings space to the OSD host, even if there are no active PGs on it. That also means that the first set of movements you did more or less placed the PGs in temporary locations (as if waiting for the host to come back so they could return there later). When the host finally goes away for real, the CRUSH map is recalculated and PGs will move, not just the previously moved PGs but others too, because there now are fewer hosts and PGs should relocate anyhow.

Consider having 4 hosts and 12 PGs:

  H1    H2    H3    H4
  PG1   PG2   PG3   PG4
  PG5   PG6   PG7   PG8
  PG9   PG10  PG11  PG12

then you out the OSDs on Host4 (carrying PG4, PG8 and PG12) in order to remove it. In this first step, when you out the OSDs, PG4, PG8 and PG12 end up on H1, H2 and H3 in some order, just to "recover" them into full redundancy as a repair operation. H4 is still thought to exist, but doesn't carry any active PGs:

  H1    H2    H3    H4
  PG1   PG2   PG3   -
  PG5   PG6   PG7   -
  PG9   PG10  PG11  -
  PG4   PG8   PG12  -

When you finally make H4 go away, the new "normal" placement should probably be:

  H1    H2    H3
  PG1   PG2   PG3
  PG4   PG5   PG6
  PG7   PG8   PG9
  PG10  PG11  PG12

which means that PG5 now needs to move from H1 to H2, PG6 from H2 to H3, PG7 from H3 to H1, and so on. This is of course my simplified view of the algorithm, but in general this is how I experience it, and it shows why PG5 moves after you completely remove H4, even though PG5 wasn't involved in H4 or the first movement round at all.

So the CRUSH result while the host is still present differs from the result when it is completely gone, which is why you see two moves and why the second one can be much larger. In my example, PG4, PG8 and PG12 happened to place themselves on the "correct" hosts, but they could just as well have ended up as 12, 8, 4 instead of 4, 8, 12, and then they would move along with PG5, PG6 and so on.

On top of this, you don't have just "a PG" but replication=X copies, or K+M EC shards, of each PG, and not 12 PGs but hundreds or thousands. You often have two or 20 OSDs on every OSD host, so in the final recalculation of the CRUSH map when Host4 is gone, you might also see movement of PGs within a host, between OSDs, just because some PG is now meant to be on another drive on the same host.

--
May the most significant bit of your life be positive.
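If you want to get a feel for this second round of movement before triggering it, you can compare the placements CRUSH computes with and without the host. A rough sketch only: the file names are arbitrary, rule id 0 and 3 replicas are assumptions, and crushtool feeds synthetic inputs through the map rather than your real PGs, so the diff indicates roughly what fraction of placements would change, not an exact PG list:

  ceph osd getcrushmap -o crush.bin            # current compiled CRUSH map
  crushtool -d crush.bin -o crush.txt          # decompile to editable text
  # edit crush.txt: remove the host bucket you plan to retire (and its item in the root), then recompile
  crushtool -c crush.txt -o crush-new.bin
  crushtool -i crush.bin     --test --show-mappings --rule 0 --num-rep 3 > before.txt
  crushtool -i crush-new.bin --test --show-mappings --rule 0 --num-rep 3 > after.txt
  diff before.txt after.txt                    # differing lines are placements that would change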