Janne, thanks a lot for the detailed scheme. I fully agree that the upmap approach would be one of the best methods. However, my current cluster is running Luminous 12.2.13, and upmap does not seem to work reliably on Luminous.

samuel

huxiaoyu@xxxxxxxxxxxx

From: Janne Johansson
Date: 2023-05-04 11:56
To: huxiaoyu@xxxxxxxxxxxx
CC: ceph-users
Subject: Re: Best practice for expanding Ceph cluster

On Thu, 4 May 2023 at 10:39, huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:
> Dear Ceph folks,
>
> I am writing to ask for advice on best practice for expanding a Ceph cluster. We are running an 8-node Ceph cluster with RGW, and would like to add another 10 nodes, each of which has 10x 12TB HDDs. The current 8 nodes hold ca. 400TB of user data.
>
> I am wondering whether to add all 10 nodes in one shot and let the cluster rebalance, or to split the work into 5 steps, each adding 2 nodes and rebalancing step by step. I do not know what the advantages or disadvantages of the one-shot scheme would be versus 5 batches of 2 nodes added step by step.
>
> Any suggestions, shared experience, or advice are highly appreciated.

If you add one or two hosts, the cluster will rebalance across all hosts to even out the data. Then you add two more and it has to even out all the data again, more or less. Then two more, and all the old hosts have to redo the same work again.

I would suggest that you add all new hosts and make the OSDs start with a super-low initial weight (0.0001 or so), which means they will be in and up, but not receive any PGs.

Then you set "noout" and "norebalance" and use "ceph osd crush reweight" to bring the new OSDs to their correct size, perhaps with a "sleep 30" or so in between, to let the dust settle after you change weights.

After all new OSDs have the correct crush weight, there will be a lot of PGs misplaced/remapped but not moving. Now you grab one of the programs/scripts[1] which talks to upmap and tells it that every misplaced PG actually is where you want it to be.
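
The flag-setting and reweighting steps described above could be scripted roughly as follows. This is only a sketch: it builds the command sequence as strings and does not run anything. The OSD ids 80-82 are made up for illustration, and the target weight assumes the common convention that a CRUSH weight equals the device size in TiB (about 10.91 for a 12 TB disk); check both against your own cluster before use.

```python
def tib_weight(size_bytes):
    """CRUSH weight for a device, assuming the weight = size-in-TiB convention."""
    return size_bytes / 2**40


def reweight_plan(osd_ids, target_weight, pause_seconds=30):
    """Return the shell commands for the reweight phase, in order.

    Nothing is executed here; review the list, then run it by hand or
    feed it to a shell.
    """
    # Stop the cluster from marking OSDs out or moving data while weights change.
    commands = ["ceph osd set noout", "ceph osd set norebalance"]
    for osd_id in osd_ids:
        commands.append(
            f"ceph osd crush reweight osd.{osd_id} {target_weight:.4f}"
        )
        # Pause between changes to let the dust settle.
        commands.append(f"sleep {pause_seconds}")
    return commands


if __name__ == "__main__":
    # Hypothetical ids of three of the new OSDs, each a 12 TB HDD.
    for line in reweight_plan([80, 81, 82], tib_weight(12e12)):
        print(line)
```

Printing the plan first, instead of executing it directly, leaves room to check the OSD ids and weights before any command touches the cluster.
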
You might need to run it several times, but it usually goes quite fast on the second or third run. Even if it never gets 100% of the PGs happy, it is quite sufficient if 95-99% think they are in their correct place.

Now, if you enable the ceph balancer (or already have it enabled) in upmap mode and unset "noout" and "norebalance", the mgr balancer will take a certain amount of PGs (some 5% by default[2]) and remove the temporary "upmap" setting that says the PG is in the right place even when it isn't. This means that the balancer takes a small amount of PGs, lets them move to where they actually want to be, then picks a few more PGs and repeats until the final destination is correct for all PGs, evened out on all OSDs as you wanted.

This is the method that I think has the least impact on client IO, scrubs and all that. It should be quite safe, but it will take a while in calendar time to finish. The best part is that the admin work is needed only at the beginning; the rest is automatic.

[1] Tools:
https://raw.githubusercontent.com/HeinleinSupport/cern-ceph-scripts/master/tools/upmap/upmap-remapped.py
https://github.com/digitalocean/pgremapper
I think this one works too, haven't tried it:
https://github.com/TheJJ/ceph-balancer

[2] Percent to have moving at any moment:
https://docs.ceph.com/en/latest/rados/operations/balancer/#throttling

--
May the most significant bit of your life be positive.
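
For the last phase described above (balancer in upmap mode, then unsetting the flags), the commands could be collected like this. Again a sketch that only builds strings. The optional throttle line uses "ceph config set mgr target_max_misplaced_ratio", which is the form shown in the balancer documentation linked in [2]; it assumes a release with the centralized config store (Mimic or later), so it would not apply as written to a Luminous cluster.

```python
def release_plan(max_misplaced_ratio=None):
    """Return the shell commands that hand the data movement to the balancer.

    max_misplaced_ratio is optional, e.g. 0.07 for 7%; when None, the
    cluster's current throttle setting is left untouched.
    """
    commands = ["ceph balancer mode upmap", "ceph balancer on"]
    if max_misplaced_ratio is not None:
        # Assumes Mimic or later; Luminous has no "ceph config set".
        commands.append(
            f"ceph config set mgr target_max_misplaced_ratio {max_misplaced_ratio}"
        )
    # Clear the flags set before the reweight so PGs may start moving.
    commands += ["ceph osd unset norebalance", "ceph osd unset noout"]
    return commands


if __name__ == "__main__":
    for line in release_plan(max_misplaced_ratio=0.07):
        print(line)
```

Note that upmap entries are only accepted once the cluster requires Luminous or newer clients ("ceph osd set-require-min-compat-client luminous"), so that prerequisite needs checking before any of this.
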