Re: Best practice for expanding Ceph cluster

"huxiaoyu@xxxxxxxxxxxx" <huxiaoyu@xxxxxxxxxxxx> · Thu, 4 May 2023 13:57:20 +0200

Janne,

Thanks a lot for the detailed scheme. I totally agree that the upmap approach would be one of the best methods; however, my current cluster is running Luminous 12.2.13, and upmap does not seem to work reliably on Luminous.

samuel

huxiaoyu@xxxxxxxxxxxx

From: Janne Johansson
Date: 2023-05-04 11:56
To: huxiaoyu@xxxxxxxxxxxx
CC: ceph-users
Subject: Re: Best practice for expanding Ceph cluster
On Thu, 4 May 2023 at 10:39, huxiaoyu@xxxxxxxxxxxx
<huxiaoyu@xxxxxxxxxxxx> wrote:
> Dear Ceph folks,
>
> I am writing to ask for advice on best practice for expanding a Ceph cluster. We are running an 8-node Ceph cluster with RGW, and would like to add another 10 nodes, each of which has 10x 12TB HDDs. The current 8 nodes hold ca. 400TB of user data.
>
> I am wondering whether to add all 10 nodes in one shot and let the cluster rebalance, or to divide the work into 5 steps, each of which adds 2 nodes and rebalances before the next. I do not know what the advantages or disadvantages of the one-shot scheme would be compared with adding 5 batches of 2 nodes step by step.
>
> Any suggestions, experience sharing or advice are highly appreciated.

If you add one or two hosts, the cluster will rebalance across all
hosts to even out the data. Then you add two more and it has to even
out more or less all of the data again. Then two more, and all the old
hosts have to redo the same work once again.
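
As a rough back-of-the-envelope comparison, the sketch below estimates how much data would move in each scheme. It assumes (hypothetically) that all hosts carry equal CRUSH weight and that each expansion moves only the ideal minimum, i.e. the fraction new/(old+new) of the stored data. Real clusters tend to move more than this, so treat the numbers as a lower bound rather than a prediction.

```python
def moved_tb(data_tb, old_hosts, new_hosts):
    """Ideal data movement when growing from old_hosts to old_hosts + new_hosts.

    Assumes equal-weight hosts and perfectly minimal remapping.
    """
    return data_tb * new_hosts / (old_hosts + new_hosts)


def stepwise_moved_tb(data_tb, start_hosts, step, steps):
    """Total ideal movement when adding `step` hosts at a time, `steps` times."""
    total = 0.0
    hosts = start_hosts
    for _ in range(steps):
        total += moved_tb(data_tb, hosts, step)
        hosts += step
    return total


DATA_TB = 400  # ca. 400TB of user data, as stated in the question
one_shot = moved_tb(DATA_TB, old_hosts=8, new_hosts=10)
stepwise = stepwise_moved_tb(DATA_TB, start_hosts=8, step=2, steps=5)
print(f"one shot: {one_shot:.0f} TB, five batches: {stepwise:.0f} TB")
```

Under these assumptions the one-shot expansion moves roughly 222 TB, while five batches of two hosts move roughly 298 TB in total, because data that landed on earlier batches is partly moved again by later ones.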

I would suggest that you add all new hosts and make the OSDs start
with a super-low initial weight (0.0001 or so), which means they will
be in and up, but not receive any PGs.

Then you set "noout" and "norebalance" and run "ceph osd crush
reweight" on the new OSDs to bring them to their correct size, perhaps
with a sleep 30 in between or so, to let the dust settle after you
change weights.
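
The flag and reweight steps above can be sketched as a small command generator. This is only an illustration: the OSD ids are hypothetical, and the target weight assumes the common convention that a CRUSH weight equals the device capacity in TiB. The script prints the commands rather than running them, so they can be reviewed first.

```python
def tb_to_crush_weight(size_tb):
    """CRUSH weight by the usual convention: capacity expressed in TiB."""
    return round(size_tb * 10**12 / 2**40, 3)


def reweight_plan(osd_ids, size_tb, pause_s=30):
    """Build the command list: set flags, then reweight each OSD with a pause."""
    weight = tb_to_crush_weight(size_tb)
    cmds = ["ceph osd set noout", "ceph osd set norebalance"]
    for osd_id in osd_ids:
        cmds.append(f"ceph osd crush reweight osd.{osd_id} {weight}")
        cmds.append(f"sleep {pause_s}")  # let the dust settle between changes
    return cmds


# Hypothetical ids for the new OSDs; check `ceph osd tree` for the real ones.
for cmd in reweight_plan(range(80, 83), size_tb=12):
    print(cmd)
```

Printing instead of executing keeps the sketch safe to try; the output can be piped to a shell once it looks right.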

After all new OSDs are of the correct crush weight, there will be a
lot of PGs misplaced/remapped but not moving. Now you grab one of the
programs/scripts[1] which talks to upmap and tells it that every
misplaced PG actually is where you want it to be. You might need to
run it several times, but it usually goes quite fast on the second or
third run. Even if it never gets 100% of the PGs happy, it is quite
sufficient if 95-99% of them think they are in their correct place.
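
To illustrate the idea behind these tools, here is a simplified sketch, not the actual logic of any of the scripts in [1]. For a misplaced PG it compares the "up" set (where CRUSH now wants the PG) with the "acting" set (where the data currently lives) and emits a "ceph osd pg-upmap-items" command that maps each newly wanted OSD back to the OSD still holding the data. The PG id and OSD numbers are made up, and the set-difference pairing assumes a replicated pool.

```python
def upmap_pairs(up, acting):
    """Pair OSDs that CRUSH now wants (up) with OSDs that hold the data (acting)."""
    wanted = [osd for osd in up if osd not in acting]
    holding = [osd for osd in acting if osd not in up]
    return list(zip(wanted, holding))


def upmap_command(pgid, up, acting):
    """Return the pg-upmap-items command for one PG, or None if nothing to do."""
    pairs = upmap_pairs(up, acting)
    if not pairs:
        return None  # PG is already where CRUSH wants it
    args = " ".join(f"{src} {dst}" for src, dst in pairs)
    return f"ceph osd pg-upmap-items {pgid} {args}"


# Hypothetical PG: CRUSH wants new osd.20, but the data still sits on osd.3.
print(upmap_command("11.2f", up=[1, 2, 20], acting=[1, 2, 3]))
```

With such an entry in place the PG counts as correctly placed, so nothing moves until the entry is removed again.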

Now, if you enable the ceph balancer (or already have it enabled) in
upmap mode and unset "noout" and "norebalance" the mgr balancer will
take a certain amount of PGs (about 5% by default[2]) and remove the
temporary "upmap" setting that says the PG is at the right place even
when it isn't. This means that the balancer takes a small amount of
PGs, lets them move to where they actually want to be, then picks a
few more PGs and repeats until the final destination is correct for
all PGs, evened out on all OSDs as you wanted.
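
The hand-over to the balancer might look like the following sketch. It assumes that all clients are Luminous or newer, which upmap requires, and it treats the throttle from [2] as optional. The "ceph config set" form of the throttle applies to newer releases; older releases such as Luminous configure it differently, so check the documentation for the version in use.

```python
def start_balancer_plan(max_misplaced=None):
    """Build the commands that enable the upmap balancer and release the flags."""
    cmds = [
        # upmap entries are only understood by Luminous or newer clients
        "ceph osd set-require-min-compat-client luminous",
        "ceph balancer mode upmap",
        "ceph balancer on",
    ]
    if max_misplaced is not None:
        # Optional throttle: fraction of PGs allowed to be misplaced at once
        cmds.append(
            f"ceph config set mgr target_max_misplaced_ratio {max_misplaced}"
        )
    cmds += ["ceph osd unset norebalance", "ceph osd unset noout"]
    return cmds


for cmd in start_balancer_plan():
    print(cmd)
```

Progress can then be followed with "ceph balancer status" and "ceph -s" while the misplaced percentage shrinks.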

This is the method that I think has the least impact on client IO,
scrubs and all that. It should be quite safe, but it will take a while
in calendar time to finish. The best part is that the admin work
needed comes only at the beginning; the rest is automatic.

[1] Tools:
https://raw.githubusercontent.com/HeinleinSupport/cern-ceph-scripts/master/tools/upmap/upmap-remapped.py
https://github.com/digitalocean/pgremapper
I think this one works too, haven't tried it:
https://github.com/TheJJ/ceph-balancer

[2] Percent to have moving at any moment:
https://docs.ceph.com/en/latest/rados/operations/balancer/#throttling

-- 
May the most significant bit of your life be positive.
