Re: Best practice for expanding Ceph cluster

Hi Samuel,

Both pgremapper and the CERN scripts were developed against Luminous,
and in my experience 12.2.13 has all of the upmap patches needed for
the scheme that Janne outlined to work. However, if you have a complex
CRUSH map sometimes the upmap balancer can struggle, and I think
that's true of any release so far.
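
If you want to sanity-check upmap support before starting, something
like this should work on 12.2.13 (a quick sketch; make sure no
pre-luminous clients are still connecting first):

    ceph features                                    # shows which feature sets connected clients support
    ceph osd set-require-min-compat-client luminous  # required before pg-upmap-items can be used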

Josh

On Thu, May 4, 2023 at 5:58 AM huxiaoyu@xxxxxxxxxxxx
<huxiaoyu@xxxxxxxxxxxx> wrote:
>
> Janne,
>
> thanks a lot for the detailed scheme. I totally agree that the upmap approach would be one of the best methods; however, my current cluster is running Luminous 12.2.13, and upmap does not seem to work reliably on Luminous.
>
> samuel
>
>
>
> huxiaoyu@xxxxxxxxxxxx
>
> From: Janne Johansson
> Date: 2023-05-04 11:56
> To: huxiaoyu@xxxxxxxxxxxx
> CC: ceph-users
> Subject: Re:  Best practice for expanding Ceph cluster
> Den tors 4 maj 2023 kl 10:39 skrev huxiaoyu@xxxxxxxxxxxx
> <huxiaoyu@xxxxxxxxxxxx>:
> > Dear Ceph folks,
> >
> > I am writing to ask for advice on best practices for expanding a Ceph cluster. We are running an 8-node Ceph cluster with RGW, and would like to add another 10 nodes, each of which has 10x 12TB HDDs. The current 8 nodes hold ca. 400TB of user data.
> >
> > I am wondering whether to add all 10 nodes in one shot and let the cluster rebalance, or to divide the expansion into 5 steps, adding 2 nodes and rebalancing at each step. I do not know what the advantages or disadvantages of the one-shot scheme would be versus 5 batches of adding 2 nodes step by step.
> >
> > Any suggestions, shared experience, or advice would be highly appreciated.
>
> If you add only one or two hosts, the cluster rebalances across all
> hosts to even out the data. Then you add two more and it has to even
> out all the data again, more or less, and two more after that means
> the old hosts redo much of the same work yet again. Roughly speaking,
> each intermediate step moves a share of the data proportional to the
> capacity you just added, so five small steps end up shuffling
> noticeably more data in total than a single expansion.
>
> I would suggest that you add all new hosts and make the OSDs start
> with a super-low initial weight (0.0001 or so), which means they will
> be in and up, but not receive any PGs.
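>
> As a rough sketch (assuming you can edit ceph.conf on the new hosts
> before the OSDs are created), the low initial weight can be set like
> this:
>
>   # ceph.conf on the new OSD hosts, set before deploying the OSDs
>   [osd]
>   osd crush initial weight = 0.0001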
>
> Then you set "noout" and "norebalance" and ceph osd crush reweight
> the new OSDs to their correct size, perhaps with a "sleep 30" or so
> in between, to let the dust settle after each weight change.
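>
> Something along these lines; the OSD ids 80-179 are made up for the
> example, and ~10.9 is roughly the crush weight (size in TiB) of a
> 12 TB drive:
>
>   ceph osd set noout
>   ceph osd set norebalance
>   # bring each new OSD up to its real weight, one at a time
>   for id in $(seq 80 179); do
>       ceph osd crush reweight osd.$id 10.9
>       sleep 30
>   done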
>
> After all new OSDs have their correct crush weight, there will be a
> lot of PGs misplaced/remapped but not moving. Now you grab one of the
> programs/scripts[1] which talk to upmap and tell the cluster that
> every misplaced PG actually is where you want it to be. You might
> need to run it several times, but it usually goes quite fast on the
> second/third run. Even if it never gets 100% of the PGs happy, it is
> quite sufficient if 95-99% think they are at their correct place.
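>
> With the CERN script, the usual pattern (as far as I recall from its
> README) is to let it print the upmap commands and pipe them straight
> into a shell, repeating until the misplaced count stops shrinking:
>
>   ./upmap-remapped.py | sh      # emits "ceph osd pg-upmap-items ..." commands
>   ceph status | grep misplaced  # check progress, rerun if needed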
>
> Now, if you enable the ceph balancer (or already have it enabled) in
> upmap mode and unset "noout" and "norebalance", the mgr balancer will
> take a certain number of PGs (some 3% by default[2]) and remove the
> temporary "upmap" entries that say a PG is at the right place even
> when it isn't. This means that the balancer takes a small number of
> PGs, lets them move to where they actually want to be, then picks a
> few more PGs and repeats until the final destination is correct for
> all PGs, evened out across all OSDs as you wanted.
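>
> In command form, that step is roughly:
>
>   ceph balancer mode upmap
>   ceph balancer on
>   ceph osd unset norebalance
>   ceph osd unset noout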
>
> This is the method that I think has the least impact on client IO,
> scrubs and all that; it should be quite safe, but it will take a
> while in calendar time to finish. The best part is that the admin
> work is only needed at the beginning; the rest is automatic.
>
> [1] Tools:
> https://raw.githubusercontent.com/HeinleinSupport/cern-ceph-scripts/master/tools/upmap/upmap-remapped.py
> https://github.com/digitalocean/pgremapper
> I think this one works too, haven't tried it:
> https://github.com/TheJJ/ceph-balancer
>
> [2] Percent to have moving at any moment:
> https://docs.ceph.com/en/latest/rados/operations/balancer/#throttling
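>
> On recent releases that ratio is adjusted like below; on Luminous the
> balancer exposes the same knob differently, so check the docs for
> your version:
>
>   ceph config set mgr target_max_misplaced_ratio 0.03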
>
> --
> May the most significant bit of your life be positive.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



