Re: ceph cluster extremely unbalanced

Hi Denis,

As the vast majority of OSDs have bluestore_min_alloc_size = 65536, I
think you can safely ignore https://tracker.ceph.com/issues/64715. The
only consequence will be that the remaining 58 OSDs will end up less
full than the others.
In other words, please use either the hybrid approach or the built-in
balancer right away.
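
If you want to double-check which OSDs those 58 are, something along
these lines should work - a sketch that assumes jq is installed, and
relies on bluestore_min_alloc_size being visible in "ceph osd metadata"
(which is what step 1 of my earlier mail uses anyway):

# list the IDs of OSDs whose bluestore_min_alloc_size is not 64K
ceph osd metadata | \
    jq -r '.[] | select((.bluestore_min_alloc_size|tostring) != "65536") | .id'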

As for migrating to the modern 4K default for bluestore_min_alloc_size,
yes, recreating OSDs host-by-host (once you have the cluster balanced)
is the only way. You can keep using the built-in balancer while doing
that.
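
In case a concrete example helps, this is roughly what recreating a
single OSD looks like on a host where OSDs are managed with ceph-volume.
It is only a sketch: osd.<id> and /dev/sdX are placeholders, and if the
cluster is managed by cephadm you would use something like
"ceph orch osd rm <id> --replace --zap" instead of the manual steps.

# drain the OSD and wait for backfill to finish
ceph osd out <id>
# proceed only once this reports it is safe
ceph osd safe-to-destroy <id>
# stop it, destroy it (keeping the ID), and wipe the disk
systemctl stop ceph-osd@<id>
ceph osd destroy <id> --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX
# recreate it with the same ID; the new OSD picks up the current 4K default
ceph-volume lvm create --osd-id <id> --data /dev/sdX

Then repeat for the remaining OSDs on the host before moving on to the
next host.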

On Mon, Mar 25, 2024 at 5:04 PM Denis Polom <denispolom@xxxxxxxxx> wrote:
>
> Hi Alexander,
>
> that sounds pretty promising to me.
>
> I've checked bluestore_min_alloc_size, and most of the OSDs (1370 of
> them) have the value 65536.
>
> You mentioned: "You will have to do that weekly until you redeploy all
> OSDs that were created with 64K bluestore_min_alloc_size"
>
> Is recreating each OSD the only way to approach this?
>
> Thank you for the reply.
>
> dp
>
> On 3/24/24 12:44 PM, Alexander E. Patrakov wrote:
> > Hi Denis,
> >
> > My approach would be as follows (there is a rough command sketch
> > right after the list):
> >
> > 1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K
> > bluestore_min_alloc_size. If so, you cannot really use the built-in
> > balancer, as it would result in a bimodal distribution instead of a
> > proper balance (see https://tracker.ceph.com/issues/64715) - but let's
> > ignore this little issue if you have enough free space.
> > 2. Change the weights as appropriate. Make absolutely sure that there
> > are no reweights other than 1.0. Delete all dead or destroyed OSDs
> > from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL
> > warnings that appear; they will be gone during the next step.
> > 3. Run this little script from CERN to stop the data movement that was
> > just initiated:
> > https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py,
> > and pipe its output to bash. This should cancel most of the data movement,
> > but not all - the script cannot stop the situation when two OSDs want
> > to exchange their erasure-coded shards, like this: [1,2,3,4] ->
> > [1,3,2,4].
> > 4. Set the "target max misplaced ratio" option for MGR to what you
> > think is appropriate. The default is 0.05, and this means that the
> > balancer will allow at most 5% of the PGs to participate in the data
> > movement. I suggest starting with 0.01 and increasing it if the
> > balancing has no visible impact on the client traffic.
> > 5. Enable the balancer.
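> >
> > For concreteness, the steps above boil down to commands along these
> > lines. Please treat this as a sketch rather than something to paste
> > blindly: osd.<id> is a placeholder, 3.63898 is simply the nominal
> > weight of your disks, the jq one-liner assumes jq is installed, and
> > you should review what upmap-remapped.py prints before piping it to
> > bash.
> >
> > # 1. tally bluestore_min_alloc_size across all OSDs
> > ceph osd metadata | jq -r '.[].bluestore_min_alloc_size' | sort | uniq -c
> >
> > # 2. even out the CRUSH weights and purge dead/destroyed OSDs
> > ceph osd crush reweight osd.<id> 3.63898
> > ceph osd purge <id> --yes-i-really-mean-it
> >
> > # 3. cancel most of the data movement that step 2 initiated
> > curl -LO https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py
> > python3 upmap-remapped.py | bash
> >
> > # 4. throttle the balancer before turning it on
> > ceph config set mgr target_max_misplaced_ratio 0.01
> >
> > # 5. enable the balancer in upmap mode and keep an eye on it
> > ceph balancer mode upmap
> > ceph balancer on
> > ceph balancer status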
> >
> > If you think that https://tracker.ceph.com/issues/64715 is a problem
> > that would prevent you from using the built-in balancer:
> >
> > 4. Download this script:
> > https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> > 5. Run it as follows: ./placementoptimizer.py -v balance --osdsize
> > device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash
> >
> > This will move at most 500 PGs to better places, starting with the
> > fullest OSDs. All weights are ignored, and the switches take care of
> > the bluestore_min_alloc_size overhead mismatch. You will have to do
> > that weekly until you redeploy all OSDs that were created with 64K
> > bluestore_min_alloc_size.
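> >
> > If you want to review a run before it changes anything, the same
> > invocation without the final "| bash" just prints the planned upmap
> > commands instead of applying them:
> >
> > ./placementoptimizer.py -v balance --osdsize device --osdused delta \
> >     --max-pg-moves 500 --osdfrom fullest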
> >
> > A hybrid approach (initial round of balancing with TheJJ, then switch
> > to the built-in balancer) may also be viable.
> >
> > On Sun, Mar 24, 2024 at 7:09 PM Denis Polom <denispolom@xxxxxxxxx> wrote:
> >> Hi guys,
> >>
> >> recently I took over the care of a Ceph cluster that is extremely
> >> unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus ->
> >> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.
> >>
> >> The CRUSH failure domain is datacenter (there are 3), and the data pool
> >> is EC 3+3.
> >>
> >> This cluster has had the balancer disabled for years and was "balanced"
> >> manually by changing OSD crush weights. It is now a complete mess, and I
> >> would like to set all OSD crush weights to the same value (3.63898) and
> >> enable the balancer with upmap.
> >>
> >> From `ceph osd df`, sorted from the least used to the most used OSDs:
> >>
> >> ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
> >> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
> >>                         TOTAL  5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
> >> 428  hdd    3.63898   1.00000  3.6 TiB  2.0 TiB  2.0 TiB    1 KiB  5.6 GiB  1.7 TiB  54.55  0.76   96      up
> >> 223  hdd    3.63898   1.00000  3.6 TiB  2.0 TiB  2.0 TiB    3 KiB  5.6 GiB  1.7 TiB  54.58  0.76   95      up
> >> ...
> >> 591  hdd    3.53999   1.00000  3.6 TiB  3.0 TiB  3.0 TiB    1 KiB  7.0 GiB  680 GiB  81.74  1.14  125      up
> >> 832  hdd    3.59999   1.00000  3.6 TiB  3.0 TiB  3.0 TiB    4 KiB  6.9 GiB  680 GiB  81.75  1.14  114      up
> >> 248  hdd    3.63898   1.00000  3.6 TiB  3.0 TiB  3.0 TiB    3 KiB  7.2 GiB  646 GiB  82.67  1.16  121      up
> >> 559  hdd    3.63799   1.00000  3.6 TiB  3.0 TiB  3.0 TiB      0 B  7.0 GiB  644 GiB  82.70  1.16  123      up
> >>                         TOTAL  5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
> >> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
> >>
> >>
> >> crush rule:
> >>
> >> {
> >>       "rule_id": 10,
> >>       "rule_name": "ec33hdd_rule",
> >>       "type": 3,
> >>       "steps": [
> >>           {
> >>               "op": "set_chooseleaf_tries",
> >>               "num": 5
> >>           },
> >>           {
> >>               "op": "set_choose_tries",
> >>               "num": 100
> >>           },
> >>           {
> >>               "op": "take",
> >>               "item": -2,
> >>               "item_name": "default~hdd"
> >>           },
> >>           {
> >>               "op": "choose_indep",
> >>               "num": 3,
> >>               "type": "datacenter"
> >>           },
> >>           {
> >>               "op": "choose_indep",
> >>               "num": 2,
> >>               "type": "osd"
> >>           },
> >>           {
> >>               "op": "emit"
> >>           }
> >>       ]
> >> }
> >>
> >> My question is: what would be the proper and safest way to make this
> >> happen?
> >>
> >> * should I first enable the balancer, let it do its work, and only
> >> after that change the OSD crush weights to be even?
> >>
> >> * or the other way around - first make the crush weights even and then
> >> enable the balancer?
> >>
> >> * or is there another safe(r) way?
> >>
> >> What are the ideal balancer settings for that?
> >>
> >> I'm expecting a large data movement, and this is a production cluster.
> >>
> >> I'm also afraid that during the balancing, or while changing the crush
> >> weights, some OSDs will become full. I've run into that already and had
> >> to move some PGs manually to other OSDs in the same failure domain.
> >>
> >>
> >> I would appreciate any suggestion on that.
> >>
> >> Thank you!
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >



-- 
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



