Re: ceph cluster extremely unbalanced

"Alexander E. Patrakov" <patrakov@xxxxxxxxx> · Sun, 24 Mar 2024 19:44:38 +0800

Hi Denis,

My approach would be:

1. Run "ceph osd metadata" and check whether you have a mix of 64K and
4K bluestore_min_alloc_size. If so, you cannot really use the built-in
balancer, as it would produce a bimodal distribution instead of a
proper balance (see https://tracker.ceph.com/issues/64715). If you have
enough free space, though, you can ignore this little issue.
2. Change the CRUSH weights as appropriate. Make absolutely sure that
there are no reweights other than 1.0. Delete all dead or destroyed
OSDs from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL
warnings that appear; they will go away during the next step.
3. Run this little script from CERN to stop the data movement that was
just initiated:
https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py
and pipe its output to bash. This should cancel most of the data
movement, but not all: the script cannot handle the case where two OSDs
want to exchange their erasure-coded shards, like this: [1,2,3,4] ->
[1,3,2,4].
4. Set the target_max_misplaced_ratio option for the MGR to what you
think is appropriate. The default is 0.05, which means that the
balancer will allow at most 5% of the PGs to take part in data
movement at any one time. I suggest starting with 0.01 and increasing
it if the balancing has no visible impact on client traffic.
5. Enable the balancer.
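
The checks in steps 1 and 2 can be done mechanically. Below is a
minimal Python sketch, not a tested tool. It assumes that the JSON
output of "ceph osd metadata" carries a bluestore_min_alloc_size field
for each OSD and that "ceph osd df --format json" contains a "nodes"
list with "id" and "reweight" fields. The function names and the sample
data are hypothetical and only stand in for real command output.

```python
import json
from collections import defaultdict


def group_by_min_alloc_size(osd_metadata):
    """Map each bluestore_min_alloc_size value to the sorted OSD ids reporting it."""
    groups = defaultdict(list)
    for osd in osd_metadata:
        # OSDs that do not report the field are grouped under "unknown".
        size = osd.get("bluestore_min_alloc_size", "unknown")
        groups[str(size)].append(osd["id"])
    return {size: sorted(ids) for size, ids in groups.items()}


def osds_with_reweight(osd_df, tolerance=1e-6):
    """Return (id, reweight) pairs for OSDs whose reweight differs from 1.0."""
    return [
        (node["id"], node["reweight"])
        for node in osd_df["nodes"]
        if abs(node["reweight"] - 1.0) > tolerance
    ]


# Hypothetical sample data (assumed field names, made-up values).
sample_metadata = json.loads("""[
  {"id": 0, "bluestore_min_alloc_size": "65536"},
  {"id": 1, "bluestore_min_alloc_size": "4096"},
  {"id": 2, "bluestore_min_alloc_size": "4096"}
]""")
sample_df = json.loads("""{"nodes": [
  {"id": 0, "crush_weight": 3.63898, "reweight": 1.0},
  {"id": 1, "crush_weight": 3.53999, "reweight": 0.95}
]}""")

groups = group_by_min_alloc_size(sample_metadata)
print("min_alloc_size groups:", groups)
print("mixed allocation sizes:", len(groups) > 1)
print("OSDs with reweight != 1.0:", osds_with_reweight(sample_df))
```

On a real cluster you would load the output of "ceph osd metadata" and
"ceph osd df --format json" with json.load() and pass it to these
functions instead of the sample data. More than one group in the first
result means a mix of allocation sizes; a non-empty second result means
there are reweights that still need to be reset to 1.0.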

If you think that https://tracker.ceph.com/issues/64715 is a problem
that would prevent you from using the built-in balancer, replace steps
4 and 5 above with the following:

4. Download this script:
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
5. Run it as follows:

./placementoptimizer.py -v balance --osdsize device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash

This will move at most 500 PGs to better places, starting with the
fullest OSDs. All weights are ignored, and the switches take care of
the bluestore_min_alloc_size overhead mismatch. You will have to repeat
this weekly until you have redeployed all OSDs that were created with
64K bluestore_min_alloc_size.
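
To judge between the weekly runs whether the spread is actually
shrinking, something like the following sketch could be used. It is an
illustration under assumptions, not part of either script: it assumes
the "nodes" entries of "ceph osd df --format json" carry "id", "kb" and
"kb_used" fields, the function name is hypothetical, and the standard
deviation here is a plain population deviation around the cluster
average, which may not match the STDDEV figure printed by Ceph exactly.

```python
import math


def utilization_spread(osd_df):
    """Summarize how unevenly data is spread across OSDs."""
    nodes = [n for n in osd_df["nodes"] if n["kb"] > 0]
    if not nodes:
        raise ValueError("no OSDs with non-zero size in the input")
    # Cluster-wide average utilization in percent: total used / total size.
    average = 100.0 * sum(n["kb_used"] for n in nodes) / sum(n["kb"] for n in nodes)
    # Per-OSD utilization in percent, keyed by OSD id.
    percents = {n["id"]: 100.0 * n["kb_used"] / n["kb"] for n in nodes}
    deviation = math.sqrt(
        sum((p - average) ** 2 for p in percents.values()) / len(percents)
    )
    fullest = max(percents, key=percents.get)
    emptiest = min(percents, key=percents.get)
    return {
        "average": average,
        "min_var": percents[emptiest] / average,
        "max_var": percents[fullest] / average,
        "stddev": deviation,
        "fullest_osd": fullest,
        "emptiest_osd": emptiest,
    }


# Hypothetical sample data (made-up sizes, in KiB).
sample_df = {"nodes": [
    {"id": 0, "kb": 1000, "kb_used": 500},
    {"id": 1, "kb": 1000, "kb_used": 700},
    {"id": 2, "kb": 1000, "kb_used": 900},
]}

summary = utilization_spread(sample_df)
print(summary)
```

If min_var and max_var move towards 1.0 from one week to the next, the
rounds are doing their job; once they stop improving, another round is
unlikely to help much.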

A hybrid approach (initial round of balancing with TheJJ, then switch
to the built-in balancer) may also be viable.

On Sun, Mar 24, 2024 at 7:09 PM Denis Polom <denispolom@xxxxxxxxx> wrote:
>
> Hi guys,
>
> recently I took over the care of a Ceph cluster that is extremely
> unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus ->
> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.
>
> The CRUSH failure domain is datacenter (there are 3), and the data pool is EC 3+3.
>
> This cluster has had the balancer disabled for years, and was "balanced"
> manually by changing the OSDs' CRUSH weights. So now it is a complete mess,
> and I would like to give all OSDs the same CRUSH weight (3.63898) and
> enable the balancer in upmap mode.
>
> From `ceph osd df`, sorted from the least used to the most used OSDs:
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
>                         TOTAL  5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
> 428  hdd    3.63898  1.00000   3.6 TiB  2.0 TiB  2.0 TiB  1 KiB    5.6 GiB  1.7 TiB  54.55  0.76   96  up
> 223  hdd    3.63898  1.00000   3.6 TiB  2.0 TiB  2.0 TiB  3 KiB    5.6 GiB  1.7 TiB  54.58  0.76   95  up
> ...
> 591  hdd    3.53999  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  1 KiB    7.0 GiB  680 GiB  81.74  1.14  125  up
> 832  hdd    3.59999  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  4 KiB    6.9 GiB  680 GiB  81.75  1.14  114  up
> 248  hdd    3.63898  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  3 KiB    7.2 GiB  646 GiB  82.67  1.16  121  up
> 559  hdd    3.63799  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  0 B      7.0 GiB  644 GiB  82.70  1.16  123  up
>                         TOTAL  5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
>
>
> crush rule:
>
> {
>      "rule_id": 10,
>      "rule_name": "ec33hdd_rule",
>      "type": 3,
>      "steps": [
>          {
>              "op": "set_chooseleaf_tries",
>              "num": 5
>          },
>          {
>              "op": "set_choose_tries",
>              "num": 100
>          },
>          {
>              "op": "take",
>              "item": -2,
>              "item_name": "default~hdd"
>          },
>          {
>              "op": "choose_indep",
>              "num": 3,
>              "type": "datacenter"
>          },
>          {
>              "op": "choose_indep",
>              "num": 2,
>              "type": "osd"
>          },
>          {
>              "op": "emit"
>          }
>      ]
> }
>
> My question is: what would be the proper and safest way to make this happen?
>
> * should I first enable balancer and let it do its work and after that
> change the OSDs crush weights to be even?
>
> * or the other way around: first make the crush weights even and then
> enable the balancer?
>
> * or is there another safe(r) way?
>
> What are the ideal balancer settings for that?
>
> I'm expecting a large data movement, and this is a production cluster.
>
> I'm also afraid that some OSDs will become full during the balancing or
> while changing crush weights. I've tried that already and had to move
> some PGs manually to other OSDs in the same failure domain.
>
>
> I would appreciate any suggestion on that.
>
> Thank you!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
Alexander E. Patrakov