Hi,
I would expect that almost every PG in the cluster is going to have to
move once you start standardizing CRUSH weights, and I wouldn't want to
move data twice. My plan would look something like this (a rough command
sketch follows the list):
- Make sure the cluster is healthy (no degraded PGs)
- Set nobackfill, norebalance flags to prevent any data from moving
- Set your CRUSH weights (this will cause PGs to re-peer, which will
stall IO during the peering process; I think this could be done in one
large operation/osdmap update by changing the CRUSH map directly)
- Wait for peering to settle and IO rates to recover
- Use pgremapper[1] to cancel backfill, which will insert upmaps to keep
the data where it is today (pgremapper cancel-backfill --verbose --yes)
- You could simply enable the balancer at this point if you want a "set
it and forget it" type of thing, or if you want more control you can use
pgremapper undo-upmaps in a loop
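For reference, a rough, untested sketch of the commands behind those
steps (the file names are arbitrary and the osd.<id> line is a
placeholder; the pgremapper invocation is the one mentioned above):

   # freeze data movement
   ceph osd set nobackfill
   ceph osd set norebalance

   # set all CRUSH weights in a single osdmap update by editing the map directly
   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   #   (edit crushmap.txt so every OSD item weight is 3.63898, then recompile)
   crushtool -c crushmap.txt -o crushmap.new
   ceph osd setcrushmap -i crushmap.new
   #   per-OSD alternative: ceph osd crush reweight osd.<id> 3.63898

   # after peering settles, pin PGs to the OSDs they are on today
   pgremapper cancel-backfill --verbose --yes

   # then let data move at a controlled pace
   ceph osd unset norebalance
   ceph osd unset nobackfill
   ceph balancer mode upmap
   ceph balancer on          # or drive it yourself with pgremapper undo-upmaps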
With a ~5 PiB cluster this is going to take a while, and I'd expect to
lose some drives while data is moving.
[1] https://github.com/digitalocean/pgremapper
On 2024-03-24 08:06, Denis Polom wrote:
Hi guys,
recently I took over care of a Ceph cluster that is extremely
unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus ->
Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on
it.
The CRUSH failure domain is datacenter (there are 3), and the data pool
is EC 3+3.
This cluster has had the balancer disabled for years and was "balanced"
manually by changing OSD CRUSH weights. It is now a complete mess, and I
would like to set all OSD CRUSH weights to the same value (3.63898) and
enable the balancer with upmap.
From `ceph osd df`, sorted from least used to most used OSD:
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
428  hdd    3.63898  1.00000   3.6 TiB  2.0 TiB  2.0 TiB  1 KiB    5.6 GiB  1.7 TiB  54.55  0.76   96  up
223  hdd    3.63898  1.00000   3.6 TiB  2.0 TiB  2.0 TiB  3 KiB    5.6 GiB  1.7 TiB  54.58  0.76   95  up
...
591  hdd    3.53999  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  1 KiB    7.0 GiB  680 GiB  81.74  1.14  125  up
832  hdd    3.59999  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  4 KiB    6.9 GiB  680 GiB  81.75  1.14  114  up
248  hdd    3.63898  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  3 KiB    7.2 GiB  646 GiB  82.67  1.16  121  up
559  hdd    3.63799  1.00000   3.6 TiB  3.0 TiB  3.0 TiB  0 B      7.0 GiB  644 GiB  82.70  1.16  123  up
                     TOTAL     5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
crush rule:
{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}
My question is: what would be the proper and safest way to make this
happen?
* should I first enable the balancer, let it do its work, and only then
change the OSD CRUSH weights to be even?
* or the other way around - first make the CRUSH weights even and then
enable the balancer?
* or is there another safe(r) way?
What are the ideal balancer settings for that?
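The kind of settings I have in mind are, for example (standard
mgr/balancer options; the values are just a starting point, not
something I have tested on this cluster):

   ceph balancer mode upmap
   # limit how much of the cluster may be misplaced/moving at any one time
   ceph config set mgr target_max_misplaced_ratio 0.05
   # how tightly the per-OSD PG counts should converge on the mean
   ceph config set mgr mgr/balancer/upmap_max_deviation 1
   ceph balancer on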
I'm expecting a large data movement, and this is a production cluster.
I'm also afraid that during the balancing or the CRUSH weight changes
some OSDs will become full. I've run into that already and had to move
some PGs manually to other OSDs in the same failure domain.
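For illustration, such a manual move can be done with an explicit upmap
entry (the pg id here is made up, and the OSD ids are just picked from
the `ceph osd df` output above to show the form):

   # remap one PG away from a nearly full OSD onto a less used OSD in the same datacenter
   ceph osd pg-upmap-items 20.3f 248 428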
I would appreciate any suggestion on that.
Thank you!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx