Hi Denis,

As the vast majority of your OSDs have bluestore_min_alloc_size = 65536,
I think you can safely ignore https://tracker.ceph.com/issues/64715. The
only consequence will be that the remaining 58 OSDs end up less full than
the others. In other words, please use either the hybrid approach or the
built-in balancer right away.

As for migrating to the modern default for bluestore_min_alloc_size: yes,
recreating the OSDs host by host (once you have the cluster balanced) is
the only way. You can keep using the built-in balancer while doing that.
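
If you want to double-check the split before deciding, something along
these lines should show it (an untested sketch; it assumes jq is installed
and that your OSDs report bluestore_min_alloc_size in their metadata, which
recent releases do):

    # count OSDs per reported bluestore_min_alloc_size
    ceph osd metadata | jq -r '.[].bluestore_min_alloc_size' | sort | uniq -c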
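
And once you decide to switch to the built-in balancer, a conservative
starting sequence would look roughly like this (adjust the ratio later as
described in my earlier mail below; upmap also requires
"ceph osd set-require-min-compat-client luminous" if that is not set yet):

    # let the balancer misplace at most 1% of PGs at a time
    ceph config set mgr target_max_misplaced_ratio 0.01
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status    # check that it is active and progressing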

On Mon, Mar 25, 2024 at 5:04 PM Denis Polom <denispolom@xxxxxxxxx> wrote:
>
> Hi Alexander,
>
> that sounds pretty promising to me.
>
> I've checked bluestore_min_alloc_size and most OSDs (1370 of them) have
> the value 65536.
>
> You mentioned: "You will have to do that weekly until you redeploy all
> OSDs that were created with 64K bluestore_min_alloc_size"
>
> Is recreating each OSD really the only way to approach this?
>
> Thank you for the reply
>
> dp
>
> On 3/24/24 12:44 PM, Alexander E. Patrakov wrote:
> > Hi Denis,
> >
> > My approach would be:
> >
> > 1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K
> > bluestore_min_alloc_size. If so, you cannot really use the built-in
> > balancer, as it would result in a bimodal distribution instead of a
> > proper balance, see https://tracker.ceph.com/issues/64715, but let's
> > ignore this little issue if you have enough free space.
> > 2. Change the weights as appropriate. Make absolutely sure that there
> > are no reweights other than 1.0. Delete all dead or destroyed OSDs
> > from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL
> > warnings that appear; they will be gone during the next step.
> > 3. Run this little script from CERN to stop the data movement that was
> > just initiated:
> > https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py,
> > piping its output to bash. This should cancel most of the data movement,
> > but not all - the script cannot stop the situation where two OSDs want
> > to exchange their erasure-coded shards, like this: [1,2,3,4] ->
> > [1,3,2,4].
> > 4. Set the "target max misplaced ratio" option for the MGR to what you
> > think is appropriate. The default is 0.05, which means that the
> > balancer will allow at most 5% of the PGs to participate in the data
> > movement. I suggest starting with 0.01 and increasing it if there is
> > no visible impact of the balancing on the client traffic.
> > 5. Enable the balancer.
> >
> > If you think that https://tracker.ceph.com/issues/64715 is a problem
> > that would prevent you from using the built-in balancer, then instead
> > of steps 4 and 5:
> >
> > 4. Download this script:
> > https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> > 5. Run it as follows: ./placementoptimizer.py -v balance --osdsize
> > device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash
> >
> > This will move at most 500 PGs to better places, starting with the
> > fullest OSDs. All weights are ignored, and the switches take care of
> > the bluestore_min_alloc_size overhead mismatch. You will have to do
> > that weekly until you redeploy all OSDs that were created with 64K
> > bluestore_min_alloc_size.
> >
> > A hybrid approach (an initial round of balancing with TheJJ, then
> > switching to the built-in balancer) may also be viable.
> >
> > On Sun, Mar 24, 2024 at 7:09 PM Denis Polom <denispolom@xxxxxxxxx> wrote:
> >> Hi guys,
> >>
> >> recently I took over care of a Ceph cluster that is extremely
> >> unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus ->
> >> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.
> >>
> >> The CRUSH failure domain is datacenter (there are 3); the data pool is EC 3+3.
> >>
> >> This cluster has had the balancer disabled for years and was "balanced"
> >> manually by changing OSD crush weights. So now it is a complete mess,
> >> and I would like to set all OSD crush weights to the same value
> >> (3.63898) and enable the balancer with upmap.
> >>
> >> From `ceph osd df` sorted from the least used to the most used OSDs:
> >>
> >> ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
> >> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
> >>                     TOTAL  5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
> >> 428   hdd  3.63898   1.00000  3.6 TiB  2.0 TiB  2.0 TiB    1 KiB  5.6 GiB  1.7 TiB  54.55  0.76   96  up
> >> 223   hdd  3.63898   1.00000  3.6 TiB  2.0 TiB  2.0 TiB    3 KiB  5.6 GiB  1.7 TiB  54.58  0.76   95  up
> >> ...
> >> 591   hdd  3.53999   1.00000  3.6 TiB  3.0 TiB  3.0 TiB    1 KiB  7.0 GiB  680 GiB  81.74  1.14  125  up
> >> 832   hdd  3.59999   1.00000  3.6 TiB  3.0 TiB  3.0 TiB    4 KiB  6.9 GiB  680 GiB  81.75  1.14  114  up
> >> 248   hdd  3.63898   1.00000  3.6 TiB  3.0 TiB  3.0 TiB    3 KiB  7.2 GiB  646 GiB  82.67  1.16  121  up
> >> 559   hdd  3.63799   1.00000  3.6 TiB  3.0 TiB  3.0 TiB      0 B  7.0 GiB  644 GiB  82.70  1.16  123  up
> >>                     TOTAL  5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
> >> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
> >>
> >> crush rule:
> >>
> >> {
> >>     "rule_id": 10,
> >>     "rule_name": "ec33hdd_rule",
> >>     "type": 3,
> >>     "steps": [
> >>         {
> >>             "op": "set_chooseleaf_tries",
> >>             "num": 5
> >>         },
> >>         {
> >>             "op": "set_choose_tries",
> >>             "num": 100
> >>         },
> >>         {
> >>             "op": "take",
> >>             "item": -2,
> >>             "item_name": "default~hdd"
> >>         },
> >>         {
> >>             "op": "choose_indep",
> >>             "num": 3,
> >>             "type": "datacenter"
> >>         },
> >>         {
> >>             "op": "choose_indep",
> >>             "num": 2,
> >>             "type": "osd"
> >>         },
> >>         {
> >>             "op": "emit"
> >>         }
> >>     ]
> >> }
> >>
> >> My question is: what would be the proper and safest way to make this happen?
> >>
> >> * should I first enable the balancer, let it do its work, and only
> >> after that change the OSD crush weights to be even?
> >>
> >> * or the other way around - first make the crush weights even and then
> >> enable the balancer?
> >>
> >> * or is there another, safer way?
> >>
> >> What are the ideal balancer settings for that?
> >>
> >> I'm expecting a large data movement, and this is a production cluster.
> >>
> >> I'm also afraid that during the balancing, or while changing the crush
> >> weights, some OSDs may become full. I have already run into that and
> >> had to move some PGs manually to other OSDs in the same failure domain.
> >>
> >> I would appreciate any suggestions on that.
> >>
> >> Thank you!
> >
> >
> > --
> > Alexander E. Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx