I ended up balancing my osdmap myself offline to figure out why the balancer couldn't do better. I had similar issues with osdmaptool, which is of course what I expected, but it's a lot easier to run osdmaptool in a debugger to see what's happening. When
I dug into the upmap code I discovered that my problem was due to the way that code balances OSDs. In my case the average PG count per OSD is 56.882, so as soon as any OSD had 56 PGs it wouldn't get any more, no matter what I used as my max deviation. I got
into a state where each OSD had 56-61 PGs, and the upmap code wouldn't do any better because there were no "underfull" OSDs onto which to move PGs.
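For anyone who wants to reproduce that kind of offline analysis, something along these lines should work on Luminous (the pool name is a placeholder, and on 12.2.x --upmap-deviation is a fraction rather than a PG count):

ceph osd getmap -o om
osdmaptool om --upmap out.txt --upmap-pool <poolname> --upmap-max 100 --upmap-deviation 0.01
cat out.txt

out.txt contains the "ceph osd pg-upmap-items ..." commands the optimizer would issue, so you can inspect them (or step through osdmaptool in a debugger) without touching the live cluster.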
I made some changes to the osdmap code to ensure the computed "overfull" and "underfull" OSD lists are the same size, even if the least or most full OSDs are within the expected deviation, so that the OSDs outside the expected deviation get some
relief, and it worked nicely. I have two independent production pools that were both in this state, and now every OSD across both pools has 56 or 57 PGs, as expected.
I intend to put together a pull request to push this upstream. I haven't reviewed the balancer module code to see how it does things, but assuming it uses osdmaptool, or the same upmap code osdmaptool uses, this should also improve the balancer module.
From the balancer module's code for v12.2.7 I noticed [1] these lines,
which reference [2] these two config options for upmap. You might try
more max iterations or a smaller max deviation to see if you can get a
better balance in your cluster. I would start with [3] these
commands/values and see whether they improve your balance and/or allow
you to generate a better map.
[1] https://github.com/ceph/ceph/blob/v12.2.7/src/pybind/mgr/balancer/module.py#L671-L672
[2] upmap_max_iterations (default 10)
    upmap_max_deviation (default .01)
[3] ceph config-key set mgr/balancer/upmap_max_iterations 50
    ceph config-key set mgr/balancer/upmap_max_deviation .005
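After setting those keys it can also help to drive the balancer by hand instead of waiting for its periodic runs; a rough sequence, where "myplan" is just an example name:

ceph balancer eval
ceph balancer optimize myplan
ceph balancer show myplan
ceph balancer eval myplan
ceph balancer execute myplan

"eval" prints a score for the current distribution (lower is better), and "eval myplan" shows what the score would be after applying the plan, so you can compare before executing.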
This did not help my 12.2.8 cluster. While the first balancing iterations were running I decreased max_misplaced from the default 0.05 to 0.01, and after that the balancing operations stopped.
Since the cluster returned to HEALTH_OK I haven't seen any balancer runs at all. I tried lowering the balancer variables and restarting the mgr, but the message is still: "Error EALREADY: Unable to find further optimization,or distribution is already perfect"
# ceph config-key dump | grep balancer
"mgr/balancer/active": "1",
"mgr/balancer/max_misplaced": ".50",
"mgr/balancer/mode": "upmap",
"mgr/balancer/upmap_max_deviation": ".001",
"mgr/balancer/upmap_max_iterations": "100",
So maybe I need to delete the upmaps and start over?
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 414.00000 - 445TiB 129TiB 316TiB 29.01 1.00 - root default
-7 414.00000 - 445TiB 129TiB 316TiB 29.01 1.00 - datacenter rtcloud
-8 138.00000 - 148TiB 42.9TiB 105TiB 28.93 1.00 - rack rack2
-2 69.00000 - 74.2TiB 21.5TiB 52.7TiB 28.93 1.00 - host ceph-osd0
0 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.06 1.04 62 osd.0
4 hdd 5.00000 1.00000 5.46TiB 1.65TiB 3.80TiB 30.29 1.04 64 osd.4
7 hdd 5.00000 1.00000 5.46TiB 1.61TiB 3.85TiB 29.44 1.01 63 osd.7
9 hdd 5.00000 1.00000 5.46TiB 1.68TiB 3.78TiB 30.77 1.06 63 osd.9
46 hdd 5.00000 1.00000 5.46TiB 1.68TiB 3.77TiB 30.86 1.06 65 osd.46
47 hdd 5.00000 1.00000 5.46TiB 1.68TiB 3.78TiB 30.73 1.06 66 osd.47
48 hdd 5.00000 1.00000 5.46TiB 1.65TiB 3.81TiB 30.22 1.04 66 osd.48
49 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.74TiB 31.41 1.08 65 osd.49
54 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.08 1.04 65 osd.54
55 hdd 5.00000 1.00000 5.46TiB 1.65TiB 3.80TiB 30.30 1.04 64 osd.55
56 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.35 1.05 64 osd.56
57 hdd 5.00000 1.00000 5.46TiB 1.63TiB 3.83TiB 29.81 1.03 64 osd.57
24 nvme 3.00000 1.00000 2.89TiB 559GiB 2.34TiB 18.88 0.65 63 osd.24
74 nvme 3.00000 1.00000 2.89TiB 526GiB 2.38TiB 17.76 0.61 67 osd.74
84 nvme 3.00000 1.00000 2.89TiB 522GiB 2.38TiB 17.63 0.61 66 osd.84
-6 69.00000 - 74.2TiB 21.5TiB 52.7TiB 28.94 1.00 - host ceph-osd2
12 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.35 1.05 66 osd.12
15 hdd 5.00000 1.00000 5.46TiB 1.67TiB 3.79TiB 30.58 1.05 68 osd.15
18 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.05 1.04 65 osd.18
19 hdd 5.00000 1.00000 5.46TiB 1.59TiB 3.86TiB 29.21 1.01 64 osd.19
50 hdd 5.00000 1.00000 5.46TiB 1.63TiB 3.83TiB 29.84 1.03 65 osd.50
51 hdd 5.00000 1.00000 5.46TiB 1.72TiB 3.74TiB 31.44 1.08 66 osd.51
52 hdd 5.00000 1.00000 5.46TiB 1.70TiB 3.75TiB 31.24 1.08 64 osd.52
53 hdd 5.00000 1.00000 5.46TiB 1.60TiB 3.86TiB 29.36 1.01 64 osd.53
58 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.74TiB 31.38 1.08 64 osd.58
59 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.37 1.05 66 osd.59
60 hdd 5.00000 1.00000 5.46TiB 1.60TiB 3.85TiB 29.38 1.01 66 osd.60
61 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.75TiB 31.31 1.08 66 osd.61
75 nvme 3.00000 1.00000 2.89TiB 535GiB 2.37TiB 18.06 0.62 62 osd.75
85 nvme 3.00000 1.00000 2.89TiB 510GiB 2.39TiB 17.24 0.59 63 osd.85
86 nvme 3.00000 1.00000 2.89TiB 560GiB 2.34TiB 18.92 0.65 66 osd.86
-9 138.00000 - 148TiB 43.2TiB 105TiB 29.10 1.00 - rack rack3
-3 69.00000 - 74.2TiB 21.6TiB 52.5TiB 29.18 1.01 - host ceph-osd3
20 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.79TiB 30.50 1.05 69 osd.20
21 hdd 5.00000 1.00000 5.46TiB 1.68TiB 3.77TiB 30.85 1.06 64 osd.21
22 hdd 5.00000 1.00000 5.46TiB 1.72TiB 3.74TiB 31.43 1.08 65 osd.22
23 hdd 5.00000 1.00000 5.46TiB 1.62TiB 3.83TiB 29.75 1.03 64 osd.23
34 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.33 1.05 65 osd.34
35 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.79TiB 30.47 1.05 65 osd.35
36 hdd 5.00000 1.00000 5.46TiB 1.67TiB 3.79TiB 30.54 1.05 65 osd.36
37 hdd 5.00000 1.00000 5.46TiB 1.69TiB 3.76TiB 31.03 1.07 64 osd.37
62 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.74TiB 31.42 1.08 67 osd.62
63 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.43 1.05 66 osd.63
64 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.38 1.05 65 osd.64
65 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.10 1.04 64 osd.65
30 nvme 3.00000 1.00000 2.89TiB 562GiB 2.34TiB 19.00 0.65 65 osd.30
76 nvme 3.00000 1.00000 2.89TiB 531GiB 2.37TiB 17.96 0.62 65 osd.76
88 nvme 3.00000 1.00000 2.89TiB 546GiB 2.36TiB 18.43 0.64 68 osd.88
-11 69.00000 - 74.2TiB 21.5TiB 52.6TiB 29.01 1.00 - host ceph-osd5
10 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.74TiB 31.39 1.08 64 osd.10
13 hdd 5.00000 1.00000 5.46TiB 1.68TiB 3.78TiB 30.77 1.06 65 osd.13
16 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.75TiB 31.32 1.08 64 osd.16
17 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.74TiB 31.42 1.08 69 osd.17
27 hdd 5.00000 1.00000 5.46TiB 1.62TiB 3.84TiB 29.62 1.02 62 osd.27
31 hdd 5.00000 1.00000 5.46TiB 1.72TiB 3.74TiB 31.53 1.09 64 osd.31
32 hdd 5.00000 1.00000 5.46TiB 1.63TiB 3.83TiB 29.78 1.03 65 osd.32
33 hdd 5.00000 1.00000 5.46TiB 1.68TiB 3.78TiB 30.73 1.06 67 osd.33
70 hdd 5.00000 1.00000 5.46TiB 1.65TiB 3.80TiB 30.31 1.05 65 osd.70
71 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.06 1.04 65 osd.71
72 hdd 5.00000 1.00000 5.46TiB 1.56TiB 3.90TiB 28.61 0.99 61 osd.72
73 hdd 5.00000 1.00000 5.46TiB 1.60TiB 3.86TiB 29.27 1.01 65 osd.73
29 nvme 3.00000 1.00000 2.89TiB 541GiB 2.36TiB 18.27 0.63 65 osd.29
78 nvme 3.00000 1.00000 2.89TiB 541GiB 2.36TiB 18.28 0.63 64 osd.78
89 nvme 3.00000 1.00000 2.89TiB 562GiB 2.34TiB 19.00 0.66 63 osd.89
-10 138.00000 - 148TiB 43.0TiB 105TiB 28.99 1.00 - rack rack4
-4 69.00000 - 74.2TiB 21.2TiB 52.9TiB 28.65 0.99 - host ceph-osd1
1 hdd 5.00000 1.00000 5.46TiB 1.65TiB 3.81TiB 30.16 1.04 67 osd.1
2 hdd 5.00000 1.00000 5.46TiB 1.63TiB 3.83TiB 29.82 1.03 64 osd.2
3 hdd 5.00000 1.00000 5.46TiB 1.62TiB 3.83TiB 29.74 1.03 62 osd.3
5 hdd 5.00000 1.00000 5.46TiB 1.59TiB 3.86TiB 29.20 1.01 63 osd.5
38 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.02 1.03 63 osd.38
39 hdd 5.00000 1.00000 5.46TiB 1.62TiB 3.83TiB 29.77 1.03 63 osd.39
40 hdd 5.00000 1.00000 5.46TiB 1.68TiB 3.78TiB 30.79 1.06 63 osd.40
41 hdd 5.00000 1.00000 5.46TiB 1.69TiB 3.76TiB 31.04 1.07 67 osd.41
80 hdd 5.00000 1.00000 5.46TiB 1.65TiB 3.81TiB 30.17 1.04 69 osd.80
81 hdd 5.00000 1.00000 5.46TiB 1.61TiB 3.84TiB 29.56 1.02 68 osd.81
82 hdd 5.00000 1.00000 5.46TiB 1.70TiB 3.76TiB 31.06 1.07 65 osd.82
83 hdd 5.00000 1.00000 5.46TiB 1.56TiB 3.90TiB 28.58 0.99 65 osd.83
25 nvme 3.00000 1.00000 2.89TiB 558GiB 2.34TiB 18.87 0.65 65 osd.25
79 nvme 3.00000 1.00000 2.89TiB 541GiB 2.36TiB 18.29 0.63 63 osd.79
87 nvme 3.00000 1.00000 2.89TiB 540GiB 2.36TiB 18.26 0.63 65 osd.87
-5 69.00000 - 74.2TiB 21.8TiB 52.4TiB 29.34 1.01 - host ceph-osd4
6 hdd 5.00000 1.00000 5.46TiB 1.67TiB 3.79TiB 30.62 1.06 66 osd.6
8 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.04 1.04 63 osd.8
11 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.41 1.05 66 osd.11
14 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.75TiB 31.36 1.08 66 osd.14
42 hdd 5.00000 1.00000 5.46TiB 1.69TiB 3.77TiB 30.95 1.07 65 osd.42
43 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.75TiB 31.36 1.08 65 osd.43
44 hdd 5.00000 1.00000 5.46TiB 1.67TiB 3.78TiB 30.66 1.06 67 osd.44
45 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.38 1.05 62 osd.45
66 hdd 5.00000 1.00000 5.46TiB 1.66TiB 3.80TiB 30.33 1.05 65 osd.66
67 hdd 5.00000 1.00000 5.46TiB 1.72TiB 3.74TiB 31.43 1.08 67 osd.67
68 hdd 5.00000 1.00000 5.46TiB 1.71TiB 3.75TiB 31.32 1.08 65 osd.68
69 hdd 5.00000 1.00000 5.46TiB 1.64TiB 3.82TiB 30.10 1.04 62 osd.69
26 nvme 3.00000 1.00000 2.89TiB 559GiB 2.34TiB 18.89 0.65 66 osd.26
28 nvme 3.00000 1.00000 2.89TiB 563GiB 2.34TiB 19.01 0.66 66 osd.28
77 nvme 3.00000 1.00000 2.89TiB 541GiB 2.36TiB 18.30 0.63 66 osd.77
TOTAL 445TiB 129TiB 316TiB 29.01
MIN/MAX VAR: 0.59/1.09 STDDEV: 4.96
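If the existing upmap exceptions really do need to be cleared so the balancer can start from scratch, something like this should do it; the parsing assumes the pg_upmap_items lines in the plain "ceph osd dump" output, so adjust it if your output differs:

ceph osd dump | grep pg_upmap_items
for pg in $(ceph osd dump | awk '/pg_upmap_items/ {print $2}'); do ceph osd rm-pg-upmap-items "$pg"; done

That removes the exceptions one PG at a time; expect some data movement while CRUSH falls back to its unmodified placements.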
k