This is a natural consequence of CRUSH’s pseudo-random placement. You don’t mention which release the back end or the clients are running, so it’s difficult to give an exact answer.

Don’t mess with the CRUSH weights.

Either adjust the override reweights with `ceph osd test-reweight-by-utilization` / `ceph osd reweight-by-utilization`

https://docs.ceph.com/docs/master/rados/operations/control/

or use the balancer module in newer releases *iff* all clients are new enough to handle pg-upmap

https://docs.ceph.com/docs/nautilus/rados/operations/balancer/

> On Jul 30, 2020, at 9:21 AM, Budai Laszlo <laszlo.budai@xxxxxxxxx> wrote:
>
> Dear all,
>
> We have a ceph cluster where we have configured two SSD-only pools in order to use them as a cache tier for the spinning disks. Altogether there are 27 SSDs organized on 9 hosts distributed in 3 chassis. The hierarchy looks like this:
>
> $ ceph osd df tree | grep -E 'ssd|ID'
>  ID CLASS  WEIGHT REWEIGHT    SIZE     USE   AVAIL  %USE  VAR PGS TYPE NAME
> -40       8.26199        - 8.26TiB 5.78TiB 2.48TiB 70.02 5.77   - root ssd-root
> -50       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c1-ssd
> -41       0.91800        -  940GiB  651GiB  289GiB 69.23 5.71   -         host c1-h01-ssd
> 110   ssd 0.30600  1.00000  313GiB  199GiB  115GiB 63.37 5.22  77             osd.110
> 116   ssd 0.30600  1.00000  313GiB  219GiB 94.3GiB 69.91 5.76  89             osd.116
> 119   ssd 0.30600  1.00000  313GiB  233GiB 80.2GiB 74.41 6.13  87             osd.119
> -42       0.91800        -  940GiB  701GiB  239GiB 74.61 6.15   -         host c1-h02-ssd
> 112   ssd 0.30600  1.00000  313GiB  228GiB 84.9GiB 72.91 6.01  85             osd.112
> 117   ssd 0.30600  1.00000  313GiB  245GiB 67.9GiB 78.32 6.46  97             osd.117
> 122   ssd 0.30600  1.00000  313GiB  227GiB 85.8GiB 72.61 5.99  87             osd.122
> -43       0.91800        -  940GiB  622GiB  318GiB 66.21 5.46   -         host c1-h03-ssd
> 109   ssd 0.30600  1.00000  313GiB  192GiB  122GiB 61.15 5.04  77             osd.109
> 115   ssd 0.30600  1.00000  313GiB  206GiB  107GiB 65.79 5.42  79             osd.115
> 120   ssd 0.30600  1.00000  313GiB  225GiB 88.7GiB 71.70 5.91  90             osd.120
> -51       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c2-ssd
> -46       0.91800        -  940GiB  651GiB  288GiB 69.31 5.71   -         host c2-h01-ssd
> 125   ssd 0.30600  1.00000  313GiB  211GiB  103GiB 67.22 5.54  81             osd.125
> 130   ssd 0.30600  1.00000  313GiB  233GiB 80.4GiB 74.33 6.13  89             osd.130
> 132   ssd 0.30600  1.00000  313GiB  208GiB  105GiB 66.38 5.47  79             osd.132
> -45       0.91800        -  940GiB  672GiB  267GiB 71.54 5.90   -         host c2-h02-ssd
> 126   ssd 0.30600  1.00000  313GiB  216GiB 97.4GiB 68.90 5.68  87             osd.126
> 129   ssd 0.30600  1.00000  313GiB  207GiB  106GiB 66.12 5.45  80             osd.129
> 134   ssd 0.30600  1.00000  313GiB  249GiB 63.9GiB 79.61 6.56  99             osd.134
> -44       0.91800        -  940GiB  650GiB  289GiB 69.20 5.70   -         host c2-h03-ssd
> 123   ssd 0.30600  1.00000  313GiB  201GiB  112GiB 64.23 5.29  76             osd.123
> 127   ssd 0.30600  1.00000  313GiB  217GiB 96.1GiB 69.31 5.71  85             osd.127
> 131   ssd 0.30600  1.00000  313GiB  232GiB 81.2GiB 74.06 6.11  92             osd.131
> -52       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c3-ssd
> -47       0.91800        -  940GiB  628GiB  311GiB 66.86 5.51   -         host c3-h01-ssd
> 124   ssd 0.30600  1.00000  313GiB  204GiB  109GiB 65.13 5.37  78             osd.124
> 128   ssd 0.30600  1.00000  313GiB  202GiB  111GiB 64.59 5.32  76             osd.128
> 133   ssd 0.30600  1.00000  313GiB  222GiB 91.3GiB 70.86 5.84  86             osd.133
> -48       0.91800        -  940GiB  628GiB  312GiB 66.80 5.51   -         host c3-h02-ssd
> 108   ssd 0.30600  1.00000  313GiB  220GiB 92.9GiB 70.35 5.80  86             osd.108
> 114   ssd 0.30600  1.00000  313GiB  209GiB  105GiB 66.58 5.49  82             osd.114
> 121   ssd 0.30600  1.00000  313GiB  199GiB  114GiB 63.46 5.23  79             osd.121
> -49       0.91800        -  940GiB  718GiB  222GiB 76.40 6.30   -         host c3-h03-ssd
> 111   ssd 0.30600  1.00000  313GiB  219GiB 94.4GiB 69.87 5.76  84             osd.111
> 113   ssd 0.30600  1.00000  313GiB  241GiB 72.2GiB 76.95 6.34  96             osd.113
> 118   ssd 0.30600  1.00000  313GiB  258GiB 55.2GiB 82.39 6.79 101             osd.118
>
>
> The rule used for the two pools is the following:
>
> {
>     "rule_id": 1,
>     "rule_name": "ssd",
>     "ruleset": 1,
>     "type": 1,
>     "min_size": 1,
>     "max_size": 10,
>     "steps": [
>         {
>             "op": "take",
>             "item": -40,
>             "item_name": "ssd-root"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "chassis"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
>
>
> Both pools have size 3, and the total number of PGs is 768 (256+512).
>
> As you can see from the previous table (the PGS column), there is a significant difference between the OSD with the largest number of PGs (101 PGs on osd.118) and the ones with the smallest number (76 PGs on osd.123). The ratio between the two is about 1.33. So osd.118 has a higher chance of receiving data than osd.123, and we can see that osd.118 is indeed the one storing the most data (82.39% full in the above table).
>
> I would like to rebalance the PG/OSD allocation. I know that I can play around with the OSD weights (currently .306 for all the OSDs), but I wonder if there is any drawback to this in the long run? Are you aware of any reason why I should NOT modify the weights (and leave those modifications permanent)?
>
> Any ideas are welcome :)
>
> Kind regards,
> Laszlo
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
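
The PG arithmetic in the quoted message can be checked with a short sketch. It assumes the PGS column is transcribed correctly from the `ceph osd df tree` output above; the variable names are illustrative only and are not part of any Ceph tooling.

```python
# PG counts per OSD, copied from the PGS column of the quoted
# `ceph osd df tree` output (OSD id -> number of PGs).
pgs_per_osd = {
    110: 77, 116: 89, 119: 87,    # c1-h01-ssd
    112: 85, 117: 97, 122: 87,    # c1-h02-ssd
    109: 77, 115: 79, 120: 90,    # c1-h03-ssd
    125: 81, 130: 89, 132: 79,    # c2-h01-ssd
    126: 87, 129: 80, 134: 99,    # c2-h02-ssd
    123: 76, 127: 85, 131: 92,    # c2-h03-ssd
    124: 78, 128: 76, 133: 86,    # c3-h01-ssd
    108: 86, 114: 82, 121: 79,    # c3-h02-ssd
    111: 84, 113: 96, 118: 101,   # c3-h03-ssd
}

pool_pgs = 256 + 512   # total PGs across the two pools
replicas = 3           # both pools have size 3

# Every PG is stored `replicas` times, so this is the number of
# PG copies that must be spread over the OSDs.
total_placements = pool_pgs * replicas

# Sanity check on the transcription: the per-OSD counts must add up.
assert sum(pgs_per_osd.values()) == total_placements

mean = total_placements / len(pgs_per_osd)
busiest = max(pgs_per_osd, key=pgs_per_osd.get)
fewest = min(pgs_per_osd.values())
least_loaded = [osd for osd, n in pgs_per_osd.items() if n == fewest]
spread = pgs_per_osd[busiest] / fewest

print(f"mean PGs per OSD: {mean:.1f}")
print(f"most PGs:  osd.{busiest} with {pgs_per_osd[busiest]}")
print(f"fewest PGs: {fewest} on OSDs {least_loaded}")
print(f"max/min ratio: {spread:.2f}")
```

With these numbers an even share would be about 85.3 PGs per OSD, so osd.118 carries roughly 18% more than that, which is the kind of imbalance the override reweights or the balancer module are meant to reduce.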