Hello Anthony,

thank you for your answer. I forgot to mention the version. It's Luminous (12.2.9), the clients are OpenStack (Queens) VMs.

Kind regards,
Laszlo

On 7/30/20 8:59 PM, Anthony D'Atri wrote:
> This is a natural condition of CRUSH. You don’t mention what release the back-end or the clients are running so it’s difficult to give an exact answer.
>
> Don’t mess with the CRUSH weights.
>
> Either adjust the override / reweights with `ceph osd test-reweight-by-utilization / reweight-by-utilization`
>
> https://docs.ceph.com/docs/master/rados/operations/control/
>
> or use the balancer module in newer releases *iff* all clients are new enough to handle pg-upmap
>
> https://docs.ceph.com/docs/nautilus/rados/operations/balancer/
>
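
For reference, the override/reweight route suggested above might look roughly like this on Luminous; the oload, max_change and max_osds values (110, 0.05, 12) are purely illustrative and should be tuned per cluster:

$ # dry run: report which OSDs would be reweighted and by how much, without changing anything
$ ceph osd test-reweight-by-utilization 110 0.05 12
$ # apply the same adjustment once the proposed changes look reasonable
$ ceph osd reweight-by-utilization 110 0.05 12
$ # a single override can also be set by hand, e.g. slightly lowering the fullest OSD
$ ceph osd reweight 118 0.95

These commands only touch the override (REWEIGHT) column, not the CRUSH weights, so they can be re-run or reverted later without altering the CRUSH hierarchy.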
>> On Jul 30, 2020, at 9:21 AM, Budai Laszlo <laszlo.budai@xxxxxxxxx> wrote:
>>
>> Dear all,
>>
>> We have a Ceph cluster where we have configured two SSD-only pools in order to use them as cache tiers for the spinning disks. Altogether there are 27 SSDs organized on 9 hosts distributed in 3 chassis. The hierarchy looks like this:
>>
>> $ ceph osd df tree | grep -E 'ssd|ID'
>> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
>> -40 8.26199 - 8.26TiB 5.78TiB 2.48TiB 70.02 5.77 - root ssd-root
>> -50 2.75400 - 2.75TiB 1.93TiB 845GiB 70.02 5.77 - chassis c1-ssd
>> -41 0.91800 - 940GiB 651GiB 289GiB 69.23 5.71 - host c1-h01-ssd
>> 110 ssd 0.30600 1.00000 313GiB 199GiB 115GiB 63.37 5.22 77 osd.110
>> 116 ssd 0.30600 1.00000 313GiB 219GiB 94.3GiB 69.91 5.76 89 osd.116
>> 119 ssd 0.30600 1.00000 313GiB 233GiB 80.2GiB 74.41 6.13 87 osd.119
>> -42 0.91800 - 940GiB 701GiB 239GiB 74.61 6.15 - host c1-h02-ssd
>> 112 ssd 0.30600 1.00000 313GiB 228GiB 84.9GiB 72.91 6.01 85 osd.112
>> 117 ssd 0.30600 1.00000 313GiB 245GiB 67.9GiB 78.32 6.46 97 osd.117
>> 122 ssd 0.30600 1.00000 313GiB 227GiB 85.8GiB 72.61 5.99 87 osd.122
>> -43 0.91800 - 940GiB 622GiB 318GiB 66.21 5.46 - host c1-h03-ssd
>> 109 ssd 0.30600 1.00000 313GiB 192GiB 122GiB 61.15 5.04 77 osd.109
>> 115 ssd 0.30600 1.00000 313GiB 206GiB 107GiB 65.79 5.42 79 osd.115
>> 120 ssd 0.30600 1.00000 313GiB 225GiB 88.7GiB 71.70 5.91 90 osd.120
>> -51 2.75400 - 2.75TiB 1.93TiB 845GiB 70.02 5.77 - chassis c2-ssd
>> -46 0.91800 - 940GiB 651GiB 288GiB 69.31 5.71 - host c2-h01-ssd
>> 125 ssd 0.30600 1.00000 313GiB 211GiB 103GiB 67.22 5.54 81 osd.125
>> 130 ssd 0.30600 1.00000 313GiB 233GiB 80.4GiB 74.33 6.13 89 osd.130
>> 132 ssd 0.30600 1.00000 313GiB 208GiB 105GiB 66.38 5.47 79 osd.132
>> -45 0.91800 - 940GiB 672GiB 267GiB 71.54 5.90 - host c2-h02-ssd
>> 126 ssd 0.30600 1.00000 313GiB 216GiB 97.4GiB 68.90 5.68 87 osd.126
>> 129 ssd 0.30600 1.00000 313GiB 207GiB 106GiB 66.12 5.45 80 osd.129
>> 134 ssd 0.30600 1.00000 313GiB 249GiB 63.9GiB 79.61 6.56 99 osd.134
>> -44 0.91800 - 940GiB 650GiB 289GiB 69.20 5.70 - host c2-h03-ssd
>> 123 ssd 0.30600 1.00000 313GiB 201GiB 112GiB 64.23 5.29 76 osd.123
>> 127 ssd 0.30600 1.00000 313GiB 217GiB 96.1GiB 69.31 5.71 85 osd.127
>> 131 ssd 0.30600 1.00000 313GiB 232GiB 81.2GiB 74.06 6.11 92 osd.131
>> -52 2.75400 - 2.75TiB 1.93TiB 845GiB 70.02 5.77 - chassis c3-ssd
>> -47 0.91800 - 940GiB 628GiB 311GiB 66.86 5.51 - host c3-h01-ssd
>> 124 ssd 0.30600 1.00000 313GiB 204GiB 109GiB 65.13 5.37 78 osd.124
>> 128 ssd 0.30600 1.00000 313GiB 202GiB 111GiB 64.59 5.32 76 osd.128
>> 133 ssd 0.30600 1.00000 313GiB 222GiB 91.3GiB 70.86 5.84 86 osd.133
>> -48 0.91800 - 940GiB 628GiB 312GiB 66.80 5.51 - host c3-h02-ssd
>> 108 ssd 0.30600 1.00000 313GiB 220GiB 92.9GiB 70.35 5.80 86 osd.108
>> 114 ssd 0.30600 1.00000 313GiB 209GiB 105GiB 66.58 5.49 82 osd.114
>> 121 ssd 0.30600 1.00000 313GiB 199GiB 114GiB 63.46 5.23 79 osd.121
>> -49 0.91800 - 940GiB 718GiB 222GiB 76.40 6.30 - host c3-h03-ssd
>> 111 ssd 0.30600 1.00000 313GiB 219GiB 94.4GiB 69.87 5.76 84 osd.111
>> 113 ssd 0.30600 1.00000 313GiB 241GiB 72.2GiB 76.95 6.34 96 osd.113
>> 118 ssd 0.30600 1.00000 313GiB 258GiB 55.2GiB 82.39 6.79 101 osd.118
>>
>> The rule used for the two pools is the following:
>>
>> {
>>     "rule_id": 1,
>>     "rule_name": "ssd",
>>     "ruleset": 1,
>>     "type": 1,
>>     "min_size": 1,
>>     "max_size": 10,
>>     "steps": [
>>         {
>>             "op": "take",
>>             "item": -40,
>>             "item_name": "ssd-root"
>>         },
>>         {
>>             "op": "chooseleaf_firstn",
>>             "num": 0,
>>             "type": "chassis"
>>         },
>>         {
>>             "op": "emit"
>>         }
>>     ]
>> }
>>
>> Both pools have size 3, and the total number of PGs is 768 (256+512).
>>
>> As you can see from the previous table (the PGS column), there is a significant difference between the OSD with the largest number of PGs (101 PGs on osd.118) and the one with the smallest number (76 PGs on osd.123). The ratio between the two is 1.32, so osd.118 has a greater chance of receiving data than osd.123, and we can see that indeed osd.118 is the one storing the most data (82.39% full in the table above).
>>
>> I would like to rebalance the PG/OSD allocation. I know that I can play around with the OSD weights (currently 0.306 for all the OSDs), but I wonder if there is any drawback to this in the long run. Are you aware of any reason why I should NOT modify the weights (and leave those modifications permanent)?
>>
>> Any ideas are welcome :)
>>
>> Kind regards,
>> Laszlo
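
If the balancer/upmap route mentioned above turns out to be an option (Anthony notes it requires every client to understand pg-upmap; `ceph features` shows what the connected Queens-era clients actually report), a minimal sketch of that workflow on Luminous and later could look like this, assuming the feature check passes:

$ # verify that all connected clients already report the luminous feature set
$ ceph features
$ # allow the cluster to rely on pg-upmap (the command refuses if pre-luminous clients are still connected)
$ ceph osd set-require-min-compat-client luminous
$ # enable the balancer manager module and let it run in upmap mode
$ ceph mgr module enable balancer
$ ceph balancer mode upmap
$ ceph balancer on
$ # check what the balancer is doing
$ ceph balancer status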