Re: ceph mgr balancer bad distribution

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Thu, 1 Mar 2018 10:38:55 +0100

On Thu, Mar 1, 2018 at 10:24 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
>
> On 01.03.2018 at 09:58, Dan van der Ster wrote:
>> On Thu, Mar 1, 2018 at 9:52 AM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> On 01.03.2018 at 09:42, Dan van der Ster wrote:
>>>> On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> Hi,
>>>>> On 01.03.2018 at 09:03, Dan van der Ster wrote:
>>>>>> Is the score improving?
>>>>>>
>>>>>>     ceph balancer eval
>>>>>>
>>>>>> It should be decreasing over time as the variances drop toward zero.
>>>>>>
>>>>>> You mentioned a crush optimize code at the beginning... how did that
>>>>>> leave your cluster? The mgr balancer assumes that the crush weight of
>>>>>> each OSD is equal to its size in TB.
>>>>>> Do you have any osd reweights? crush-compat will gradually adjust
>>>>>> those back to 1.0.
>>>>>
>>>>> I reweighted them all back to their correct weight.
>>>>>
>>>>> Now the mgr balancer module says:
>>>>> mgr[balancer] Failed to find further optimization, score 0.010646
>>>>>
>>>>> But as you can see it's heavily imbalanced:
>>>>>
>>>>>
>>>>> Example:
>>>>> 49   ssd 0.84000  1.00000   864G   546G   317G 63.26 1.13  49
>>>>>
>>>>> vs:
>>>>>
>>>>> 48   ssd 0.84000  1.00000   864G   397G   467G 45.96 0.82  49
>>>>>
>>>>> 45% usage vs. 63%
>>>>
>>>> Ahh... but look, the num PGs are perfectly balanced, which implies
>>>> that you have a relatively large number of empty PGs.
>>>>
>>>> But regardless, this is annoying and I expect lots of operators to get
>>>> this result. (I've also observed that the num PGs gets balanced
>>>> perfectly at the expense of the other score metrics.)
>>>>
>>>> I was thinking of a patch around here [1] that lets operators add a
>>>> score weight on pgs, objects, bytes so we can balance how we like.
>>>>
>>>> Spandan: you were the last to look at this function. Do you think it
>>>> can be improved as I suggested?
>>>
>>> Yes, the PGs are perfectly distributed - but I think most people
>>> would like to have a distribution by bytes and not PGs.
>>>
>>> Is this possible? I mean in the code there is already a dict for pgs,
>>> objects and bytes - but I don't know how to change the logic. Just
>>> remove the pgs and objects from the dict?
>>
>> It's worth a try to remove the pgs and objects from this dict:
>> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L552
>
> Do I have to change this 3 to 1 because we have only one item in the
> dict? I'm not sure where the 3 comes from.
>         pe.score /= 3 * len(roots)
>

I'm pretty sure that 3 is just for our 3 metrics (pgs, objects,
bytes). So yes, you can change that to 1.
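
To make the role of that divisor concrete, here is a minimal sketch of
the averaging, assuming the overall score is simply the sum of the
per-root, per-metric scores divided by (number of metrics * number of
roots), which is how the quoted line reads. The function name
overall_score, the root name "default" and the per-metric numbers are
all made up for illustration; this is not the module's actual code.

```python
def overall_score(score_by_root, metrics=("pgs", "objects", "bytes")):
    """Average the per-metric scores over all roots (illustrative only).

    score_by_root maps a CRUSH root name to a dict of per-metric scores.
    Dividing by len(metrics) * len(roots) generalizes the hard-coded
    "3 * len(roots)" so the divisor follows the metrics actually used.
    """
    total = 0.0
    for per_metric in score_by_root.values():
        for metric in metrics:
            total += per_metric[metric]
    return total / (len(metrics) * len(score_by_root))


# Hypothetical numbers, not taken from any real cluster.
example = {"default": {"pgs": 0.0, "objects": 0.01, "bytes": 0.08}}

all_metrics = overall_score(example)
bytes_only = overall_score(example, metrics=("bytes",))
print(all_metrics, bytes_only)
```

With numbers like these, a near-zero pgs score pulls the three-metric
average well below the bytes score, so a bytes-only score will look
worse even though nothing in the cluster has changed.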

I'm trying this on our test cluster here too. The last few lines of
output from `ceph balancer eval-verbose` will confirm that the score
is based only on bytes.

But I'm not sure this is going to work: the score here did go from
~0.02 to 0.08, but do_crush_compat doesn't manage to find a better
score.
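
For reference, the imbalance in the two OSD rows quoted above can be
checked with a few lines of Python. This is only a sketch: it assumes
the columns are size, used, available, percent used, variance and PG
count, and it uses the rounded G values, so the percentages come out
slightly different from the ones in the listing. The helper name
describe is made up.

```python
def describe(size_gb, used_gb, pgs):
    """Return (percent used, average GB stored per PG) for one OSD."""
    percent_used = 100.0 * used_gb / size_gb
    gb_per_pg = used_gb / pgs
    return percent_used, gb_per_pg


# Values copied from the two quoted rows (osd.49 and osd.48).
osds = {
    49: {"size_gb": 864, "used_gb": 546, "pgs": 49},
    48: {"size_gb": 864, "used_gb": 397, "pgs": 49},
}

for osd_id, o in osds.items():
    percent_used, gb_per_pg = describe(o["size_gb"], o["used_gb"], o["pgs"])
    print(f"osd.{osd_id}: {percent_used:.2f}% used, "
          f"{gb_per_pg:.2f} GB per PG on average")
```

Both OSDs hold 49 PGs, but the average data per PG differs by roughly
3 GB, which is why a perfect PG count does not give equal usage.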

-- Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com