Hi Dominic,

I must say that I inherited this cluster and did not develop the crush
rule used. The rule reads:

    "rule_id": 1,
    "rule_name": "hdd",
    "ruleset": 1,
    "type": 1,
    "min_size": 2,
    "max_size": 3,
    "steps": [
        { "op": "take", "item": -31, "item_name": "DC3" },
        { "op": "choose_firstn", "num": 0, "type": "room" },
        { "op": "chooseleaf_firstn", "num": 1, "type": "host" },

Doesn't that say it will choose DC3, then a room within DC3 and then a
host? (I agree that the racks in the tree are superfluous, but they do no
harm either.)
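If I translate that dump into the decompiled crushtool text form (note
that a final "step emit" is implied but got cut off in the paste above),
I believe it reads roughly like this:

    rule hdd {
        id 1
        type replicated
        min_size 2
        max_size 3
        step take DC3
        step choose firstn 0 type room
        step chooseleaf firstn 1 type host
        step emit
    }

That is: start at DC3, pick as many rooms under it as there are replicas
(num 0 means "as many as the pool size"), then pick one host in each of
those rooms and take an OSD from it.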
Anyway, thanks for your effort. I hope someone else can explain why
setting the crush weight of an OSD to 0 results in surprisingly many PGs
going to other OSDs on the same node instead of to other nodes.

Marcel

> Marcel;
>
> To answer your question, I don't see anything that would be keeping these
> PGs on the same node. Someone with more knowledge of how the CRUSH rules
> are applied, and the code around these operations, would need to weigh in.
>
> I am somewhat curious though; you define racks, and even rooms, in your
> tree, but your failure domain is set to host. Is that intentional?
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International, Inc.
> DHilsbos@xxxxxxxxxxxxxx
> www.PerformAir.com
>
>
> -----Original Message-----
> From: Marcel Kuiper [mailto:ceph@xxxxxxxx]
> Sent: Tuesday, July 21, 2020 10:14 AM
> To: ceph-users@xxxxxxx
> Cc: Dominic Hilsbos
> Subject: Re: Re: osd out vs crush reweight
>
> Dominic,
>
> The crush rule dump and tree are attached (hope that works). All pools
> use crush_rule 1.
>
> Marcel
>
>> Marcel;
>>
>> Sorry, could you also send the output of:
>> ceph osd tree
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director - Information Technology
>> Perform Air International, Inc.
>> DHilsbos@xxxxxxxxxxxxxx
>> www.PerformAir.com
>>
>>
>> -----Original Message-----
>> From: DHilsbos@xxxxxxxxxxxxxx [mailto:DHilsbos@xxxxxxxxxxxxxx]
>> Sent: Tuesday, July 21, 2020 9:41 AM
>> To: ceph@xxxxxxxx; ceph-users@xxxxxxx
>> Subject: Re: osd out vs crush reweight
>>
>> Marcel;
>>
>> Thank you for the information.
>>
>> Could you send the output of:
>> ceph osd crush rule dump
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director - Information Technology
>> Perform Air International, Inc.
>> DHilsbos@xxxxxxxxxxxxxx
>> www.PerformAir.com
>>
>>
>> -----Original Message-----
>> From: Marcel Kuiper [mailto:ceph@xxxxxxxx]
>> Sent: Tuesday, July 21, 2020 9:38 AM
>> To: ceph-users@xxxxxxx
>> Subject: Re: osd out vs crush reweight
>>
>> Hi Dominic,
>>
>> This cluster is running 14.2.8 (Nautilus). There are 172 OSDs divided
>> over 19 nodes.
>> There are currently 10 pools.
>> All pools have 3 replicas of data.
>> There are 3968 PGs (the cluster is not yet fully in use; the number
>> of PGs is expected to grow).
>>
>> Marcel
>>
>>> Marcel;
>>>
>>> Short answer: yes, it might be expected behavior.
>>>
>>> PG placement is highly dependent on the cluster layout, and CRUSH
>>> rules. So... Some clarifying questions.
>>>
>>> What version of Ceph are you running?
>>> How many nodes do you have?
>>> How many pools do you have, and what are their failure domains?
>>>
>>> Thank you,
>>>
>>> Dominic L. Hilsbos, MBA
>>> Director - Information Technology
>>> Perform Air International, Inc.
>>> DHilsbos@xxxxxxxxxxxxxx
>>> www.PerformAir.com
>>>
>>>
>>> -----Original Message-----
>>> From: Marcel Kuiper [mailto:ceph@xxxxxxxx]
>>> Sent: Tuesday, July 21, 2020 6:52 AM
>>> To: ceph-users@xxxxxxx
>>> Subject: osd out vs crush reweight
>>>
>>> Hi list,
>>>
>>> I ran a test with marking an OSD out versus setting its crush weight
>>> to 0, and compared which OSDs the PGs were sent to. The crush map has
>>> 3 rooms. This is what happened.
>>>
>>> On 'ceph osd out 111' (first room; this node has OSDs 108 - 116), PGs
>>> were sent to the following OSDs:
>>>
>>> NR PGs  OSD
>>> 2       1
>>> 1       4
>>> 1       5
>>> 1       6
>>> 1       7
>>> 2       8
>>> 1       31
>>> 1       34
>>> 1       35
>>> 1       56
>>> 2       57
>>> 1       58
>>> 1       61
>>> 1       83
>>> 1       84
>>> 1       88
>>> 1       99
>>> 1       100
>>> 2       107
>>> 1       114
>>> 2       117
>>> 1       118
>>> 1       119
>>> 1       121
>>>
>>> All PGs were sent to OSDs on other nodes in the same room, except
>>> for 1 PG on osd.114. I think this works as expected.
>>>
>>> Now I marked the OSD in again and waited until everything stabilized.
>>> Then I set the crush weight to 0: 'ceph osd crush reweight osd.111 0'.
>>> I thought this also lowers the crush weight of the node, so there
>>> would be even less chance that PGs end up on an OSD of the same node.
>>> However, the results are:
>>>
>>> NR PGs  OSD
>>> 1       61
>>> 1       83
>>> 1       86
>>> 3       108
>>> 4       109
>>> 5       110
>>> 2       112
>>> 5       113
>>> 7       114
>>> 5       115
>>> 2       116
>>>
>>> Except for 3 PGs, all other PGs ended up on an OSD belonging to the
>>> same node :-O. Is this expected behaviour? Can someone explain? This
>>> is on Nautilus 14.2.8.
>>>
>>> Thanks
>>>
>>> Marcel
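
PS: in case it helps anyone reproduce this without touching a cluster,
this is roughly how I would compare the two cases offline with crushtool
(a sketch from memory, using osd.111 from the test above; double-check
the flags against the crushtool man page):

    # export the cluster's compiled crush map
    ceph osd getcrushmap -o crushmap.bin

    # baseline mappings for rule 1 with 3 replicas
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 \
        --show-mappings > baseline.txt

    # approximate 'ceph osd out 111': keep the crush weight, but override
    # the reweight of osd 111 to 0 for the simulation only
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 \
        --weight 111 0 --show-mappings > out.txt

    # approximate 'ceph osd crush reweight osd.111 0': write a new map
    # with the crush weight of osd.111 set to 0 and test that
    crushtool -i crushmap.bin --reweight-item osd.111 0 -o crushmap-zero.bin
    crushtool -i crushmap-zero.bin --test --rule 1 --num-rep 3 \
        --show-mappings > reweight.txt

    # see which PGs map differently in each case
    diff baseline.txt out.txt
    diff baseline.txt reweight.txt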