Problem with customized crush rule for EC pool

Right: I thought about data loss but what you're after is data availability. Thanks for explaining :-)

On 10/09/2014 04:29, Lei Dong wrote:
> Yes, my goal is to make it so that losing 3 OSDs does not lose data.
> 
> My 6 racks may not be in different rooms, but they use 6 different
> switches, so I want my data to remain accessible when any one switch is
> down or unreachable. I think it's not an unrealistic requirement.
> 
> 
> Thanks!
> 
> LeiDong.
> 
> On 9/9/14, 10:02 PM, "Loic Dachary" <loic at dachary.org> wrote:
> 
>>
>>
>> On 09/09/2014 14:21, Lei Dong wrote:
>>> Thanks Loic!
>>>
>>> Actually I've found that increasing choose_local_fallback_tries can
>>> help (chooseleaf_tries helps less noticeably), but I'm afraid that when
>>> an OSD failure happens and a new acting set has to be found, it may
>>> again fail to find enough racks. So I'm trying to find a more
>>> guaranteed way to handle OSD failure.
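>>> What I tried is roughly the following, placed before the take/choose
>>> steps of the rule (the values are only experiments, not tuned):
>>>
>>> step set_choose_local_fallback_tries 10
>>> step set_chooseleaf_tries 10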
>>>
>>> My profile is nothing special other than k=8 m=3.
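>>>
>>> For reference, it was created with essentially just the following (the
>>> profile name is arbitrary, everything else is left at the defaults):
>>>
>>> ceph osd erasure-code-profile set myprofile k=8 m=3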
>>
>> So your goal is to make it so that losing 3 OSDs simultaneously does not
>> mean losing data. By forcing each rack to hold at most 2 OSDs for a given
>> object, you make it so that losing a full rack does not mean losing data.
>> Are these racks in the same room in the datacenter? In the event of a
>> catastrophic failure that permanently destroys one rack, how realistic is
>> it that the other racks are unharmed? If the rack is destroyed by fire
>> and sits in a row with the other racks, there is a very high chance that
>> they will also be damaged. Note that I am not a system architect nor a
>> system administrator: I may be completely wrong ;-) If it turns out that
>> the probability of a single rack failing entirely and independently of
>> the others is negligible, it may not be necessary to write a complex
>> ruleset; the default ruleset may be enough.
>>
>> My 2cts
>>
>>>
>>> Thanks again!
>>>
>>> Leidong
>>>
>>>
>>>
>>>
>>>
>>>> On 9/9/2014, at 7:53 PM, "Loic Dachary" <loic at dachary.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It is indeed possible for the mapping to fail if there are just
>>>> enough racks to match the constraint. And the probability of a bad
>>>> mapping increases with the number of PGs, because more mappings are
>>>> needed. You can tell crush to try harder with
>>>>
>>>> step set_chooseleaf_tries 10
>>>>
>>>> Be careful though: increasing this number will change the mappings.
>>>> It will not just fix the bad mappings you're seeing, it will also
>>>> change mappings that succeeded with the lower value. Once you've set
>>>> this parameter, it cannot be modified.
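>>>>
>>>> In the rule you posted it would look something like this (just a
>>>> sketch, keeping your choose steps as they are):
>>>>
>>>> step set_chooseleaf_tries 10
>>>> step take default
>>>> step choose firstn 6 type rack
>>>> step chooseleaf indep 2 type osd
>>>> step emit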
>>>>
>>>> Would you mind sharing the erasure code profile you plan to work with?
>>>>
>>>> Cheers
>>>>
>>>>> On 09/09/2014 12:39, Lei Dong wrote:
>>>>> Hi ceph users:
>>>>>
>>>>> I want to create a customized crush rule for my EC pool (with
>>>>> replica_size = 11) to distribute replicas across 6 different racks.
>>>>>
>>>>> I used the following rule at first:
>>>>>
>>>>> step take default                 # root
>>>>> step choose firstn 6 type rack    # 6 racks; I have exactly 6 racks
>>>>> step chooseleaf indep 2 type osd  # 2 osds per rack
>>>>> step emit
>>>>>
>>>>> It looks fine and works fine when the PG num is small.
>>>>> But when the PG num increases, there are always some PGs which cannot
>>>>> take all 6 racks.
>>>>> It looks like "step choose firstn 6 type rack" sometimes returns only
>>>>> 5 racks.
>>>>> After some investigation, I think it may be caused by collisions
>>>>> between choices.
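>>>>>
>>>>> One way to see this offline is crushtool's test mode, for example
>>>>> (the crushmap file name and ruleset id here are placeholders):
>>>>>
>>>>> crushtool -i crushmap.bin --test --rule 1 --num-rep 11 --show-bad-mappings
>>>>>
>>>>> which lists the inputs that map to fewer than 11 distinct OSDs.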
>>>>>
>>>>> Then I came up with another solution to avoid collisions, like this:
>>>>>
>>>>> step take rack0
>>>>> step chooseleaf indep 2 type osd
>>>>> step emit
>>>>> step take rack1
>>>>> ...
>>>>> (manually take every rack)
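>>>>>
>>>>> Written out as a complete rule it would be roughly this (a sketch
>>>>> only; the rule name, ruleset id and min/max size are placeholders):
>>>>>
>>>>> rule ec_by_rack {
>>>>>         ruleset 1
>>>>>         type erasure
>>>>>         min_size 3
>>>>>         max_size 11
>>>>>         step take rack0
>>>>>         step chooseleaf indep 2 type osd
>>>>>         step emit
>>>>>         step take rack1
>>>>>         step chooseleaf indep 2 type osd
>>>>>         step emit
>>>>>         # ... and so on for rack2 through rack5
>>>>> }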
>>>>>
>>>>> This won't cause rack collisions, because I specify each rack by name
>>>>> up front. But the problem is that an OSD in rack0 will always be the
>>>>> primary OSD, because I choose from rack0 first.
>>>>>
>>>>> So the question is: what is the recommended way to meet such a need
>>>>> (distribute 11 replicas evenly across 6 racks to survive a rack failure)?
>>>>>
>>>>>
>>>>> Thanks!
>>>>> LeiDong
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users at lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> -- 
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
