Re: Ceph cluster not recovering after OSD down

Andres Rojas Guerrero <a.rojas@xxxxxxx> · Thu, 6 May 2021 14:03:08 +0200

I get this error when I try to show the mappings with crushtool:

# crushtool -i crush_map_new --test --rule 2 --num-rep 7 --show-mappings
CRUSH rule 2 x 0 [-5,-45,-49,-47,-43,-41,-29]
*** Caught signal (Segmentation fault) **
 in thread 7f7f7a0ccb40 thread_name:crushtool
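
In CRUSH, negative IDs denote buckets (hosts, racks, the root) and non-negative IDs denote OSDs, so the mapping printed above consists of buckets only. A minimal Python sketch for flagging such mappings, assuming the `CRUSH rule <n> x <x> [...]` line format shown above; the function names are hypothetical:

```python
import re

# Matches lines such as: CRUSH rule 2 x 0 [-5,-45,-49,-47,-43,-41,-29]
MAPPING_RE = re.compile(r"^CRUSH rule (\d+) x (\d+) \[([^\]]*)\]")


def parse_mappings(text):
    """Return a list of (rule, x, ids) parsed from --show-mappings output."""
    mappings = []
    for line in text.splitlines():
        match = MAPPING_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a mapping line
        rule, x, body = match.groups()
        ids = [int(tok) for tok in body.split(",") if tok.strip()]
        mappings.append((int(rule), int(x), ids))
    return mappings


def mappings_with_buckets(mappings):
    """Return the mappings that contain at least one negative (bucket) ID."""
    return [m for m in mappings if any(i < 0 for i in m[2])]
```

Any mapping returned by `mappings_with_buckets` did not resolve all the way down to OSDs and deserves a closer look before the map is injected.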

On 6/5/21 at 13:47, Eugen Block wrote:
> Yes it is possible, but you should validate it with crushtool before
> injecting it to make sure the PGs land where they belong.
> 
> crushtool -i crushmap.bin --test --rule 2 --num-rep 7 --show-mappings
> crushtool -i crushmap.bin --test --rule 2 --num-rep 7 --show-bad-mappings
> 
> If you don't get bad mappings and the 'show-mappings' output confirms the
> PG distribution by host, you can inject it. But be aware that this will
> cause a lot of data movement, which could explain the (temporarily)
> unavailable PGs. To make your cluster resilient against host failure,
> you'll have to go through that at some point.
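
The host-distribution check described above can be sketched in Python. This assumes the mappings have already been parsed into lists of OSD IDs and that an OSD-to-host table is available (for example, built by hand from `ceph osd tree`); `osd_to_host`, the host names, and the function name are hypothetical:

```python
from collections import Counter


def hosts_used_more_than_once(osd_ids, osd_to_host):
    """Return {host: count} for hosts holding more than one OSD of a mapping.

    An empty result means every OSD in the mapping sits on a different host.
    OSDs missing from osd_to_host raise KeyError, so gaps are not hidden.
    """
    counts = Counter(osd_to_host[osd] for osd in osd_ids)
    return {host: n for host, n in counts.items() if n > 1}


# Made-up layout for illustration only.
osd_to_host = {0: "host-a", 1: "host-a", 2: "host-b", 3: "host-c"}
print(hosts_used_more_than_once([0, 2, 3], osd_to_host))  # {}
print(hosts_used_more_than_once([0, 1, 3], osd_to_host))  # {'host-a': 2}
```

With a host failure domain, every mapping should produce an empty result.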
> 
> 
> https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/
> 
> 
> Quoting Andres Rojas Guerrero <a.rojas@xxxxxxx>:
> 
>> Hi, I am trying to create a new CRUSH rule (Nautilus) in order to set
>> the correct failure domain, host:
>>
>>         "rule_id": 2,
>>         "rule_name": "nxtcloudAFhost",
>>         "ruleset": 2,
>>         "type": 3,
>>         "min_size": 3,
>>         "max_size": 7,
>>         "steps": [
>>             {
>>                 "op": "set_chooseleaf_tries",
>>                 "num": 5
>>             },
>>             {
>>                 "op": "set_choose_tries",
>>                 "num": 100
>>             },
>>             {
>>                 "op": "take",
>>                 "item": -1,
>>                 "item_name": "default"
>>             },
>>             {
>>                 "op": "choose_indep",
>>                 "num": 0,
>>                 "type": "host"
>>             },
>>             {
>>                 "op": "emit"
>>
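
A `choose_*` step selects buckets of the given type, while a `chooseleaf_*` step additionally descends to an OSD under each selected bucket. A rough Python sketch, assuming the rule is available as a dict in the JSON layout shown above (as printed by `ceph osd crush rule dump`), that reports `choose_*` steps of a non-OSD type followed directly by `emit`. The function name is hypothetical, and whether this explains the behaviour of the rule above is an assumption to confirm with crushtool:

```python
def steps_emitting_buckets(rule):
    """Return indexes of choose_* steps whose buckets are emitted as-is."""
    steps = rule.get("steps", [])
    flagged = []
    for i, step in enumerate(steps[:-1]):
        op = step.get("op", "")
        is_plain_choose = op in ("choose_firstn", "choose_indep")
        # Assumption: the leaf type is named "osd" in this CRUSH map.
        picks_buckets = step.get("type") != "osd"
        next_is_emit = steps[i + 1].get("op") == "emit"
        if is_plain_choose and picks_buckets and next_is_emit:
            flagged.append(i)
    return flagged
```

A flagged step is a candidate for `chooseleaf_indep` (or `chooseleaf_firstn` for replicated pools) if the intent is one OSD per selected bucket.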
>> And I have changed the pool to this new crush rule:
>>
>> # ceph osd pool set nxtcloudAF crush_rule nxtcloudAFhost
>>
>> But suddenly CephFS became unavailable:
>>
>> # ceph status
>>   cluster:
>>     id:     c74da5b8-3d1b-483e-8b3a-739134db6cf8
>>     health: HEALTH_WARN
>>             11 clients failing to respond to capability release
>>             2 MDSs report slow metadata IOs
>>             1 MDSs report slow requests
>>
>>
>> And clients are failing to respond:
>>
>> HEALTH_WARN 11 clients failing to respond to capability release; 2 MDSs
>> report slow metadata IOs; 1 MDSs report slow requests
>> MDS_CLIENT_LATE_RELEASE 11 clients failing to respond to capability
>> release
>>     mdsceph2mon03(mds.1): Client nxtcl3: failing to respond to
>> capability release client_id: 1524269
>>     mdsceph2mon01(mds.0): Client nxtcl5:nxtclproAF failing to respond to
>>
>>
>> I reverted the change, returning to the original CRUSH rule, and
>> everything is OK again. My question is whether it is possible to change
>> the CRUSH rule of an EC pool on the fly.
>>
>>
>> Thanks
>> On 5/5/21 at 18:14, Andres Rojas Guerrero wrote:
>>> Thanks, I will test it.
>>>
>>> On 5/5/21 at 16:37, Joachim Kraftmayer wrote:
>>>> Create a new crush rule with the correct failure domain, test it
>>>> properly and assign it to the pool(s).
>>>>
>>>
>>
>> -- 
>> *******************************************************
>> Andrés Rojas Guerrero
>> Unidad Sistemas Linux
>> Area Arquitectura Tecnológica
>> Secretaría General Adjunta de Informática
>> Consejo Superior de Investigaciones Científicas (CSIC)
>> Pinar 19
>> 28006 - Madrid
>> Tel: +34 915680059 -- Ext. 990059
>> email: a.rojas@xxxxxxx
>> ID comunicate.csic.es: @50852720l:matrix.csic.es
>> *******************************************************
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
> 
