Re: Ceph Rebalance Issue

Actually, I tried every approach I could find in the Ceph docs and on the mailing lists, but
none of them had any effect. As a last resort I changed pg/pgp.
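
For the record, the pg/pgp change was the usual pool setting, along these lines (the pool name here is just an example, mine differs):

   ceph osd pool set rbd pg_num 2048
   ceph osd pool set rbd pgp_num 2048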

Anyway… what would be the best way to solve this problem?

Thanks

> On Jul 3, 2016, at 1:43 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> 
> 
>> Op 3 juli 2016 om 11:02 schreef Roozbeh Shafiee <roozbeh.shafiee@xxxxxxxxx>:
>> 
>> 
>> Yes, you’re right, but recovery was at 0 objects/s last night. When I changed pg/pgp from 1400
>> to 2048, rebalancing sped up, but the rebalancing percentage dropped back to 53%.
>> 
> 
> Why did you change that? I would not change that value while a cluster is still in recovery.
> 
>> I have run into this situation again and again since I dropped the failed OSD: each time I increase pg/pgp,
>> rebalancing speeds up at first, but then stalls at 0 objects/s with a very low transfer rate.
>> 
> 
> Hard to judge at this point. You might want to try and restart osd.27 and see if that gets things going again. It seems to be involved in many PGs which are in 'backfilling' state.
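> 
> For example, depending on how the OSDs are managed on CentOS 7 (sysvinit script on Hammer, or systemd units on newer setups):
> 
>    service ceph restart osd.27
>    # or:
>    systemctl restart ceph-osd@27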
> 
> Wido
> 
>> Thanks
>> 
>>> On Jul 3, 2016, at 1:25 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> 
>>> 
>>>> Op 3 juli 2016 om 10:50 schreef Roozbeh Shafiee <roozbeh.shafiee@xxxxxxxxx>:
>>>> 
>>>> 
>>>> Thanks for quick response, Wido
>>>> 
>>>> the "ceph -s" output has pasted here:
>>>> http://pastie.org/10897747
>>>> 
>>>> and this is the output of “ceph health detail”:
>>>> http://pastebin.com/vMeURWC9
>>>> 
>>> 
>>> It seems the cluster is still backfilling PGs, and your 'ceph -s' output shows as much: 'recovery io 62375 kB/s, 15 objects/s'
>>> 
>>> It will just take some time before it finishes.
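>>> 
>>> In the meantime you can watch the recovery progress with, for example:
>>> 
>>>    ceph -w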
>>> 
>>> Wido
>>> 
>>>> Thank you
>>>> 
>>>>> On Jul 3, 2016, at 1:10 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>> 
>>>>> 
>>>>>> Op 3 juli 2016 om 10:34 schreef Roozbeh Shafiee <roozbeh.shafiee@xxxxxxxxx>:
>>>>>> 
>>>>>> 
>>>>>> Hi list,
>>>>>> 
>>>>>> A few days ago one of my OSDs failed and I dropped it from the cluster, but I have been stuck in
>>>>>> HEALTH_WARN ever since. After turning off the OSD, the self-healing system started
>>>>>> rebalancing data across the other OSDs.
>>>>>> 
>>>>>> My question is: the rebalancing process never completes, and I get this message at the
>>>>>> end of the “ceph -s” output:
>>>>>> 
>>>>>> recovery io 1456 KB/s, 0 object/s
>>>>>> 
>>>>> 
>>>>> Could you post the exact output of 'ceph -s'?
>>>>> 
>>>>> There is something more which needs to be shown.
>>>>> 
>>>>> 'ceph health detail' also might tell you more.
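>>>>> 
>>>>> To see which PGs are stuck, something like this can also help:
>>>>> 
>>>>>    ceph pg dump_stuck unclean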
>>>>> 
>>>>> Wido
>>>>> 
>>>>>> How can I get back to HEALTH_OK again?
>>>>>> 
>>>>>> My cluster details are:
>>>>>> 
>>>>>> - 27 OSDs
>>>>>> - 3 MONs
>>>>>> - 2048 pg/pgp
>>>>>> - Each OSD has 4 TB of space
>>>>>> - CentOS 7.2 with the 3.10 Linux kernel
>>>>>> - Ceph Hammer version
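>>>>>> 
>>>>>> (As a rough cross-check, the rule of thumb from the Ceph docs is total PGs ≈ (OSDs × 100) / replica size,
>>>>>> rounded up to a power of two; assuming the default size of 3, that gives (27 × 100) / 3 = 900, so 1024
>>>>>> as the nearest power of two.)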
>>>>>> 
>>>>>> Thank you,
>>>>>> Roozbeh
>>>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com