I would be inclined to shut down both OSDs in one node and let the cluster recover. Once it has recovered, shut down the next two and let it recover again. Repeat until all of the old OSDs are out of the cluster. Then set nobackfill and norecover, remove the hosts/disks from the CRUSH map, and finally unset nobackfill and norecover. That should give you a few small changes (when you shut down the OSDs) and then one big one to get everything into its final place.

If you are still adding new nodes, you can add them in while nobackfill and norecover are set, so that the one big relocation fills the new drives too.
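Roughly, assuming the two OSDs on the first old host are osd.0 and osd.1 and its host bucket is called node1 (placeholder names; adjust the IDs, hostnames and service syntax to your release, as I have not tested this exact sequence), it would look something like:

# per old node: stop its two OSDs, then wait until everything is active+clean again
service ceph stop osd.0      # on Ubuntu/upstart: stop ceph-osd id=0
service ceph stop osd.1
ceph -w

# once all the old OSDs are down and the cluster has recovered:
ceph osd set nobackfill
ceph osd set norecover

# drop the old disks and hosts out of the CRUSH map (repeat per OSD and per host)
ceph osd crush remove osd.0
ceph osd crush remove osd.1
ceph osd crush remove node1

# let the one big remap run
ceph osd unset nobackfill
ceph osd unset norecover

Repeat the stop-and-wait step per node, and the crush remove lines for each OSD and host.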
On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic <andrija.panic@xxxxxxxxx> wrote:
> Thanks Irek. The number of replicas is 3.
>
> I have 3 servers with 2 OSDs each on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers on it with 12 OSDs each.
> I'm decommissioning the 3 old nodes on the 1G network...
>
> So you suggest removing the whole node with its 2 OSDs manually from the crush map?
> To my knowledge, Ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So it should anyway be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3...?
>
> Thanks again for your time.
>
> On Mar 3, 2015 1:35 PM, "Irek Fasikhov" <malmyzh@xxxxxxxxx> wrote:
>>
>> You only have three nodes in the cluster, so I recommend you add the new nodes to the cluster first and then delete the old ones.
>>
>> 2015-03-03 15:28 GMT+03:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:
>>>
>>> What replication count do you have?
>>>
>>> 2015-03-03 15:14 GMT+03:00 Andrija Panic <andrija.panic@xxxxxxxxx>:
>>>>
>>>> Hi Irek,
>>>>
>>>> yes, stopping the OSD (or setting it to out) resulted in only 3% of data degraded and moved/recovered.
>>>> When I afterwards removed it from the crush map with "ceph osd crush rm id", that's when the 37% happened.
>>>>
>>>> And thanks for the help, Irek - could you kindly let me know the preferred steps when removing a whole node?
>>>> Do you mean I first stop all the OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let the cluster heal/recover?
>>>>
>>>> Do you think this would result in less data being misplaced and moved around?
>>>>
>>>> Sorry for bugging you, I really appreciate your help.
>>>>
>>>> Thanks
>>>>
>>>> On 3 March 2015 at 12:58, Irek Fasikhov <malmyzh@xxxxxxxxx> wrote:
>>>>>
>>>>> The large percentage comes from the rebuild of the cluster map (but the degradation percentage is low). If you had not run "ceph osd crush rm id", the percentage would have been low.
>>>>> In your case, the correct option is to remove the entire node rather than each disk individually.
>>>>>
>>>>> 2015-03-03 14:27 GMT+03:00 Andrija Panic <andrija.panic@xxxxxxxxx>:
>>>>>>
>>>>>> Another question - I mentioned 37% of objects being moved around here; these are MISPLACED objects (degraded objects were 0.001%) after I removed 1 OSD from the crush map (out of 44 OSDs or so).
>>>>>>
>>>>>> Can anybody confirm this is normal behaviour, and are there any workarounds?
>>>>>>
>>>>>> I understand this is because of CEPH's object placement algorithm, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the crush map makes me wonder why the percentage is so large.
>>>>>>
>>>>>> It seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I could potentially end up with 7x the same number of misplaced objects...?
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 3 March 2015 at 12:14, Andrija Panic <andrija.panic@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Thanks Irek.
>>>>>>>
>>>>>>> Does this mean that after peering, for each PG there will be a delay of 10 sec, meaning that every once in a while I will have 10 sec of the cluster NOT being stressed/overloaded, then recovery takes place for that PG, then the cluster is fine for another 10 sec, and then stressed again?
>>>>>>>
>>>>>>> I'm trying to understand the process before actually doing anything (the config reference is there on ceph.com, but I don't fully understand the process).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrija
>>>>>>>
>>>>>>> On 3 March 2015 at 11:32, Irek Fasikhov <malmyzh@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi.
>>>>>>>>
>>>>>>>> Use the "osd_recovery_delay_start" value, for example:
>>>>>>>> [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start
>>>>>>>>   "osd_recovery_delay_start": "10"
>>>>>>>>
>>>>>>>> 2015-03-03 13:13 GMT+03:00 Andrija Panic <andrija.panic@xxxxxxxxx>:
>>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and it caused over 37% of the data to rebalance - let's say this is fine (this happened when I removed it from the crush map).
>>>>>>>>>
>>>>>>>>> I'm wondering - I had previously set up some throttling, but during the first hour of rebalancing my recovery rate went up to 1500 MB/s and the VMs were completely unusable; during the last 4 hours of the recovery the rate dropped to, say, 100-200 MB/s, and VM performance was still pretty impacted, but at least I could more or less work.
>>>>>>>>>
>>>>>>>>> So my question: is this behaviour expected, and is the throttling here working as expected? During the first hour almost no throttling seemed to be applied, judging by the 1500 MB/s recovery rate and the impact on the VMs, while the last 4 hours seemed pretty fine (although there was still a lot of impact in general).
>>>>>>>>>
>>>>>>>>> I changed the throttling on the fly with:
>>>>>>>>>
>>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
>>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
>>>>>>>>> ceph tell osd.* injectargs '--osd_max_backfills 1'
>>>>>>>>>
>>>>>>>>> My journals are on SSDs (12 OSDs per server, with 6 journals on one SSD and 6 on another) - I have 3 of these hosts.
>>>>>>>>>
>>>>>>>>> Any thoughts are welcome.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Andrija Panić
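PS: for the crushtool route mentioned further up (decompile the crush map, delete the node, compile it back in), the usual sequence is roughly the following; the file names are just placeholders, and I would still wrap the change in nobackfill/norecover while editing:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: delete the old host buckets and their devices,
# and remove them from the root bucket they hang under
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

Since the whole node disappears from the map in a single step, you should see one large remap rather than one per OSD.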
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com