Re: Rebalance/Backfill Throttling - anything missing here?

Thanks a lot Robert.

I have actually already tried the following:

a) set one OSD to out (6% of data misplaced, Ceph recovered fine), stopped the OSD, removed the OSD from the crush map (again 36% of data misplaced !!!) - then inserted the OSD back into the crush map, and those 36% misplaced objects disappeared, of course - I've undone the crush remove...
so the damage is undone - the OSD is just "out" and the cluster is healthy again.


b) set norecover and nobackfill, and then:
    - Removed one OSD from the crush map (a running OSD, not the one from point a) - only 18% of data misplaced !!! (no recovery was actually happening, because of norecover/nobackfill)
    - Removed another OSD from the same node - a total of only 20% of objects misplaced (with 2 OSDs on the same node removed from the crush map)
    - So these 2 OSDs were still running, UP and IN, and I just removed them from the crush map, per the advice to avoid calculating the crush map twice - from: http://image.slidesharecdn.com/scalingcephatcern-140311134847-phpapp01/95/scaling-ceph-at-cern-ceph-day-frankfurt-19-638.jpg?cb=1394564547
- And I added these 2 OSDs back to the crush map - this was just a test...
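
For the record, a rough sketch of the command sequence for test b) - the osd IDs, weights and host names here are just placeholders, not my real ones:

  ceph osd set norecover
  ceph osd set nobackfill
  ceph osd crush remove osd.10        # first OSD out of the crush map -> ~18% misplaced
  ceph osd crush remove osd.11        # second OSD, same node -> ~20% misplaced in total
  # undoing the test: re-add each OSD with its original weight and host
  ceph osd crush add osd.10 <weight> host=<nodename>
  ceph osd crush add osd.11 <weight> host=<nodename>
  ceph osd unset norecover
  ceph osd unset nobackfill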

So the algorithm is quite funny in some aspects... but it's all pseudo-random placement, so I kind of understand...

I will share my findings during the rest of the OSD demotions, after I demote them...

Thanks for your detailed input!
Andrija


On 5 March 2015 at 22:51, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
Setting an OSD out will start the rebalance with the degraded object count. The OSD is still alive and can participate in the relocation of the objects. This is preferable so that you don't drop below min_size if a disk fails during the rebalance, at which point I/O stops on the cluster.
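
A quick way to sanity-check the replica settings before touching anything (the pool name is just an example):

  ceph osd pool get <poolname> size
  ceph osd pool get <poolname> min_size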

Because CRUSH is an algorithm, anything that changes its input will cause a change in the output (location). When you set an OSD out or it fails, that changes the CRUSH input, but the host and the weight of the host are still in effect. When you remove the host or change the weight of the host (by removing a single OSD), that is a further change to the input, which will also cause some changes in how it computes the locations.
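
You can see the current host weights (each host's weight being the sum of its OSDs' weights) with:

  ceph osd tree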

Disclaimer - I have not tried this

It may be possible to minimize the data movement by doing the following:
  1. set norecover and nobackfill on the cluster
  2. Set the OSDs to be removed to "out"
  3. Adjust the weight of the hosts in the CRUSH map (if removing all OSDs from a host, set its weight to zero)
  4. If you have new OSDs to add, add them into the cluster now
  5. Once all OSD changes have been entered, unset norecover and nobackfill
  6. This will migrate the data off the old OSDs and onto the new OSDs in one fell swoop.
  7. Once the data migration is complete, set norecover and nobackfill on the cluster again.
  8. Remove the old OSDs
  9. Unset norecover and nobackfill
The theory is that by setting the host weights to 0, removing the OSDs/hosts later should cause minimal additional data movement, because the algorithm will already have dropped them as candidates for placement.
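
Roughly, as untested commands (the osd IDs are placeholders - same disclaimer as above):

  # 1. freeze recovery/backfill
  ceph osd set norecover
  ceph osd set nobackfill

  # 2. mark the OSDs to be removed as out
  ceph osd out 10
  ceph osd out 11

  # 3. drive the host weight to zero by zeroing its OSDs' CRUSH weights
  ceph osd crush reweight osd.10 0
  ceph osd crush reweight osd.11 0

  # 4. add any new OSDs now, using your usual procedure

  # 5./6. let everything migrate in one pass
  ceph osd unset norecover
  ceph osd unset nobackfill

  # 7./8. once the cluster is healthy again, freeze and remove the old OSDs
  ceph osd set norecover
  ceph osd set nobackfill
  ceph osd crush remove osd.10
  ceph osd crush remove osd.11
  ceph auth del osd.10
  ceph osd rm 10
  ceph auth del osd.11
  ceph osd rm 11

  # 9. final unset
  ceph osd unset norecover
  ceph osd unset nobackfill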

If this works right, then you basically queue up a bunch of small changes, do one data movement, always keep all copies of your objects online and minimize the impact of the data movement by leveraging both your old and new hardware at the same time.

If you try this, please report back on your experience. I might try it in my lab, but I'm really busy at the moment, so I don't know if I'll get to it soon.

On Thu, Mar 5, 2015 at 12:53 PM, Andrija Panic <andrija.panic@xxxxxxxxx> wrote:
Hi Robert,

it seems I have not listened well to your advice - I set the OSD to out instead of stopping it - and now instead of some ~3% degraded objects there are 0.000% degraded and around 6% misplaced - and rebalancing is happening again, but this is a small percentage..

Do you know if, later when I remove this OSD from the crush map, no more data will be rebalanced (as per the official Ceph documentation) - since the already misplaced objects are getting redistributed to all the other nodes?

(after "service ceph stop osd.0" there was 2.45% degraded data - but no backfilling was happening for some reason... it just stayed degraded... which is why I started the OSD back up and then set it to out...)
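
In hindsight, whatever was blocking the backfill should have been visible via:

  ceph -s
  ceph health detail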

Thanks

On 4 March 2015 at 17:54, Andrija Panic <andrija.panic@xxxxxxxxx> wrote:
Hi Robert,

I already have this stuff set. Ceph is 0.87.0 now...

Thanks, will schedule this for the weekend; with the 10G network and 36 OSDs it should move the data in less than 8h, per my last experience, which was around 8h, but that included some OSDs on 1G...

Thx!

On 4 March 2015 at 17:49, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
You will most likely have a very high relocation percentage. Backfills
are always more impactful on smaller clusters, but "osd max backfills"
should be what you need to help reduce the impact. The default is 10;
you will want to use 1.
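
E.g., at runtime and persistently (a sketch):

  ceph tell osd.* injectargs '--osd_max_backfills 1'

and in ceph.conf under [osd], so it survives restarts:

  osd max backfills = 1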

I didn't catch which version of Ceph you are running, but I think
there was some priority work done in firefly to help make backfills
lower priority. I think it has gotten better in later versions.

On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic <andrija.panic@xxxxxxxxx> wrote:
> Thank you Robert - I'm wondering, when I do remove the total of 7 OSDs from the crush
> map - whether that will cause more than 37% of data to be moved (80% or whatever)
>
> I'm also wondering if the throttling that I applied is fine or not - I will
> introduce osd_recovery_delay_start 10 sec as Irek said.
>
> I'm just wondering how much the performance impact will be, because:
> - when stopping an OSD, the impact while backfilling was more or less fine - I
> can live with this
> - when I removed an OSD from the crush map - for the first 1h or so the impact
> was tremendous, and later on during the recovery process the impact was much
> less, but still noticeable...
>
> Thanks for the tip of course !
> Andrija
>
> On 3 March 2015 at 18:34, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>
>> I would be inclined to shut down both OSDs in a node, let the cluster
>> recover. Once it is recovered, shut down the next two, let it recover.
>> Repeat until all the OSDs are taken out of the cluster. Then I would
>> set nobackfill and norecover. Then remove the hosts/disks from the
>> CRUSH map, then unset nobackfill and norecover.
>>
>> That should give you a few small changes (when you shut down OSDs) and
>> then one big one to get everything in the final place. If you are
>> still adding new nodes, then while nobackfill and norecover are set, you can
>> add them in so that the one big relocate fills the new drives too.
>>
>> On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic <andrija.panic@xxxxxxxxx>
>> wrote:
>> > Thx Irek. Number of replicas is 3.
>> >
>> > I have 3 servers with 2 OSDs each on a 1G switch (1 OSD already
>> > decommissioned), which is further connected to a new 10G switch/network
>> > with 3 servers on it with 12 OSDs each.
>> > I'm decommissioning the old 3 nodes on the 1G network...
>> >
>> > So you suggest manually removing the whole node with its 2 OSDs from the crush map?
>> > To my knowledge, Ceph never places 2 replicas on 1 node; all 3 replicas
>> > were originally distributed over all 3 nodes. So it should anyway be
>> > safe to remove 2 OSDs at once together with the node itself... since the
>> > replica count is 3...?
>> >
>> > Thx again for your time
>> >
>> > On Mar 3, 2015 1:35 PM, "Irek Fasikhov" <malmyzh@xxxxxxxxx> wrote:
>> >>
>> >> Since you have only three nodes in the cluster,
>> >> I recommend you add the new nodes to the cluster first, and then delete the old ones.
>> >>
>> >> 2015-03-03 15:28 GMT+03:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:
>> >>>
>> >>> What is your replication count?
>> >>>
>> >>> 2015-03-03 15:14 GMT+03:00 Andrija Panic <andrija.panic@xxxxxxxxx>:
>> >>>>
>> >>>> Hi Irek,
>> >>>>
>> >>>> yes, stopping the OSD (or setting it to OUT) resulted in only 3% of data
>> >>>> degraded and moved/recovered.
>> >>>> When I afterwards removed it from the crush map with "ceph osd crush rm id",
>> >>>> that's when the 37% thing happened.
>> >>>>
>> >>>> And thanks for the help, Irek - could you kindly let me know the
>> >>>> preferred steps when removing a whole node?
>> >>>> Do you mean I first stop all the OSDs again, or just remove each OSD from
>> >>>> the crush map, or perhaps just decompile the crush map, delete the node
>> >>>> completely, compile it back in, and let it heal/recover?
>> >>>>
>> >>>> Do you think this would result in less data misplaced and moved
>> >>>> around?
>> >>>>
>> >>>> Sorry for bugging you, I really appreciate your help.
>> >>>>
>> >>>> Thanks
>> >>>>
>> >>>> On 3 March 2015 at 12:58, Irek Fasikhov <malmyzh@xxxxxxxxx> wrote:
>> >>>>>
>> >>>>> A large percentage comes from the rebuild of the cluster map (but a low
>> >>>>> percentage of degradation). If you had not run "ceph osd crush rm id",
>> >>>>> the percentage would have stayed low.
>> >>>>> In your case, the correct option is to remove the entire node,
>> >>>>> rather than each disk individually.
>> >>>>>
>> >>>>> 2015-03-03 14:27 GMT+03:00 Andrija Panic <andrija.panic@xxxxxxxxx>:
>> >>>>>>
>> >>>>>> Another question - I mentioned here 37% of objects being moved
>> >>>>>> around - these are MISPLACED objects (degraded objects were 0.001%)
>> >>>>>> after I removed 1 OSD from the crush map (out of 44 OSDs or so).
>> >>>>>>
>> >>>>>> Can anybody confirm this is normal behaviour - and are there any
>> >>>>>> workarounds?
>> >>>>>>
>> >>>>>> I understand this is because of Ceph's object placement algorithm,
>> >>>>>> but still, 37% of objects misplaced just by removing 1 OSD out of 44
>> >>>>>> from the crush map makes me wonder why the percentage is so large?
>> >>>>>>
>> >>>>>> That seems not good to me, and I have to remove another 7 OSDs (we are
>> >>>>>> demoting some old hardware nodes). This means I could potentially face
>> >>>>>> 7 x the same number of misplaced objects...?
>> >>>>>>
>> >>>>>> Any thoughts ?
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> On 3 March 2015 at 12:14, Andrija Panic <andrija.panic@xxxxxxxxx>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> Thanks Irek.
>> >>>>>>>
>> >>>>>>> Does this mean that after peering, there will be a delay of 10 sec
>> >>>>>>> for each PG, meaning that every once in a while I will have 10 sec of
>> >>>>>>> the cluster NOT being stressed/overloaded, then the recovery takes
>> >>>>>>> place for that PG, then for another 10 sec the cluster is fine, and
>> >>>>>>> then it is stressed again?
>> >>>>>>>
>> >>>>>>> I'm trying to understand the process before actually doing stuff
>> >>>>>>> (the config reference is there on ceph.com, but I don't fully
>> >>>>>>> understand the process)
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Andrija
>> >>>>>>>
>> >>>>>>> On 3 March 2015 at 11:32, Irek Fasikhov <malmyzh@xxxxxxxxx> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi.
>> >>>>>>>>
>> >>>>>>>> Use the "osd_recovery_delay_start" option
>> >>>>>>>> example:
>> >>>>>>>> [root@ceph08 ceph]# ceph --admin-daemon
>> >>>>>>>> /var/run/ceph/ceph-osd.94.asok config show  | grep
>> >>>>>>>> osd_recovery_delay_start
>> >>>>>>>>   "osd_recovery_delay_start": "10"
>> >>>>>>>>
>> >>>>>>>> 2015-03-03 13:13 GMT+03:00 Andrija Panic
>> >>>>>>>> <andrija.panic@xxxxxxxxx>:
>> >>>>>>>>>
>> >>>>>>>>> HI Guys,
>> >>>>>>>>>
>> >>>>>>>>> Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and it
>> >>>>>>>>> caused over 37% of the data to rebalance - let's say this is
>> >>>>>>>>> fine (this was when I removed it from the Crush Map).
>> >>>>>>>>>
>> >>>>>>>>> I'm wondering - I had previously set some throttling mechanisms,
>> >>>>>>>>> but during the first 1h of rebalancing my recovery rate was going
>> >>>>>>>>> up to 1500 MB/s - and the VMs were completely unusable - and then
>> >>>>>>>>> for the last 4h of the recovery the rate went down to, say,
>> >>>>>>>>> 100-200 MB/s, and during this time VM performance was still pretty
>> >>>>>>>>> impacted, but at least I could work more or less.
>> >>>>>>>>>
>> >>>>>>>>> So my question: is this behaviour expected, and is the throttling
>> >>>>>>>>> working as expected here? Judging by the 1500 MB/s recovery rate
>> >>>>>>>>> and the impact on the VMs, almost no throttling was applied during
>> >>>>>>>>> the first 1h, while the last 4h seemed pretty fine (although there
>> >>>>>>>>> was still a lot of impact in general).
>> >>>>>>>>>
>> >>>>>>>>> I changed these throttling settings on the fly with:
>> >>>>>>>>>
>> >>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
>> >>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
>> >>>>>>>>> ceph tell osd.* injectargs '--osd_max_backfills 1'
>> >>>>>>>>>
>> >>>>>>>>> My journals are on SSDs (12 OSDs per server, of which 6 journals
>> >>>>>>>>> are on one SSD and 6 on another SSD) - I have 3 of these hosts.
>> >>>>>>>>>
>> >>>>>>>>> Any thought are welcome.
>> >>>>>>>>> --
>> >>>>>>>>>
>> >>>>>>>>> Andrija Panić
>> >>>>>>>>>
>> >>>>>>>>> _______________________________________________
>> >>>>>>>>> ceph-users mailing list
>> >>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> Best regards, Irek Fasikhov
>> >>>>>>>> Mob.: +79229045757
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>>
>> >>>>>>> Andrija Panić
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>>
>> >>>>>> Andrija Panić
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Best regards, Irek Fasikhov
>> >>>>> Mob.: +79229045757
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>>
>> >>>> Andrija Panić
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Best regards, Irek Fasikhov
>> >>> Mob.: +79229045757
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best regards, Irek Fasikhov
>> >> Mob.: +79229045757
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
>
>
> --
>
> Andrija Panić



--

Andrija Panić



--

Andrija Panić




--

Andrija Panić
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
