On Thu, Jan 9, 2014 at 6:27 AM, Dan Van Der Ster <daniel.vanderster@xxxxxxx> wrote:
> Here’s a more direct question. Given this osd tree:
>
> # ceph osd tree |head
> # id    weight  type name       up/down reweight
> -1      2952    root default
> -2      2952            room 0513-R-0050
> -3      262.1                   rack RJ35
> ...
> -14     135.8                   rack RJ57
> -51     0                               host p05151113781242
> -52     5.46                            host p05151113782262
> 1036    2.73                                    osd.1036        DNE
> 1037    2.73                                    osd.1037        DNE
> ...
>
> If I do
>
>     ceph osd crush rm osd.1036
>
> or even
>
>     ceph osd crush reweight osd.1036 2.5
>
> it is going to result in some backfilling. Why?

Yeah, this (and the more specific case you saw with removing OSDs) is just an unfortunate consequence of CRUSH's hierarchical weights. When you reweight or remove an OSD, you change the weight of every bucket that contains it (host, rack, room, etc.). That slightly changes the calculated data placements.

Marking an OSD out does not change the containing bucket weights. We could change that, but it has a bunch of fiddly consequences elsewhere (removing OSDs becomes a less local recovery operation; if you replace a drive you still have to go through non-local recovery; etc.), and we haven't yet come up with a UX that we actually like around this workflow, so the existing behavior wins by default.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

> Cheers, Dan
>
> On 09 Jan 2014, at 12:11, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
>
>> Hi,
>> I’m slightly confused about one thing we are observing at the moment. We’re testing the shutdown/removal of OSD servers and noticed twice as much backfilling as expected. This is what we did:
>>
>> 1. service ceph stop on some OSD servers.
>> 2. ceph osd out for the above OSDs (to avoid waiting for the down-to-out timeout)
>> — at this point, backfilling begins and finishes successfully after some time.
>> 3. ceph osd rm all of the above OSDs (leaves the OSDs in the crush table, marked DNE)
>> 4.
>> ceph osd crush rm for each of the above OSDs
>> — step 4 triggers another rebalancing!! despite there not being any data on those OSDs and all PGs being previously healthy.
>>
>> Is this expected? Is there a way to avoid the 2nd rebalance?
>>
>> Best Regards,
>> Dan van der Ster
>> CERN IT
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
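The hierarchical-weight effect Greg describes can be sketched with a toy model. This is a simplified straw2-style draw, not Ceph's actual CRUSH implementation, and the host names and weights are invented for illustration: three hosts each carry two 2.73-weight OSDs, and `ceph osd crush rm` on one OSD halves its host bucket's weight even though that OSD was out and held no data.

```python
import hashlib
import math


def uniform(pg: int, item: str) -> float:
    """Deterministic pseudo-random value in (0, 1) derived from (pg, item)."""
    h = hashlib.sha256(f"{pg}:{item}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)


def straw2_choose(pg: int, weights: dict) -> str:
    """Straw2-style bucket selection: each item draws ln(u)/w; the largest
    (least negative) draw wins, so heavier items win proportionally more."""
    return max((item for item in weights if weights[item] > 0),
               key=lambda item: math.log(uniform(pg, item)) / weights[item])


# Hypothetical hosts; each bucket weight is the sum of its OSDs' crush weights.
before = {"hostA": 5.46, "hostB": 5.46, "hostC": 5.46}
# Removing one 2.73-weight OSD from hostB drops its bucket weight to 2.73.
after = dict(before, hostB=2.73)

moved = sum(straw2_choose(pg, before) != straw2_choose(pg, after)
            for pg in range(10000))
print(f"{moved} of 10000 PGs now map to a different host")
```

In this model only PGs that previously mapped to the shrunken bucket move (a straw2 property, since the other hosts' draws are unchanged), but a sizeable fraction does move, which is the backfill seen in step 4 above despite those OSDs holding no data.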