Based on my understanding of CRUSH, it works down the hierarchy: at each level it picks buckets for the object pseudo-randomly (but deterministically for a given CRUSH map), according to the selection rule, and repeats this recursively until it ends up at the leaf nodes. Because you introduced a whole hierarchy level just below the top, objects are now distributed differently: the hash-based selection may now, for example, place an object that used to live under node-4 under FSN-DC16 instead. So when you change the hierarchy, you can generally expect lots of data movement everywhere downstream of the change. (A rough sketch of that selection logic is at the bottom of this mail.)

On Sun, 16 Jul 2023 at 06:03, Niklas Hambüchen <mail@xxxxxx> wrote:
> Hi Ceph users,
>
> I have a Ceph 16.2.7 cluster that so far has been replicated over the
> `host` failure domain.
> All `hosts` have been chosen to be in different `datacenter`s, so that was
> sufficient.
>
> Now I wish to add more hosts, including some in already-used data centers,
> so I'm planning to use CRUSH's `datacenter` failure domain instead.
>
> My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph
> decides that it should now rebalance the entire cluster.
> This seems unnecessary, and wrong.
>
> Before, `ceph osd tree` (some OSDs omitted for legibility):
>
> ID   CLASS  WEIGHT     TYPE NAME        STATUS  REWEIGHT  PRI-AFF
>  -1         440.73514  root default
>  -3         146.43625      host node-4
>   2    hdd   14.61089          osd.2        up   1.00000  1.00000
>   3    hdd   14.61089          osd.3        up   1.00000  1.00000
>  -7         146.43625      host node-5
>  14    hdd   14.61089          osd.14       up   1.00000  1.00000
>  15    hdd   14.61089          osd.15       up   1.00000  1.00000
> -10         146.43625      host node-6
>  26    hdd   14.61089          osd.26       up   1.00000  1.00000
>  27    hdd   14.61089          osd.27       up   1.00000  1.00000
>
> After assigning the `datacenter` CRUSH buckets:
>
> ID   CLASS  WEIGHT     TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
>  -1         440.73514  root default
> -18         146.43625      datacenter FSN-DC16
>  -7         146.43625          host node-5
>  14    hdd   14.61089              osd.14            up   1.00000  1.00000
>  15    hdd   14.61089              osd.15            up   1.00000  1.00000
> -17         146.43625      datacenter FSN-DC18
> -10         146.43625          host node-6
>  26    hdd   14.61089              osd.26            up   1.00000  1.00000
>  27    hdd   14.61089              osd.27            up   1.00000  1.00000
> -16         146.43625      datacenter FSN-DC4
>  -3         146.43625          host node-4
>   2    hdd   14.61089              osd.2             up   1.00000  1.00000
>   3    hdd   14.61089              osd.3             up   1.00000  1.00000
>
> This shows that the tree is essentially unchanged, it just "gained a
> level".
>
> In `ceph status` I now get:
>
>     pgs: 1167541260/1595506041 objects misplaced (73.177%)
>
> If I remove the `datacenter` level again, then the misplacement disappears.
>
> On a minimal testing cluster, this misplacement issue did not appear.
>
> Why does Ceph think that these objects are misplaced when I add the
> datacenter level?
> Is there a more correct way to do this?
>
> Thanks!
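As a postscript, here is the sketch mentioned above. It is a minimal toy illustration in Python, not Ceph's actual CRUSH/straw2 code: the hash and the "straw" weighting are simplified stand-ins for the real Jenkins-based implementation, and the bucket names and weights are just lifted from the tree above. The point it demonstrates is that every bucket choice is a deterministic function of (object, candidate bucket), so inserting an extra level between the root and the hosts changes the inputs to those draws, and therefore the chosen host, even though the hosts and their weights are untouched.

# Toy illustration only (not Ceph's real CRUSH code): a straw2-like draw,
# showing why inserting an intermediate bucket level reshuffles placements.
# Host/datacenter names are taken from the osd tree above; weights are simplified.
import hashlib
import math

def draw(obj: str, bucket: str, weight: float) -> float:
    """Deterministic pseudo-random 'straw length' for (object, bucket)."""
    h = int.from_bytes(hashlib.sha256(f"{obj}/{bucket}".encode()).digest()[:8], "big")
    u = (h + 1) / 2**64              # uniform-ish value in (0, 1]
    return math.log(u) / weight      # straw2-style weighting; caller picks the max

def select(obj: str, tree: dict, node: str) -> str:
    """Descend the hierarchy, at each level keeping the child with the longest straw."""
    children = tree.get(node)
    if not children:                 # reached a leaf (host)
        return node
    best = max(children, key=lambda c: draw(obj, c, children[c]))
    return select(obj, tree, best)

# Flat map: root -> hosts (equal weights for simplicity).
flat = {"default": {"node-4": 1.0, "node-5": 1.0, "node-6": 1.0}}

# Same hosts, same weights, but with a datacenter level inserted under the root.
with_dcs = {
    "default":  {"FSN-DC4": 1.0, "FSN-DC16": 1.0, "FSN-DC18": 1.0},
    "FSN-DC4":  {"node-4": 1.0},
    "FSN-DC16": {"node-5": 1.0},
    "FSN-DC18": {"node-6": 1.0},
}

objects = [f"obj-{i}" for i in range(10000)]
moved = sum(1 for o in objects
            if select(o, flat, "default") != select(o, with_dcs, "default"))
print(f"{moved}/{len(objects)} objects map to a different host")

In this toy, roughly two thirds of the objects map to a different host once the datacenter level is inserted, which is qualitatively the same effect as the ~73% of objects reported as misplaced by `ceph status` above.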