Based on my understanding of CRUSH, it works down the hierarchy: at each level it picks buckets for the object pseudo-randomly (but deterministically for a given CRUSH map), according to the selection rule, and repeats this recursively until it ends up at the leaf nodes. Because you introduced a whole hierarchy level just below the top, objects are now distributed differently: the hash-based selection may now, for example, place an object that used to live under node-4 under FSN-DC16 instead. So when you change the hierarchy, you can generally expect lots of data movement everywhere downstream of the change. (A rough sketch of that selection logic is at the bottom of this mail.)

On Sun, 16 Jul 2023 at 06:03, Niklas Hambüchen <mail@xxxxxx> wrote:
> Hi Ceph users,
>
> I have a Ceph 16.2.7 cluster that so far has been replicated over the
> `host` failure domain.
> All `hosts` have been chosen to be in different `datacenter`s, so that was
> sufficient.
>
> Now I wish to add more hosts, including some in already-used data centers,
> so I'm planning to use CRUSH's `datacenter` failure domain instead.
>
> My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph
> decides that it should now rebalance the entire cluster.
> This seems unnecessary, and wrong.
>
> Before, `ceph osd tree` (some OSDs omitted for legibility):
>
> ID   CLASS  WEIGHT     TYPE NAME        STATUS  REWEIGHT  PRI-AFF
>  -1         440.73514  root default
>  -3         146.43625      host node-4
>   2    hdd   14.61089          osd.2        up   1.00000  1.00000
>   3    hdd   14.61089          osd.3        up   1.00000  1.00000
>  -7         146.43625      host node-5
>  14    hdd   14.61089          osd.14       up   1.00000  1.00000
>  15    hdd   14.61089          osd.15       up   1.00000  1.00000
> -10         146.43625      host node-6
>  26    hdd   14.61089          osd.26       up   1.00000  1.00000
>  27    hdd   14.61089          osd.27       up   1.00000  1.00000
>
> After assigning the `datacenter` CRUSH buckets:
>
> ID   CLASS  WEIGHT     TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
>  -1         440.73514  root default
> -18         146.43625      datacenter FSN-DC16
>  -7         146.43625          host node-5
>  14    hdd   14.61089              osd.14            up   1.00000  1.00000
>  15    hdd   14.61089              osd.15            up   1.00000  1.00000
> -17         146.43625      datacenter FSN-DC18
> -10         146.43625          host node-6
>  26    hdd   14.61089              osd.26            up   1.00000  1.00000
>  27    hdd   14.61089              osd.27            up   1.00000  1.00000
> -16         146.43625      datacenter FSN-DC4
>  -3         146.43625          host node-4
>   2    hdd   14.61089              osd.2             up   1.00000  1.00000
>   3    hdd   14.61089              osd.3             up   1.00000  1.00000
>
> This shows that the tree is essentially unchanged, it just "gained a
> level".
>
> In `ceph status` I now get:
>
>     pgs: 1167541260/1595506041 objects misplaced (73.177%)
>
> If I remove the `datacenter` level again, then the misplacement disappears.
>
> On a minimal testing cluster, this misplacement issue did not appear.
>
> Why does Ceph think that these objects are misplaced when I add the
> datacenter level?
> Is there a more correct way to do this?
>
> Thanks!
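As a postscript, here is the sketch mentioned above. It is a minimal toy illustration in Python, not Ceph's actual CRUSH/straw2 code: the hash and the "straw" weighting are simplified stand-ins for the real Jenkins-based implementation, and the bucket names and weights are just lifted from the tree above. The point it demonstrates is that every bucket choice is a deterministic function of (object, candidate bucket), so inserting an extra level between the root and the hosts changes the inputs to those draws, and therefore the chosen host, even though the hosts and their weights are untouched.

# Toy illustration only (not Ceph's real CRUSH code): a straw2-like draw,
# showing why inserting an intermediate bucket level reshuffles placements.
# Host/datacenter names are taken from the osd tree above; weights are simplified.
import hashlib
import math

def draw(obj: str, bucket: str, weight: float) -> float:
    """Deterministic pseudo-random 'straw length' for (object, bucket)."""
    h = int.from_bytes(hashlib.sha256(f"{obj}/{bucket}".encode()).digest()[:8], "big")
    u = (h + 1) / 2**64              # uniform-ish value in (0, 1]
    return math.log(u) / weight      # straw2-style weighting; caller picks the max

def select(obj: str, tree: dict, node: str) -> str:
    """Descend the hierarchy, at each level keeping the child with the longest straw."""
    children = tree.get(node)
    if not children:                 # reached a leaf (host)
        return node
    best = max(children, key=lambda c: draw(obj, c, children[c]))
    return select(obj, tree, best)

# Flat map: root -> hosts (equal weights for simplicity).
flat = {"default": {"node-4": 1.0, "node-5": 1.0, "node-6": 1.0}}

# Same hosts, same weights, but with a datacenter level inserted under the root.
with_dcs = {
    "default":  {"FSN-DC4": 1.0, "FSN-DC16": 1.0, "FSN-DC18": 1.0},
    "FSN-DC4":  {"node-4": 1.0},
    "FSN-DC16": {"node-5": 1.0},
    "FSN-DC18": {"node-6": 1.0},
}

objects = [f"obj-{i}" for i in range(10000)]
moved = sum(1 for o in objects
            if select(o, flat, "default") != select(o, with_dcs, "default"))
print(f"{moved}/{len(objects)} objects map to a different host")

In this toy, roughly two thirds of the objects map to a different host once the datacenter level is inserted, which is qualitatively the same effect as the ~73% of objects reported as misplaced by `ceph status` above.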