Adding datacenter level to CRUSH tree causes rebalancing

Niklas Hambüchen <mail@xxxxxx> · Sat, 15 Jul 2023 20:02:45 +0200

Hi Ceph users,

I have a Ceph 16.2.7 cluster that so far has been replicated over the `host` failure domain.
All `host`s were placed in different `datacenter`s, so that was sufficient.

Now I wish to add more hosts, including some in already-used data centers, so I'm planning to use CRUSH's `datacenter` failure domain instead.

My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph decides that it should now rebalance the entire cluster.
This seems unnecessary, and wrong.
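
The post does not say which commands were used to add the level. One common approach is `ceph osd crush add-bucket` followed by `ceph osd crush move`, so the following is only a sketch under that assumption. It prints the commands as a dry run rather than executing them. The helper name `crush_commands` is hypothetical, and the host-to-datacenter mapping is read off the "after" tree shown further down in this message.

```python
# Dry-run sketch: print the CRUSH commands that would add a datacenter level.
# Assumption: the mapping below is taken from the "after" tree in this post.
HOST_TO_DC = {
    "node-5": "FSN-DC16",
    "node-6": "FSN-DC18",
    "node-4": "FSN-DC4",
}


def crush_commands(host_to_dc, root="default"):
    """Return the ceph CLI commands as strings; nothing is executed."""
    cmds = []
    # Create each datacenter bucket and place it under the root.
    for dc in sorted(set(host_to_dc.values())):
        cmds.append(f"ceph osd crush add-bucket {dc} datacenter")
        cmds.append(f"ceph osd crush move {dc} root={root}")
    # Move each host bucket under its datacenter bucket.
    for host, dc in sorted(host_to_dc.items()):
        cmds.append(f"ceph osd crush move {host} datacenter={dc}")
    return cmds


for cmd in crush_commands(HOST_TO_DC):
    print(cmd)
```

Switching the pools to a rule that uses `datacenter` as the failure domain would be a separate step (for example via `ceph osd crush rule create-replicated`), which this sketch leaves out.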

Before, `ceph osd tree` (some OSDs omitted for legibility):

    ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
     -1         440.73514  root default
     -3         146.43625      host node-4
      2    hdd   14.61089          osd.2                up   1.00000  1.00000
      3    hdd   14.61089          osd.3                up   1.00000  1.00000
     -7         146.43625      host node-5
     14    hdd   14.61089          osd.14               up   1.00000  1.00000
     15    hdd   14.61089          osd.15               up   1.00000  1.00000
    -10         146.43625      host node-6
     26    hdd   14.61089          osd.26               up   1.00000  1.00000
     27    hdd   14.61089          osd.27               up   1.00000  1.00000

After assigning the hosts to `datacenter` CRUSH buckets:

    ID   CLASS  WEIGHT     TYPE NAME                    STATUS  REWEIGHT  PRI-AFF
     -1         440.73514  root default
    -18         146.43625      datacenter FSN-DC16
     -7         146.43625          host node-5
     14    hdd   14.61089              osd.14               up   1.00000  1.00000
     15    hdd   14.61089              osd.15               up   1.00000  1.00000
    -17         146.43625      datacenter FSN-DC18
    -10         146.43625          host node-6
     26    hdd   14.61089              osd.26               up   1.00000  1.00000
     27    hdd   14.61089              osd.27               up   1.00000  1.00000
    -16         146.43625      datacenter FSN-DC4
     -3         146.43625          host node-4
      2    hdd   14.61089              osd.2                up   1.00000  1.00000
      3    hdd   14.61089              osd.3                up   1.00000  1.00000

This shows that the tree is essentially unchanged; it just "gained a level".
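
A minimal sketch of how that claim could be checked mechanically: parse both listings and compare which OSDs (and weights) sit under which host. The parser `host_to_osds` is a hypothetical helper that assumes the column layout shown above, and it only covers the OSDs listed in this post, not the omitted ones.

```python
# Compare the host -> OSD layout of the two `ceph osd tree` listings above.
BEFORE = """\
 -1         440.73514  root default
 -3         146.43625      host node-4
  2    hdd   14.61089          osd.2                up   1.00000  1.00000
  3    hdd   14.61089          osd.3                up   1.00000  1.00000
 -7         146.43625      host node-5
 14    hdd   14.61089          osd.14               up   1.00000  1.00000
 15    hdd   14.61089          osd.15               up   1.00000  1.00000
-10         146.43625      host node-6
 26    hdd   14.61089          osd.26               up   1.00000  1.00000
 27    hdd   14.61089          osd.27               up   1.00000  1.00000
"""

AFTER = """\
 -1         440.73514  root default
-18         146.43625      datacenter FSN-DC16
 -7         146.43625          host node-5
 14    hdd   14.61089              osd.14               up   1.00000  1.00000
 15    hdd   14.61089              osd.15               up   1.00000  1.00000
-17         146.43625      datacenter FSN-DC18
-10         146.43625          host node-6
 26    hdd   14.61089              osd.26               up   1.00000  1.00000
 27    hdd   14.61089              osd.27               up   1.00000  1.00000
-16         146.43625      datacenter FSN-DC4
 -3         146.43625          host node-4
  2    hdd   14.61089              osd.2                up   1.00000  1.00000
  3    hdd   14.61089              osd.3                up   1.00000  1.00000
"""


def host_to_osds(tree_text):
    """Map each host name to its list of (osd_name, weight) tuples."""
    hosts = {}
    current = None
    for line in tree_text.splitlines():
        fields = line.split()
        # Bucket lines look like: ID WEIGHT TYPE NAME
        if len(fields) >= 4 and fields[2] == "host":
            current = fields[3]
            hosts[current] = []
            continue
        # OSD lines contain a field such as "osd.14", preceded by the weight.
        for i, field in enumerate(fields):
            if field.startswith("osd.") and current is not None:
                hosts[current].append((field, float(fields[i - 1])))
                break
    return hosts


# Dict comparison ignores the order in which the hosts are listed.
print("same host/OSD layout:", host_to_osds(BEFORE) == host_to_osds(AFTER))
```

For the two listings in this post the comparison holds: the same OSDs with the same weights sit under the same hosts, and only the intermediate `datacenter` buckets are new.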

In `ceph status` I now get:

    pgs:     1167541260/1595506041 objects misplaced (73.177%)

If I remove the `datacenter` level again, then the misplacement disappears.

On a minimal test cluster, this misplacement issue did not appear.

Why does Ceph think that these objects are misplaced when I add the datacenter level?
Is there a more correct way to do this?

Thanks!