Hi Niklas,
I am not sure why you are surprised. In a large cluster, you should
expect some rebalancing on every CRUSH map or CRUSH rule change.
Ceph doesn't just enforce the failure domain; it also wants a
"perfect" pseudo-random distribution across the cluster based on the
CRUSH map hierarchy and the rules you provided. Rebalancing in itself is
not a problem, apart from adding some load to your cluster. As you are
running Pacific, you need to tune your osd_max_backfills limit properly
so that the rebalancing is fast enough but doesn't stress your cluster
too much. Since Quincy there is a new scheduler (mClock) that doesn't
rely on these settings but tries to achieve fairness at the OSD level
between the different types of load, and my experience with some large
rebalancing on a cluster with 200 OSDs is that the result is pretty good.
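As a minimal sketch of that Pacific-era tuning (the value 4 is just an illustration; pick whatever your disks and network tolerate):

```shell
# Raise the per-OSD backfill concurrency while the rebalance runs
# (Pacific's wpq scheduler honours this limit).
ceph config set osd osd_max_backfills 4

# Watch recovery/backfill progress.
ceph -s

# Drop back to the default once the cluster is healthy again.
ceph config set osd osd_max_backfills 1
```

On Quincy and later, the mClock scheduler manages recovery limits itself; there you would pick a profile such as high_recovery_ops via osd_mclock_profile instead of touching osd_max_backfills.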
If your test cluster is small enough, it may be that the current
placement is simply not sensitive to the change of failure domain:
with so few hosts, there are not enough distinct placement options
for the placement algorithm to change anything. Take that as the
exception rather than the normal behaviour.
Cheers,
Michel
On 15/07/2023 at 20:02, Niklas Hambüchen wrote:
Hi Ceph users,
I have a Ceph 16.2.7 cluster that so far has been replicated over the
`host` failure domain.
All `hosts` have been chosen to be in different `datacenter`s, so that
was sufficient.
Now I wish to add more hosts, including some in already-used data
centers, so I'm planning to use CRUSH's `datacenter` failure domain
instead.
My problem is that when I add the `datacenter` buckets to the CRUSH
tree, Ceph decides that it should now rebalance the entire cluster.
This seems unnecessary and wrong.
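Concretely, the change looks roughly like this (the pool name and the rule name `rep_dc` are placeholders, not my actual names):

```shell
# Create a datacenter bucket and hang it under the root.
ceph osd crush add-bucket FSN-DC4 datacenter
ceph osd crush move FSN-DC4 root=default

# Move an existing host under its datacenter.
ceph osd crush move node-4 datacenter=FSN-DC4

# New replicated rule that spreads replicas across datacenters
# instead of hosts.
ceph osd crush rule create-replicated rep_dc default datacenter hdd

# Point the pool at the new rule.
ceph osd pool set mypool crush_rule rep_dc
```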
Before, `ceph osd tree` (some OSDs omitted for legibility):
ID  CLASS  WEIGHT     TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         440.73514  root default
-3         146.43625      host node-4
 2    hdd   14.61089          osd.2       up   1.00000  1.00000
 3    hdd   14.61089          osd.3       up   1.00000  1.00000
-7         146.43625      host node-5
14    hdd   14.61089          osd.14      up   1.00000  1.00000
15    hdd   14.61089          osd.15      up   1.00000  1.00000
-10        146.43625      host node-6
26    hdd   14.61089          osd.26      up   1.00000  1.00000
27    hdd   14.61089          osd.27      up   1.00000  1.00000
After assigning the `datacenter` CRUSH buckets:
ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         440.73514  root default
-18         146.43625      datacenter FSN-DC16
 -7         146.43625          host node-5
 14    hdd   14.61089              osd.14          up   1.00000  1.00000
 15    hdd   14.61089              osd.15          up   1.00000  1.00000
-17         146.43625      datacenter FSN-DC18
-10         146.43625          host node-6
 26    hdd   14.61089              osd.26          up   1.00000  1.00000
 27    hdd   14.61089              osd.27          up   1.00000  1.00000
-16         146.43625      datacenter FSN-DC4
 -3         146.43625          host node-4
  2    hdd   14.61089              osd.2           up   1.00000  1.00000
  3    hdd   14.61089              osd.3           up   1.00000  1.00000
This shows that the tree is essentially unchanged; it just "gained a
level".
In `ceph status` I now get:
pgs: 1167541260/1595506041 objects misplaced (73.177%)
If I remove the `datacenter` level again, then the misplacement
disappears.
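For what it's worth, the effect can be inspected without letting data actually move, by wrapping the change in the `norebalance` flag (a sketch; the PGs still show as misplaced, but no backfill starts):

```shell
# Prevent backfills from starting while the CRUSH tree is edited.
ceph osd set norebalance

# ...apply the datacenter bucket changes, then inspect `ceph status`...

# Re-enable data movement once the new layout looks right.
ceph osd unset norebalance
```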
On a minimal testing cluster, this misplacement issue did not appear.
Why does Ceph think that these objects are misplaced when I add the
datacenter level?
Is there a more correct way to do this?
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx