Re: Adding datacenter level to CRUSH tree causes rebalancing

Hi Niklas,

I am not sure why you are surprised. In a large cluster, you should expect some rebalancing on every CRUSH map or CRUSH rule change. Ceph doesn't just enforce the failure domain; it also wants a "perfect" pseudo-random distribution across the cluster based on the CRUSH hierarchy and the rules you provided. Rebalancing in itself is not a problem, apart from adding some load to your cluster. As you are running Pacific, you need to tune your osd_max_backfills limit properly so that the rebalancing is fast enough without stressing your cluster too much. Since Quincy there is a new scheduler (mClock) that doesn't rely on these settings but tries to achieve fairness at the OSD level between the different types of load, and my experience with some large rebalancing on a cluster with 200 OSDs is that the result is pretty good.
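For reference, a minimal sketch of that tuning (the values are only illustrative, not recommendations):

    # Pacific: raise the per-OSD backfill/recovery limits to speed up rebalancing
    # (or lower them to reduce the load); the values here are examples only
    ceph config set osd osd_max_backfills 2
    ceph config set osd osd_recovery_max_active 3
    # Quincy and later: the mClock scheduler manages this itself; a recovery-
    # friendly profile can be selected instead of the per-OSD limits above
    ceph config set osd osd_mclock_profile high_recovery_ops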

If your test cluster is small enough, it may be that the current placement is not really sensitive to the change of failure domain, as there are not enough different placement options for the algorithm to change anything. Take that as an exception rather than the normal behaviour.
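If you want to estimate in advance how much data a CRUSH change will move, a rough offline check with crushtool looks like this (a sketch; rule id 0 and a replica count of 3 are assumptions, and the file names are arbitrary):

    # Export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # Edit a copy of the decompiled map (crushmap-new.txt) to add the
    # datacenter buckets, then recompile it
    crushtool -c crushmap-new.txt -o crushmap-new.bin
    # Generate mappings for a range of sample inputs from both maps and compare
    crushtool -i crushmap.bin --test --show-mappings --rule 0 --num-rep 3 > old-mappings.txt
    crushtool -i crushmap-new.bin --test --show-mappings --rule 0 --num-rep 3 > new-mappings.txt
    diff old-mappings.txt new-mappings.txt | grep -c '^>'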

Cheers,

Michel

On 15/07/2023 at 20:02, Niklas Hambüchen wrote:
Hi Ceph users,

I have a Ceph 16.2.7 cluster that so far has been replicated over the `host` failure domain. All `hosts` have been chosen to be in different `datacenter`s, so that was sufficient.

Now I wish to add more hosts, including some in already-used data centers, so I'm planning to use CRUSH's `datacenter` failure domain instead.
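
For reference, buckets like these can be created and hosts moved under them with commands along the following lines (a sketch; the exact commands may differ, and the `replicated_dc` rule name and the `hdd` device class are only examples):

    # Create a datacenter bucket, attach it under the root, move a host into it
    ceph osd crush add-bucket FSN-DC16 datacenter
    ceph osd crush move FSN-DC16 root=default
    ceph osd crush move node-5 datacenter=FSN-DC16
    # ... repeated analogously for FSN-DC18/node-6 and FSN-DC4/node-4
    # A replication rule using the new failure domain could then be created with:
    ceph osd crush rule create-replicated replicated_dc default datacenter hdd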

My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph decides that it should now rebalance the entire cluster.
This seems unnecessary, and wrong.

Before, `ceph osd tree` (some OSDs omitted for legibility):


    ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
     -1         440.73514  root default
     -3         146.43625      host node-4
      2    hdd   14.61089          osd.2                up  1.00000   1.00000
      3    hdd   14.61089          osd.3                up  1.00000   1.00000
     -7         146.43625      host node-5
     14    hdd   14.61089          osd.14               up  1.00000   1.00000
     15    hdd   14.61089          osd.15               up  1.00000   1.00000
    -10         146.43625      host node-6
     26    hdd   14.61089          osd.26               up  1.00000   1.00000
     27    hdd   14.61089          osd.27               up  1.00000   1.00000


After assigning the `datacenter` CRUSH buckets:

    ID   CLASS  WEIGHT     TYPE NAME                    STATUS  REWEIGHT  PRI-AFF
     -1         440.73514  root default
    -18         146.43625      datacenter FSN-DC16
     -7         146.43625          host node-5
     14    hdd   14.61089              osd.14               up  1.00000   1.00000
     15    hdd   14.61089              osd.15               up  1.00000   1.00000
    -17         146.43625      datacenter FSN-DC18
    -10         146.43625          host node-6
     26    hdd   14.61089              osd.26               up  1.00000   1.00000
     27    hdd   14.61089              osd.27               up  1.00000   1.00000
    -16         146.43625      datacenter FSN-DC4
     -3         146.43625          host node-4
      2    hdd   14.61089              osd.2                up  1.00000   1.00000
      3    hdd   14.61089              osd.3                up  1.00000   1.00000


This shows that the tree is essentially unchanged; it just "gained a level".

In `ceph status` I now get:

    pgs:     1167541260/1595506041 objects misplaced (73.177%)

If I remove the `datacenter` level again, then the misplacement disappears.

On a minimal testing cluster, this misplacement issue did not appear.

Why does Ceph think that these objects are misplaced when I add the datacenter level?
Is there a more correct way to do this?


Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



