Hi Niklas,
I am not sure why you are surprised. In a large cluster, you should
expect some rebalancing on every CRUSH map or CRUSH rule change.
Ceph doesn't just enforce the failure domain; it also wants a
"perfect" pseudo-random distribution across the cluster based on the
CRUSH map hierarchy and the rules you provided. Rebalancing in itself is
not a problem, apart from adding some load to your cluster. As you are
running Pacific, you need to tune your osd_max_backfills limit properly
so that the rebalancing is fast enough but doesn't stress your cluster
too much. Since Quincy there is a new scheduler (mClock) that doesn't
rely on these settings but tries to achieve fairness at the OSD level
between the different types of load, and my experience with some large
rebalancing on a cluster with 200 OSDs is that the result is pretty good.
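As a minimal sketch of that Pacific-era tuning (the value 4 is just an illustration; pick whatever your disks and network tolerate):

```shell
# Raise the per-OSD backfill concurrency while the rebalance runs
# (Pacific's wpq scheduler honours this limit).
ceph config set osd osd_max_backfills 4

# Watch recovery/backfill progress.
ceph -s

# Drop back to the default once the cluster is healthy again.
ceph config set osd osd_max_backfills 1
```

On Quincy and later, the mClock scheduler manages recovery limits itself; there you would pick a profile such as high_recovery_ops via osd_mclock_profile instead of touching osd_max_backfills.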
If your test cluster is small enough, it may be that the current
placement is simply not sensitive to the change of failure domain:
with so few hosts, there are not enough distinct placement options
for the placement algorithm to change anything. Take that as the
exception rather than the normal behaviour.
Cheers,
Michel
On 15/07/2023 at 20:02, Niklas Hambüchen wrote:
Hi Ceph users,
I have a Ceph 16.2.7 cluster that so far has been replicated over the
`host` failure domain.
All `hosts` have been chosen to be in different `datacenter`s, so that
was sufficient.
Now I wish to add more hosts, including some in already-used data
centers, so I'm planning to use CRUSH's `datacenter` failure domain
instead.
My problem is that when I add the `datacenter` buckets to the CRUSH
tree, Ceph decides that it should now rebalance the entire cluster.
This seems unnecessary and wrong.
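Concretely, the change looks roughly like this (the pool name and the rule name `rep_dc` are placeholders, not my actual names):

```shell
# Create a datacenter bucket and hang it under the root.
ceph osd crush add-bucket FSN-DC4 datacenter
ceph osd crush move FSN-DC4 root=default

# Move an existing host under its datacenter.
ceph osd crush move node-4 datacenter=FSN-DC4

# New replicated rule that spreads replicas across datacenters
# instead of hosts.
ceph osd crush rule create-replicated rep_dc default datacenter hdd

# Point the pool at the new rule.
ceph osd pool set mypool crush_rule rep_dc
```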
Before, `ceph osd tree` (some OSDs omitted for legibility):
ID  CLASS  WEIGHT     TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         440.73514  root default
-3         146.43625      host node-4
 2    hdd   14.61089          osd.2       up   1.00000  1.00000
 3    hdd   14.61089          osd.3       up   1.00000  1.00000
-7         146.43625      host node-5
14    hdd   14.61089          osd.14      up   1.00000  1.00000
15    hdd   14.61089          osd.15      up   1.00000  1.00000
-10        146.43625      host node-6
26    hdd   14.61089          osd.26      up   1.00000  1.00000
27    hdd   14.61089          osd.27      up   1.00000  1.00000
After assigning the `datacenter` CRUSH buckets:
ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         440.73514  root default
-18         146.43625      datacenter FSN-DC16
 -7         146.43625          host node-5
 14    hdd   14.61089              osd.14          up   1.00000  1.00000
 15    hdd   14.61089              osd.15          up   1.00000  1.00000
-17         146.43625      datacenter FSN-DC18
-10         146.43625          host node-6
 26    hdd   14.61089              osd.26          up   1.00000  1.00000
 27    hdd   14.61089              osd.27          up   1.00000  1.00000
-16         146.43625      datacenter FSN-DC4
 -3         146.43625          host node-4
  2    hdd   14.61089              osd.2           up   1.00000  1.00000
  3    hdd   14.61089              osd.3           up   1.00000  1.00000
This shows that the tree is essentially unchanged; it just "gained a
level".
In `ceph status` I now get:
pgs: 1167541260/1595506041 objects misplaced (73.177%)
If I remove the `datacenter` level again, then the misplacement
disappears.
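For what it's worth, the effect can be inspected without letting data actually move, by wrapping the change in the `norebalance` flag (a sketch; the PGs still show as misplaced, but no backfill starts):

```shell
# Prevent backfills from starting while the CRUSH tree is edited.
ceph osd set norebalance

# ...apply the datacenter bucket changes, then inspect `ceph status`...

# Re-enable data movement once the new layout looks right.
ceph osd unset norebalance
```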
On a minimal testing cluster, this misplacement issue did not appear.
Why does Ceph think that these objects are misplaced when I add the
datacenter level?
Is there a more correct way to do this?
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx