Hi Anthony and everyone else,

We have found the issue. Because the new 20x 14 TiB OSDs were onboarded
onto a single node, there was an imbalance not only in the capacity of
each OSD but also between the nodes (the other nodes each have around
15x 1.7 TiB). Furthermore, the CRUSH rule uses the default failure domain
of host with 3x replication. This means that at most one copy of each PG
can reside on the node with the 20x 14 TiB OSDs, while the other two
copies are forced onto the nodes with 1.7 TiB OSDs regardless of weight,
as there are no other candidates. Changing the failure domain from host
to osd resolved the issue and I was able to achieve perfect balance, at
the cost of redundancy (a sketch of the commands involved is at the end
of this message). Moving forward we will physically rearrange the OSDs
across the nodes.

Thanks,
Jasper Tan

On Thu, Feb 8, 2024 at 3:29 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > I have recently onboarded new OSDs into my Ceph Cluster. Previously, I
> > had 44 OSDs of 1.7 TiB each and was using it for about a year. About 1
> > year ago, we onboarded an additional 20 OSDs of 14 TiB each.
>
> That's a big difference in size. I suggest increasing mon_max_pg_per_osd
> to 1000 -- that will help avoid unpleasantness when a component fails,
> including PGs or OSDs that won't activate.
>
> > However I observed that much of the data was still being written to the
> > original 1.7 TiB OSDs instead of the 14 TiB ones. Over time, this caused
> > a bottleneck as the 1.7 TiB OSDs reached nearfull capacity.
>
> Please share your Ceph release, `ceph osd tree`, and what your pool
> definitions and CRUSH rules look like.
>
> > I have tried to perform a reweight (both crush reweight and reweight)
> > to reduce the number of PGs on each 1.7 TiB OSD. This worked temporarily
> > but resulted in many objects being misplaced and PGs being in a warning
> > state.
>
> Misplaced objects are natural in such an operation. With recent Ceph
> releases you shouldn't have to do this. You have the balancer module
> enabled?
>
> > Subsequently I have also tried using crush-compat balancer mode instead
> > of upmap but did not see significant improvement. The latest change I
> > made was to set backfill-threshold to 0.85, hoping that PGs would no
> > longer be assigned to OSDs that are above 85% utilization.
>
> No, that'll just stop backfill from happening. That ratio is for a
> different purpose.
>
> > However, this did not change the situation much, as I see many OSDs
> > above 85% utilization today.
> >
> > Attached is a report from the ceph report command.
>
> Attachments don't make it through to the list.
>
> I suspect that what you're seeing is a misalignment of your CRUSH rules
> and your cluster topology:
>
> * Maybe your 1.7 TB OSDs are the ssd device class and the 14 TB OSDs are
> the hdd device class. If your CRUSH rule(s) specify the ssd device class,
> they won't use the new OSDs.
> * Say you have failure domain = host, and all the 14 TB OSDs are on one
> or two hosts. Your CRUSH rules may force the smaller OSDs to be selected
> for PGs to satisfy anti-affinity.
> * Similarly if you have a rack failure domain and all the larger OSDs are
> in the same rack.
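
For reference, the failure-domain change amounted to creating a replicated
CRUSH rule that chooses leaves at the osd level and pointing the pool at it.
The rule name and pool name below are placeholders, and note again that with
an osd failure domain multiple replicas of a PG can land on the same host,
so this trades redundancy for balance:

    # create a replicated rule whose failure domain is osd instead of host
    ceph osd crush rule create-replicated replicated_by_osd default osd

    # switch the pool to the new rule; data will rebalance across all OSDs
    ceph osd pool set mypool crush_rule replicated_by_osd

    # confirm which rule each pool uses and watch utilization even out
    ceph osd pool ls detail
    ceph osd df tree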
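
And for the archive, the knobs Anthony mentioned can be inspected and set
roughly like this on a reasonably recent release (the values are only
examples, not recommendations for every cluster):

    # allow more PGs per OSD so peering/activation isn't blocked on the
    # mixed-size OSDs
    ceph config set global mon_max_pg_per_osd 1000

    # check that the balancer is enabled and using upmap
    ceph balancer status
    ceph balancer mode upmap
    ceph balancer on

    # the cluster-wide ratios that actually govern backfill and nearfull
    # warnings; they are thresholds, not a placement rule
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-nearfull-ratio 0.85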