Re: Help: Balancing Ceph OSDs with different capacity



> I have recently onboarded new OSDs into my Ceph Cluster. Previously, I had
> 44 OSDs of 1.7TiB each and was using it for about a year. About 1 year ago,
> we onboarded an additional 20 OSDs of 14TiB each.

That's a big difference in size.  I suggest increasing `mon_max_pg_per_osd` to 1000 -- that will help avoid unpleasantness when a component fails, including PGs that won't activate.
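On a reasonably recent release the setting can be raised centrally with `ceph config set` (a sketch; older releases would need it in ceph.conf instead):

```shell
# Raise the per-OSD PG limit so PGs can still activate
# when failures shift extra PGs onto surviving OSDs
ceph config set mon mon_max_pg_per_osd 1000

# Verify the value took effect
ceph config get mon mon_max_pg_per_osd
```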

> However I observed that many of the data were still being written onto the
> original 1.7TiB OSDs instead of the 14TiB ones. Over time, this caused a
> bottleneck as the 1.7TiB OSDs reached nearfull capacity.

Please share your Ceph release, `ceph osd tree`, and what your pool definitions and CRUSH rules look like.
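For convenience, the commands that gather all of that:

```shell
ceph version             # Ceph release
ceph osd tree            # CRUSH topology, device classes, weights
ceph osd pool ls detail  # pool definitions (size, pg_num, crush_rule)
ceph osd crush rule dump # CRUSH rules in full
```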

> I have tried to perform a reweight (both crush reweight and reweight) to
> reduce the number of PGs on each 1.7TiB. This worked temporarily but
> resulted in many objects being misplaced and PGs being in a Warning state.

Misplaced objects are expected during such an operation.  With recent Ceph releases you shouldn't have to do this.  Do you have the balancer module enabled?
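To check, and to switch back to upmap mode if it's off (upmap generally balances better than crush-compat, assuming all clients are Luminous or later):

```shell
ceph balancer status      # shows active/mode and any plans in flight

# If not active:
ceph balancer mode upmap  # requires min-compat-client luminous or later
ceph balancer on
```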

> Subsequently I have also tried using crush-compat balancer mode instead of
> upmap but did not see significant improvement. The latest changes I made
> was to change backfill-threshold to 0.85, hoping that PGs will no longer be
> assigned to OSDs that are >85% utilization.

No, that'll just stop backfill from happening.  That ratio exists for a different purpose; it doesn't influence where CRUSH places PGs.
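For reference, the fullness ratios can be inspected and set cluster-wide; `backfillfull` only gates whether backfill may proceed toward an OSD, it doesn't steer placement:

```shell
ceph osd dump | grep ratio           # current full / backfillfull / nearfull ratios

# Defaults are typically full=0.95, backfillfull=0.90, nearfull=0.85:
ceph osd set-backfillfull-ratio 0.90
ceph osd set-nearfull-ratio 0.85
```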

> However, this did not change the situation much as I see many OSDs above 85% utilization today.
> Attached is a report from ceph report command.

Attachments don't make it through to the list.

I suspect that what you're seeing is a misalignment of your CRUSH rules and your cluster topology:

* Maybe your 1.7 TB OSDs are the `ssd` device class and the 14 TB OSDs are the `hdd` device class.  If your CRUSH rule(s) specify the `ssd` device class, they won't use the new OSDs.
* Say you have failure domain = host, and all the 14 TB OSDs are on one or two hosts.  Your CRUSH rules may force the smaller OSDs to be selected for PGs to satisfy anti-affinity.
* Similarly if you have a rack failure domain and all the larger OSDs are in the same rack.
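A quick way to check both possibilities (device classes and how capacity is spread across failure domains):

```shell
ceph osd crush class ls              # which device classes exist
ceph osd crush tree --show-shadow    # topology, including per-class shadow trees
ceph osd df tree                     # per-OSD utilization grouped by CRUSH hierarchy
```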

ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
