Re: Help: Balancing Ceph OSDs with different capacity

Jasper Tan <jasper.tan@xxxxxxxxxxxxxx> · Thu, 8 Feb 2024 09:59:41 +0800

Hi Anthony and everyone else

We have found the issue. Because the 20 new 14 TiB OSDs were all onboarded
onto a single node, the imbalance was not only between individual OSDs but
also between nodes (the other nodes each have around 15x 1.7 TiB OSDs).
Furthermore, our CRUSH rule uses the default failure domain of host with 3x
replication. This means that at most one of the three copies of each PG can
reside on the node with the 20x 14 TiB OSDs, while the other two copies are
forced onto the nodes with 1.7 TiB OSDs regardless of weight, as there are
no other alternatives. Changing the failure domain from host to osd resolved
the issue and I was able to achieve perfect balance, at the cost of
redundancy, since copies of a PG may now land on OSDs in the same node.
Moving forward we will physically rearrange the OSDs across the nodes.
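
To put rough numbers on the host failure domain limit, here is a minimal
Python sketch. It is only a best-case estimate, not how CRUSH actually
places data: it ignores full ratios, other pools, and uneven PG
distribution. The host layout (three small hosts with 15, 15 and 14 OSDs)
is an assumption based on "around 15x 1.7TiB" per node, and the function
name `max_logical_capacity` is made up for this illustration.

```python
def max_logical_capacity(domain_caps_tib, replicas=3):
    """Best-case logical capacity (TiB) when every replica of an object
    must be stored in a different failure domain."""
    if len(domain_caps_tib) < replicas:
        return 0.0  # not enough failure domains to place all replicas
    caps = sorted(domain_caps_tib, reverse=True)
    remaining_total = sum(caps)
    remaining_replicas = replicas
    bound = remaining_total / remaining_replicas
    for cap in caps:
        if cap <= bound:
            break
        # This domain is larger than the whole data set, but it can hold at
        # most one copy of each object. Give it one replica and size the
        # data set by what the remaining domains can hold.
        remaining_total -= cap
        remaining_replicas -= 1
        bound = remaining_total / remaining_replicas
    return bound


# Assumed layout: one host with 20x 14 TiB, three hosts with 1.7 TiB OSDs.
hosts = [20 * 14.0, 15 * 1.7, 15 * 1.7, 14 * 1.7]
# With failure domain = osd, every OSD is its own failure domain.
osds = [14.0] * 20 + [1.7] * 44

host_bound = max_logical_capacity(hosts, replicas=3)
osd_bound = max_logical_capacity(osds, replicas=3)
print(f"failure domain host: {host_bound:.1f} TiB logical")
print(f"failure domain osd:  {osd_bound:.1f} TiB logical")
```

Under these assumptions the host failure domain tops out at about 37.4 TiB
of logical data, so the big node can never use more than about 37.4 TiB of
its 280 TiB, while the osd failure domain allows about 118.3 TiB.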

Thanks
Jasper Tan

On Thu, Feb 8, 2024 at 3:29 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

>
>
> > I have recently onboarded new OSDs into my Ceph Cluster. Previously, I had
> > 44 OSDs of 1.7TiB each and was using it for about a year. About 1 year ago,
> > we onboarded an additional 20 OSDs of 14TiB each.
>
> That's a big difference in size.  I suggest increasing
> mon_max_pg_per_osd  to 1000 -- that will help avoid unpleasantness when a
> component fails, including PGs or OSDs that won't activate.
>
> > However I observed that much of the data was still being written onto the
> > original 1.7TiB OSDs instead of the 14TiB ones. Over time, this caused a
> > bottleneck as the 1.7TiB OSDs reached nearfull capacity.
>
> Please share your Ceph release, `ceph osd tree`, and what your pool
> definitions and CRUSH rules look like.
>
>
> > I have tried to perform a reweight (both crush reweight and reweight) to
> > reduce the number of PGs on each 1.7TiB OSD. This worked temporarily but
> > resulted in many objects being misplaced and PGs being in a Warning state.
>
> Misplaced objects are natural in such an operation.  With recent Ceph
> releases you shouldn't have to do this.  You have the balancer module
> enabled?
>
> > Subsequently I have also tried using crush-compat balancer mode instead of
> > upmap but did not see significant improvement. The latest change I made
> > was to change backfill-threshold to 0.85, hoping that PGs will no longer be
> > assigned to OSDs that are above 85% utilization.
>
> No, that'll just stop backfill from happening.  That ratio is for a
> different purpose.
>
>
> > However, this did not change the situation much as I see many OSDs above
> > 85% utilization today.
> >
> > Attached is a report from ceph report command.
>
> Attachments don't make it through to the list.
>
> I suspect that what you're seeing is a misalignment of your CRUSH rules
> and your cluster topology:
>
> * Maybe your 1.7 TB OSDs are the ssd device class and the 14 TB OSDs are
> the hdd device class.  If your CRUSH rule(s) specify the ssd device class,
> they won't use the new OSDs
> * Say you have failure domain = host, and all the 14TB OSDs are on one or
> two hosts.  Your CRUSH rules may force the smaller OSDs to be selected for
> PGs to satisfy anti-affinity
> * Similarly if you have rack failure domain and all the larger OSDs are in
> the same rack.
>
>
>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx