On Tue, Nov 28, 2023 at 6:25 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

> Looks like one 100GB SSD OSD per host? This is AIUI the screaming minimum
> size for an OSD. With WAL, DB, cluster maps, and other overhead there
> doesn't end up being much space left for payload data. On larger OSDs the
> overhead is much more into the noise floor. Given the size of these SSD
> OSDs, I suspect at least one of the following is true:
>
> 1) They're client aka desktop SSDs, not “enterprise”
> 2) They're a partition of a larger OSD shared with other purposes

Yup. They're a mix of SATA SSDs and NVMes, but everything is consumer-grade.
They're only 10% full on average and I'm not super-concerned with
performance. If they did get full I'd allocate more space for them.
Performance is more than adequate for the very light loads they have.

> I suspect that this alone would be enough to frustrate the balancer, which
> AFAIK doesn't take overhead into consideration. You might disable the
> balancer module, reset the reweights to 1.00, and try the JJ balancer but
> I dunno that it would be night vs day.

I'm not really all that concerned with SSD balancing, since if data needs to
be moved around it happens almost instantaneously. They're small and on
10GbE. Also, there are no pools that cross the hdd/ssd device classes, so I
would hope the balancer wouldn't get confused by having both in the cluster.

> min_alloc_size? Were they created on an older Ceph release? Current
> defaults for [non]rotational media are both 4KB; they used to be 64KB but
> were changed with some churn … around the Pacific / Octopus era IIRC. If
> you're re-creating to minimize space amp, does that mean you're running
> RGW with a significant fraction of small objects? With RBD — or CephFS
> with larger files — that isn't so much an issue.

They were created with 4k min_alloc_size. I'm increasing this to 64k for the
hdd OSDs (rough commands sketched further down). I'm hoping that will improve
performance a bit on large files (average file size is multiple MB at least,
I think), and if nothing else it seems to greatly reduce OSD RAM consumption,
so that alone is useful.

> Unless you were to carefully segregate larger and smaller HDDs into
> separate pools, right-sizing the PG count is tricky. 144 is okay, 72 is a
> bit low, upstream guidance notwithstanding. I would still bump some of
> your pg_nums a bit.

The larger OSDs (which make up the bulk of the capacity) have 150+ PGs right
now. The small ones of course have far fewer. I might bump up one of the
CephFS pools as it is starting to accumulate a bit more data.

>> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on pg_num_max 32 pg_num_min 1 application mgr
>
> Check the CRUSH rule for this pool. On my clusters Rook creates it without
> specifying a device-class, but the other pools get rules that do specify a
> device class.

The .mgr pool has 1 PG and is set to use ssd devices only. Its 3 OSDs are all
SSDs right now.

> So many pools for such a small cluster … are you actively using CephFS,
> RBD, *and* RGW? If not, I'd suggest removing whatever you aren't using and
> adjusting pg_num for the pools you are using.

So, I'm using RBD on SSD (128 PGs - maybe a bit overkill for this, but those
OSDs don't have anything else going on), and the bulk of the storage is on
CephFS on HDD with two pools. I've been experimenting a bit with RGW, but
those pools are basically empty and mostly have 8 PGs each.

> Is that a 2,2 or 3,1 profile?

The EC pool? That is k=2, m=2. I am thinking about moving that to a 3+2 pool
once I'm done with all the migration, to be a bit more space-efficient, but I
only have 7 nodes and they aren't completely balanced, so I don't really want
to stripe the data more than that.
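If I do go that route, my rough plan (untested, and the profile/pool names
and pg_num below are just placeholders) is to create a new 3+2 profile and
pool and migrate into it, since an existing EC pool's profile can't be
changed in place:

    # new 3+2 profile on the hdd class, failure domain = host
    ceph osd erasure-code-profile set ec-3-2 k=3 m=2 \
        crush-failure-domain=host crush-device-class=hdd
    # new CephFS data pool using that profile; EC overwrites are needed
    # before CephFS can use it
    ceph osd pool create cephfs-data-ec32 64 64 erasure ec-3-2
    ceph osd pool set cephfs-data-ec32 allow_ec_overwrites true
    # attach it as an extra data pool, then point directories at it via
    # file layouts and copy the data over
    ceph fs add_data_pool <fs-name> cephfs-data-ec32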
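And for anyone following along on the min_alloc_size change I mentioned
above: it's just a config option, but it only takes effect for OSDs created
after it's set, so each HDD OSD has to be destroyed and redeployed to pick
it up (the exact redeploy steps depend on whether Rook or ceph-volume is
managing the OSD). Something like:

    # only applies to OSDs created after this point; existing OSDs keep the
    # min_alloc_size they were built with
    ceph config set osd bluestore_min_alloc_size_hdd 65536
    # sanity check on a redeployed OSD (osd.12 is just an example ID);
    # I believe recent releases report the value in the OSD metadata
    ceph osd metadata 12 | grep alloc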
It is interesting because Quincy had no issues with the autoscaler on the
exact same cluster config. It might be a Rook issue, or it might just be
because so many PGs are remapped. I'll take another look at that once it
reaches more of a steady state. In any case, if the balancer is designed
more for equal-sized OSDs, I can always just play with reweights to balance
things.

--
Rich