+1 for increasing PG numbers, those are quite low.
Quoting Bailey Allison <ballison@xxxxxxxxxxxx>:
Hi Reed,
Just taking a quick glance at the Pastebin provided I have to say
your cluster balance is already pretty damn good all things
considered.
We've seen the upmap balancer at its best in practice provide a
deviation of about 10-20% across OSDs, which seems to match what
you're seeing on your cluster. It does a better and better job as you
add more nodes and OSDs of equal size and as the PG count on the
cluster increases, but in practice a difference of about 10% between
OSDs is very normal.
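If you want to put a number on that spread yourself, a rough Python
sketch like the one below works; it just shells out to "ceph osd df
--format json" and summarizes per-OSD utilization. The field names
("nodes", "utilization", "kb") are what I remember the octopus JSON
looking like, so treat it as illustrative rather than gospel:

  import json
  import statistics
  import subprocess

  # Per-OSD utilization table, same data as "ceph osd df", just as JSON.
  raw = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
  nodes = json.loads(raw)["nodes"]

  # "utilization" is the percent-used figure per OSD.
  utils = [n["utilization"] for n in nodes if n.get("kb", 0) > 0]
  print("OSDs=%d  mean=%.1f%%  min=%.1f%%  max=%.1f%%  spread=%.1f%%" % (
      len(utils), statistics.mean(utils), min(utils), max(utils),
      max(utils) - min(utils)))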
Something to note about the video is that they were using a cluster
with 28 PB of storage available, so who knows how many
OSDs/nodes/PGs-per-pool/etc. their cluster has the luxury of
balancing across.
The only thing I can think to suggest is just increasing the PG
count as you've already mentioned. The ideal setting is about 100
PGs per OSD, and looking at your cluster both the SSDs and the
smaller HDDs have only about 50 PGs per OSD.
If you're able to get both of those device classes closer to a 100 PG
per OSD ratio it should help a lot more with the balancing. More PGs
means more places to distribute data.
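To see where each device class currently sits, here's a quick sketch
along the same lines (again, field names from my memory of the octopus
JSON) that groups the PGS column of "ceph osd df" by device class:

  import json
  import subprocess
  from collections import defaultdict

  raw = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
  nodes = json.loads(raw)["nodes"]

  # Group per-OSD PG counts by device class so the ssd vs hdd ratio is obvious.
  per_class = defaultdict(list)
  for n in nodes:
      if n.get("kb", 0) > 0:
          per_class[n.get("device_class", "unknown")].append(n["pgs"])

  for cls, pgs in sorted(per_class.items()):
      print("%-8s OSDs=%-4d avg PGs/OSD=%.0f min=%d max=%d" % (
          cls, len(pgs), sum(pgs) / len(pgs), min(pgs), max(pgs)))

From there, raising a pool is just "ceph osd pool set <pool> pg_num
<power-of-two>"; if I remember right, on nautilus and newer pgp_num
follows along on its own as the split proceeds.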
It will be tricky in that, I'm just noticing, for the HDDs you have
some hosts/chassis with 24 OSDs each and others with 6, so getting
the PG distribution more even for those will be challenging; for the
SSDs it should be quite simple to get them to 100 PGs per OSD.
Taking a further look, and although I will say that across the entire
cluster the actual data stored is balanced well, there are a couple
of OSDs where the OMAP/metadata is not balanced as well as on the
others.
Since you are using EC pools for CephFS, and OMAP data cannot be
stored in EC pools, all of that will be stored in a replicated CephFS
data pool, most likely your hdd_cephfs pool.
Just something to keep in mind: it's important to make sure not only
that the data is balanced, but that the OMAP data and metadata are
balanced as well.
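If you want to see which OSDs are carrying the OMAP, the same "ceph
osd df" JSON can be sorted by its OMAP column; a rough sketch (I
believe the fields are called "kb_used_omap"/"kb_used_data" on recent
releases, so adjust if yours differ):

  import json
  import subprocess

  raw = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
  nodes = json.loads(raw)["nodes"]

  # Ten OSDs with the most OMAP, alongside their data usage for contrast.
  worst = sorted(nodes, key=lambda n: n.get("kb_used_omap", 0), reverse=True)[:10]
  for n in worst:
      print("%-12s omap=%7.1f GiB  data=%8.1f GiB" % (
          n["name"],
          n.get("kb_used_omap", 0) / 1024 ** 2,
          n.get("kb_used_data", 0) / 1024 ** 2))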
Otherwise though I would recommend just trying to get your cluster to
a point where each OSD has roughly 100 PGs, or at least as close to
that as you can get given your cluster's CRUSH rulesets.
This should then help the balancer spread the data across the
cluster, but again unless I overlooked something your cluster
already appears to be extremely well balanced.
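One other knob that may be worth a look once the PG counts are sorted
(this is from my memory of the mgr docs rather than something I've
verified on your release): the upmap balancer stops once every OSD is
within mgr/balancer/upmap_max_deviation PGs of the mean, which I
believe defaults to 5; lowering it to 1 lets it keep optimizing
further. A small sketch to check and, if you like, change it:

  import subprocess

  # Current stopping threshold for the upmap balancer (deviation from the
  # mean PG count, in PGs; default is 5 as far as I know).
  out = subprocess.check_output(
      ["ceph", "config", "get", "mgr", "mgr/balancer/upmap_max_deviation"])
  print("upmap_max_deviation =", out.decode().strip())

  # Lowering it to 1 asks the balancer to keep going until every OSD is
  # within one PG of the mean per pool. Uncomment to apply:
  # subprocess.check_call(
  #     ["ceph", "config", "set", "mgr",
  #      "mgr/balancer/upmap_max_deviation", "1"])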
There is a PG calculator you can use online at:
https://old.ceph.com/pgcalc/
There is also a PG calculator on the Red Hat website, but it requires
a subscription. Both calculators are essentially the same, though I
have noticed the free one rounds the PG count down while the Red Hat
one rounds it up.
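As far as I understand it, both calculators boil down to the same
formula: target PGs per OSD x OSD count x the pool's share of the
data, divided by the pool's size (replica count, or k+m for EC),
rounded to a power of two. A tiny sketch of that, with made-up
example numbers:

  import math

  def suggested_pg_num(osd_count, pool_size, data_fraction,
                       target_pgs_per_osd=100):
      # pool_size is the replica count (or k+m for EC); data_fraction is
      # the share of the device class's data this pool holds, 0.0-1.0.
      raw = target_pgs_per_osd * osd_count * data_fraction / pool_size
      # Round to the nearest power of two (the calculators round down/up).
      return 2 ** max(0, round(math.log2(raw)))

  # Made-up example: 60 HDD OSDs, 3x replication, pool holding ~40% of the data.
  print(suggested_pg_num(60, 3, 0.40))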
Regards,
Bailey
-----Original Message-----
From: Reed Dier <reed.dier@xxxxxxxxxxx>
Sent: September 22, 2022 4:48 PM
To: ceph-users <ceph-users@xxxxxxx>
Subject: Balancer Distribution Help
Hoping someone can point me to possible tunables that could help
tighten up my OSD distribution.
Cluster is currently
"ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974)
octopus (stable)": 307
With plans to begin moving to pacific before end of year, with a
possible interim stop at octopus.17 on the way.
Cluster was born on jewel, and is fully bluestore/straw2.
The upmap balancer works/is working, but not to the degree that I
believe it could/should, which seems like it should be much closer to
near-perfect than what I'm seeing.
https://imgur.com/a/lhtZswo <- histograms of my OSD distribution
https://pastebin.com/raw/dk3fd4GH <- pastebin of cluster/pool/crush
relevant bits
To put it succinctly, I'm hoping to get much tighter OSD
distribution, but I'm not sure what knobs to try turning next, as the
upmap balancer has gone as far as it can, and I end up playing
"reweight the most-full OSD" whack-a-mole as OSDs get nearfull.
My goal is obviously something akin to the near-perfect distribution
shown here: https://www.youtube.com/watch?v=niFNZN5EKvE&t=1353s
I am looking to tweak the PG counts for a few pools.
Namely, ssd-radosobj has shrunk in size and needs far fewer PGs now.
Similarly, hdd-cephfs has shrunk as well and needs fewer PGs (as ceph
health shows).
And on the flip side, the ec*-cephfs pools likely need more PGs as
they have grown in size.
However, I was hoping to get more breathing room (free space) on my
most-full OSDs before starting any big PG expand/shrink.
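For context, a quick throwaway sketch to line up stored bytes against
pg_num per pool (the JSON field names -- "stored", "pool_name",
"pg_num" -- are what I believe octopus emits, so treat it as
illustrative only):

  import json
  import subprocess

  pools = {p["pool_name"]: p["pg_num"] for p in json.loads(
      subprocess.check_output(
          ["ceph", "osd", "pool", "ls", "detail", "--format", "json"]))}
  usage = json.loads(subprocess.check_output(
      ["ceph", "df", "--format", "json"]))["pools"]

  # Largest pools first, with their current PG counts for comparison.
  for p in sorted(usage, key=lambda p: p["stats"].get("stored", 0),
                  reverse=True):
      print("%-20s stored=%8.2f TiB  pg_num=%d" % (
          p["name"], p["stats"].get("stored", 0) / 1024 ** 4,
          pools.get(p["name"], 0)))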
I am assuming that my wacky mix of replicated vs. multiple EC storage
pools, coupled with hybrid SSD+HDD pools, is throwing off the balance
more than a more homogeneous CRUSH ruleset would, but this is what
exists and is what I'm working with.
Also, since it will look odd in the tree view: the CRUSH rulesets for
HDD pools are chooseleaf chassis, while SSD pools are chooseleaf
host.
Any tips or help would be greatly appreciated.
Reed
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx