Re: PG Sizing Question

Thank you for this perspective, Anthony.

I was honestly hoping the autoscaler would work in my case, but the results were less than I'd hoped for. On 17.2.5 it failed to scale as advertised: I had a pool created via the web console with 1 PG, then kicked off a job to migrate data, and understandably the cluster wasn't happy with several tens of terabytes landing in a pool with a single PG. So I've been scaling manually since. I'm using 12 Gbit/s SAS spinners on a mix of 6 Gbit/s and 12 Gbit/s backplanes; either way, each OSD node is designed for 4-6 Gbit/s of throughput per OSD.
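
For reference, the manual scaling amounts to something like the sketch below; the pool name and target pg_num are placeholders, and my understanding is that on recent releases setting pg_num also walks pgp_num up to match, so check that on your own version.

    import subprocess

    def set_pg_num(pool: str, pg_num: int) -> None:
        # Raise pg_num on a pool; recent releases ramp pgp_num to match.
        subprocess.run(["ceph", "osd", "pool", "set", pool, "pg_num", str(pg_num)],
                       check=True)

    set_pg_num("mypool", 1024)   # placeholder pool name and target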

Memory (on the monitors, I assume) is something that can be adjusted as well. I did notice that with many pools (10+) and a total target of 100 PGs/OSD across the cluster, it's difficult to attain an even distribution across all OSDs; some run warmer than others in terms of capacity utilization, which risks filling them up prematurely.

I was hoping the guidance would be per pool rather than cluster-wide for PGs per OSD. If this is indeed the recommended spec, I'll have to rethink the pools we have and their purpose and utilization. Looking forward to additional perspectives and best practices around this. By the sounds of it, a cluster may be configured for the 100 PGs/OSD target, and adding pools to that existing configuration will require more OSDs to maintain the recommended ratio while accommodating the extra PGs the new pools bring.

On Wed, Mar 1, 2023 at 12:58 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
This can be subtle and is easy to mix up.

The “PG ratio” is intended to be the number of PGs hosted on each OSD, plus or minus a few.

Note how I phrased that: it's not the number of PGs divided by the number of OSDs.  Remember that PGs are replicated.

While each PG belongs to exactly one pool, for purposes of estimating pg_num we calculate the desired aggregate number of PGs from this ratio, then divide that up among pools, ideally in powers of 2 per pool, relative to the amount of data in each pool.
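
For illustration, here's a rough sketch of that arithmetic in Python; the pool names, data fractions, and the round-up-to-a-power-of-2 choice are assumptions for the sake of the example, not output from any Ceph tool:

    def next_power_of_2(n: float) -> int:
        # Smallest power of 2 that is >= n.
        p = 1
        while p < n:
            p *= 2
        return p

    target_ratio = 100   # desired PGs hosted per OSD (replicas included)
    num_osds = 80

    # Per pool: (replica "size", estimated fraction of the cluster's data).
    # EC 6+4 counts as size 10.  These fractions are made up.
    pools = {
        "pool1": (3, 0.45),
        "pool2": (3, 0.45),
        "pool3": (10, 0.10),
    }

    pg_budget = target_ratio * num_osds   # total PG replicas the OSDs can host
    for name, (size, fraction) in pools.items():
        raw = pg_budget * fraction / size  # this pool's share, de-replicated
        print(f"{name}: raw={raw:.0f} -> pg_num={next_power_of_2(raw)}")

Note that rounding every pool up can overshoot the target ratio by a fair margin; rounding to the nearest power of 2 keeps the total closer to the budget.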

You can run `ceph osd df` and see the number of PGs on each OSD.  There will be some variance, but consider the average.
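
If you want the average without eyeballing the table, something like the sketch below works; the JSON field names ("nodes", "pgs") are from memory, so double-check them against your own cluster's output first.

    import json, statistics, subprocess

    # Parse `ceph osd df --format json` and summarize per-OSD PG counts.
    out = subprocess.run(["ceph", "osd", "df", "--format", "json"],
                         capture_output=True, text=True, check=True).stdout
    nodes = json.loads(out).get("nodes", [])
    pgs = [n["pgs"] for n in nodes if "pgs" in n]

    print(f"OSDs: {len(pgs)}  avg PGs/OSD: {statistics.mean(pgs):.1f}  "
          f"min: {min(pgs)}  max: {max(pgs)}")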

This venerable calculator can help get a feel for how this works.

100 is the official party line; it used to be 200.  More PGs means more memory use; too few has various other drawbacks.

PGs can in part be thought of as parallelism domains; more PGs means more parallelism.  So on HDDs a ratio in the 100-200 range is IMHO reasonable; for SAS/SATA SSD OSDs, 200-300; NVMe OSDs perhaps higher, though perhaps not if each device hosts more than one OSD (which should only ever be done on NVMe devices).

Your numbers below are probably OK for HDDs; you might bump the pool with the most data up to the next power of 2 if these are SSDs.

The pgcalc above includes parameters for what fraction of the cluster’s data each pool contains.  A pool with 5% of the data needs fewer PGs than a pool with 50% of the cluster’s data.

Others may well have different perspectives; this is something where opinions vary.  The pg_autoscaler in bulk mode can automate this, if one is prescient in feeding it parameters.
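
For what it's worth, marking a pool as bulk looks roughly like the sketch below; "mypool" is a placeholder, and you'll want to confirm your release supports the flag before relying on it.

    import subprocess

    # Tell the autoscaler this pool will eventually hold a lot of data, so it
    # should start with a larger pg_num instead of growing it gradually.
    subprocess.run(["ceph", "osd", "pool", "set", "mypool", "bulk", "true"],
                   check=True)

    # Review what the autoscaler now intends to do.
    subprocess.run(["ceph", "osd", "pool", "autoscale-status"], check=True)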



On Feb 28, 2023, at 9:23 PM, Deep Dish <deeepdish@xxxxxxxxx> wrote:

Hello



Looking to get some official guidance on PG and PGP sizing.



Is the goal to maintain approximately 100 PGs per OSD per pool, or for the
cluster in general?



Assume the following scenario:



Cluster with 80 OSDs across 8 nodes;

3 Pools:

-       Pool1 = Replicated 3x

-       Pool2 = Replicated 3x

-       Pool3 = Erasure Coded 6-4





Assuming the widely published formula:



Let (Target PGs / OSD) = 100



[ (Target PGs / OSD) * (# of OSDs) ] / (Replica Size)



-       Pool1 = (100*80)/3 = 2666.67 => 4096

-       Pool2 = (100*80)/3 = 2666.67 => 4096

-       Pool3 = (100*80)/10 = 800 => 1024



Total cluster would have 9216 PGs and PGPs.
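
A quick sanity check of those numbers in Python (rounding up to the next power of 2, as above):

    def next_pow2(n: float) -> int:
        p = 1
        while p < n:
            p *= 2
        return p

    target, osds = 100, 80
    sizes = {"Pool1": 3, "Pool2": 3, "Pool3": 10}   # EC 6+4 counts as size 10
    pg_nums = {name: next_pow2(target * osds / size) for name, size in sizes.items()}
    print(pg_nums)                    # {'Pool1': 4096, 'Pool2': 4096, 'Pool3': 1024}
    print(sum(pg_nums.values()))      # 9216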


Are there any implications (performance / monitor / MDS / RGW sizing) with
how many PGs are created on the cluster?



Looking for validation and / or clarification of the above.



Thank you.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
