Re: PG calculator improvement

I think what fits Frédéric's need, without adding complexity for new users, would be a list of known "gotchas" in PG counts. For example:
* A non-power-of-2 PG count makes PGs variable in size: for each PG past the last power of two, you get two PGs that are half the size of the others.
* Having fewer than X PGs for a given amount of data on your number of OSDs will cause balance problems.
* Having more than X objects for the selected PG count will cause issues.
* Having more than X PGs per OSD in total (not just per pool) can drive up memory requirements; this is especially important for people setting up multiple RGW zones.
etc.
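To illustrate the first gotcha: Ceph maps objects to PGs with a stable-mod hash, so when pg_num sits between two powers of two, the PGs created past the last power of two are splits, each holding roughly half the data of an unsplit PG. A toy sketch of the resulting distribution (this helper is mine for illustration, not part of any Ceph tooling):

```python
import math

def pg_size_distribution(pg_num):
    """Rough sketch of stable-mod hashing: PGs past the last power of
    two are splits, each holding half the data of an unsplit PG."""
    base = 2 ** int(math.log2(pg_num))  # last power of two <= pg_num
    half_size = 2 * (pg_num - base)     # PGs holding a half share of data
    full_size = 2 * base - pg_num       # PGs holding a full share of data
    # each full-size PG holds ~1/base of the pool's data,
    # each half-size PG holds ~1/(2*base)
    return {"full_size_pgs": full_size, "half_size_pgs": half_size}

# e.g. pg_num = 1536: 1024 half-size PGs and only 512 full-size PGs,
# so the data is spread quite unevenly across PGs
```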

On Thu, Apr 13, 2017 at 12:58 PM Michael Kidd <linuxkidd@xxxxxxxxxx> wrote:
Hello Frédéric,
  Thank you very much for the input.  I would like to ask for some feedback from you, as well as the ceph-users list at large.  

The PGCalc tool was created to help steer new Ceph users in the right direction, but it's certainly difficult to account for every possible scenario.  I'm struggling to find a way to implement something that would work better for the scenario that you (Frédéric) describe, while still being a useful starting point for the novice / more mainstream use cases.  I've also gotten complaints at the other end of the spectrum, that the tool expects the user to know too much already, so accounting for the number of objects is bound to add to this sentiment.

As the Ceph user base expands and the use cases diverge, we are definitely finding more edge cases that are causing pain.  I'd love to make something to help prevent these types of issues, but again, I worry about the complexity introduced.

With this, I see a few possible ways forward:
* Simply re-word the %data input as % of object count -- but this seems more abstract, again leading to more confusion among new users.
* Increase the complexity of the PG Calc tool, at the risk of further alienating novice/mainstream users.
* Add a disclaimer about the tool being a base for decision making, but that certain edge cases require adjustments to the recommended PG count and/or ceph.conf & sysctl values.
* Add a disclaimer urging the end user to secure storage consulting if their use case falls into certain categories or they are new to Ceph to ensure the cluster will meet their needs.

Having been on the storage consulting team and knowing the expertise they have, I strongly believe that newcomers to Ceph (or new use cases inside of established customers) should secure consulting before final decisions are made on hardware... let alone the cluster is deployed.  I know it seems a bit self-serving to make this suggestion as I work at Red Hat, but there is a lot on the line when any establishment is storing potentially business critical data.

I suspect the answer lies in a combination of the above or in something I've not thought of.  Please do weigh in as any and all suggestions are more than welcome.

Thanks,
Michael J. Kidd
Principal Software Maintenance Engineer
Red Hat Ceph Storage
+1 919-442-8878


On Wed, Apr 12, 2017 at 6:35 AM, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

Hi,

I wanted to share a bad experience we had due to how the PG calculator works.

When we set up our production cluster months ago, we had to decide on the number of PGs to give each pool in the cluster.
As you know, the PG calc recommends giving a lot of PGs to pools that are heavy in size, regardless of the number of objects in the pools. How bad...
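For reference, the calculator's size-based heuristic can be sketched roughly as follows (assumption on my part: a target of ~100 PGs per OSD and rounding up to a power of two, which reproduces the recommendations below; the real tool's rounding and minimums may differ):

```python
def suggested_pg_num(target_pgs_per_osd, osd_count, percent_data, pool_size):
    """Rough sketch of the PGCalc heuristic: spread a target PG count
    per OSD across pools by their share of data, divided by the pool's
    replica/EC size, then round up to a power of two."""
    raw = target_pgs_per_osd * osd_count * (percent_data / 100.0) / pool_size
    power = 1
    while power < raw:          # round up to the next power of two
        power *= 2
    return power

# 144 OSDs, EC5+4 pool (size 9) with 80% of the data:
suggested_pg_num(100, 144, 80, 9)  # -> 2048
```

Note that object count appears nowhere in this formula, which is exactly the gap being discussed.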

We essentially had 3 pools to set up on 144 OSDs:

1. an EC5+4 pool for the radosGW (.rgw.buckets) that would hold 80% of all data in the cluster. PG calc recommended 2048 PGs.
2. an EC5+4 pool for Zimbra's data (emails) that would hold 20% of all data. PG calc recommended 512 PGs.
3. a replicated pool for Zimbra's metadata (null-size objects holding xattrs, used for deduplication) that would hold 0% of all data. PG calc recommended 128 PGs, but we decided on 256.

With 120M objects in pool #3, as soon as we upgraded to Jewel, we hit the Jewel scrubbing bug (OSDs flapping).
Before we could upgrade to a patched Jewel and scrub the whole cluster again prior to increasing the number of PGs on this pool, we had to take more than a hundred snapshots (for backup/restore purposes), with the number of objects still growing in the pool. Then, when a snapshot was removed, we hit the current Jewel snap-trimming bug affecting pools with too many objects for their number of PGs. The only way we could stop the trimming was to stop OSDs, leaving PGs degraded and no longer trimming (snap trimming only happens on active+clean PGs).
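The arithmetic makes the mismatch clear: a pool the calculator sees as holding ~0% of the data can still concentrate an enormous object count per PG (the per-PG comfort threshold below is my assumption, not an official Ceph limit):

```python
def objects_per_pg(object_count, pg_num):
    """Average object count per PG; scrub and snap trimming work
    roughly per object, so very high values per PG mean long, heavy
    per-PG operations (hundreds of thousands per PG is a lot)."""
    return object_count / pg_num

# pool #3 above: 120M near-zero-size objects over 256 PGs
objects_per_pg(120_000_000, 256)  # -> 468750.0 objects per PG
```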

We're now just getting out of this hole, thanks to Nick's post about osd_snap_trim_sleep and the RHCS support team's expertise.

If the PG calc had considered not only the pools' data weight but also the expected number of objects in each pool (which we knew at the time), we wouldn't have hit these two bugs.
We hope this will help improve the ceph.com and RHCS PG calculators.

Regards,

Frédéric.

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

