Re: PG calculator improvement

I think what fits Frédéric's need, without adding complexity for new users, would be a list of known "gotchas" in PG counts. For example:
* A non-power-of-2 PG count makes PGs variable in size: for each PG past the last power of two, you get two PGs that are half the size of the others.
* Having fewer than X PGs for a given amount of data on your number of OSDs will cause balance problems.
* Having more than X objects for the selected PG count will cause issues.
* Having more than X PGs per OSD in total (not just per pool) can drive up memory requirements; this is especially important for people setting up multiple RGW zones.
etc.
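To illustrate the first gotcha: Ceph maps objects to PGs with a stable-mod hash, so when pg_num sits between two powers of two, the PGs created past the last power of two are splits, each holding roughly half the data of an unsplit PG. A toy sketch of the resulting distribution (this helper is mine for illustration, not part of any Ceph tooling):

```python
import math

def pg_size_distribution(pg_num):
    """Rough sketch of stable-mod hashing: PGs past the last power of
    two are splits, each holding half the data of an unsplit PG."""
    base = 2 ** int(math.log2(pg_num))  # last power of two <= pg_num
    half_size = 2 * (pg_num - base)     # PGs holding a half share of data
    full_size = 2 * base - pg_num       # PGs holding a full share of data
    # each full-size PG holds ~1/base of the pool's data,
    # each half-size PG holds ~1/(2*base)
    return {"full_size_pgs": full_size, "half_size_pgs": half_size}

# e.g. pg_num = 1536: 1024 half-size PGs and only 512 full-size PGs,
# so the data is spread quite unevenly across PGs
```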

On Thu, Apr 13, 2017 at 12:58 PM Michael Kidd <linuxkidd@xxxxxxxxxx> wrote:
Hello Frédéric,
  Thank you very much for the input.  I would like to ask for some feedback from you, as well as the ceph-users list at large.  

The PGCalc tool was created to help steer new Ceph users in the right direction, but it's certainly difficult to account for every possible scenario.  I'm struggling to find a way to implement something that would work better for the scenario that you (Frédéric) describe, while still being a useful starting point for the novice / more mainstream use cases.  I've also gotten complaints at the other end of the spectrum, that the tool expects the user to know too much already, so accounting for the number of objects is bound to add to this sentiment.

As the Ceph user base expands and the use cases diverge, we are definitely finding more edge cases that are causing pain.  I'd love to make something to help prevent these types of issues, but again, I worry about the complexity introduced.

With this, I see a few possible ways forward:
* Simply re-word the %data input as % of object count -- but this seems more abstract, again leading to more confusion among new users.
* Increase the complexity of the PG Calc tool, at the risk of further alienating novice/mainstream users.
* Add a disclaimer about the tool being a base for decision making, but that certain edge cases require adjustments to the recommended PG count and/or ceph.conf & sysctl values.
* Add a disclaimer urging the end user to secure storage consulting if their use case falls into certain categories or they are new to Ceph to ensure the cluster will meet their needs.

Having been on the storage consulting team and knowing the expertise they have, I strongly believe that newcomers to Ceph (or new use cases inside of established customers) should secure consulting before final decisions are made on hardware... let alone the cluster is deployed.  I know it seems a bit self-serving to make this suggestion as I work at Red Hat, but there is a lot on the line when any establishment is storing potentially business critical data.

I suspect the answer lies in a combination of the above or in something I've not thought of.  Please do weigh in as any and all suggestions are more than welcome.

Thanks,
Michael J. Kidd
Principal Software Maintenance Engineer
Red Hat Ceph Storage
+1 919-442-8878


On Wed, Apr 12, 2017 at 6:35 AM, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

Hi,

I wanted to share a bad experience we had due to how the PG calculator works.

When we set up our production cluster months ago, we had to decide on the number of PGs to give each pool in the cluster.
As you know, the PG calc recommends giving a lot of PGs to pools that are heavy in size, regardless of the number of objects in the pools. How bad...
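For reference, the calculator's size-based heuristic can be sketched roughly as follows (assumption on my part: a target of ~100 PGs per OSD and rounding up to a power of two, which reproduces the recommendations below; the real tool's rounding and minimums may differ):

```python
def suggested_pg_num(target_pgs_per_osd, osd_count, percent_data, pool_size):
    """Rough sketch of the PGCalc heuristic: spread a target PG count
    per OSD across pools by their share of data, divided by the pool's
    replica/EC size, then round up to a power of two."""
    raw = target_pgs_per_osd * osd_count * (percent_data / 100.0) / pool_size
    power = 1
    while power < raw:          # round up to the next power of two
        power *= 2
    return power

# 144 OSDs, EC5+4 pool (size 9) with 80% of the data:
suggested_pg_num(100, 144, 80, 9)  # -> 2048
```

Note that object count appears nowhere in this formula, which is exactly the gap being discussed.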

We essentially had 3 pools to set up on 144 OSDs:

1. an EC5+4 pool for the radosGW (.rgw.buckets) that would hold 80% of all data in the cluster. PG calc recommended 2048 PGs.
2. an EC5+4 pool for Zimbra's data (emails) that would hold 20% of all data. PG calc recommended 512 PGs.
3. a replicated pool for Zimbra's metadata (null-size objects holding xattrs, used for deduplication) that would hold 0% of all data. PG calc recommended 128 PGs, but we decided on 256.

With 120M objects in pool #3, as soon as we upgraded to Jewel, we hit the Jewel scrubbing bug (OSDs flapping).
Before we could upgrade to a patched Jewel and scrub the whole cluster again prior to increasing the number of PGs on this pool, we had to take more than a hundred snapshots (for backup/restore purposes), with the number of objects still growing in the pool. Then, when a snapshot was removed, we hit the current Jewel snap-trimming bug affecting pools with too many objects for their number of PGs. The only way we could stop the trimming was to stop OSDs, leaving PGs degraded and no longer trimming (snap trimming only happens on active+clean PGs).
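The arithmetic makes the mismatch clear: a pool the calculator sees as holding ~0% of the data can still concentrate an enormous object count per PG (the per-PG comfort threshold below is my assumption, not an official Ceph limit):

```python
def objects_per_pg(object_count, pg_num):
    """Average object count per PG; scrub and snap trimming work
    roughly per object, so very high values per PG mean long, heavy
    per-PG operations (hundreds of thousands per PG is a lot)."""
    return object_count / pg_num

# pool #3 above: 120M near-zero-size objects over 256 PGs
objects_per_pg(120_000_000, 256)  # -> 468750.0 objects per PG
```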

We're now just getting out of this hole, thanks to Nick's post about osd_snap_trim_sleep and the RHCS support team's expertise.

If the PG calc had considered not only the pools' data weight but also the expected number of objects in each pool (which we knew at the time), we wouldn't have hit these two bugs.
We hope this will help improve the ceph.com and RHCS PG calculators.

Regards,

Frédéric.

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

