Re: [ceph-users] PG calculator improvement

Hi Michael, David,

Actually, we did put a lot of work into this up front (and then a lot more work :-)), with the help of an RHCS consultant (an Inktank pioneer :-)) during a 5-day on-site Jumpstart.

With his precious help, we deployed our production cluster and set the right options in ceph.conf, along with the right crushmap and crush rules for our failure domains and failure tolerances. I remember him talking about 2048 or 4096 PGs for this pool. Still, when it was time to decide (months later), my colleague and I disagreed on how many PGs we should give this pool, essentially because it was not supposed to hold the majority of the data in the cluster. So we used the RHCS PG calculator, which said to go for 128 PGs. I thought we would still be able to increase that number later, but we couldn't, not before we hit the scrubbing and snap trimming bugs.

Now what might help with the PG calculator would be:

- a new "type" column (RBD, RGW, RADOS)
- a "data in TB" column
- a new "object count" column (set to "data in TB" * 1,000,000 / 4 MB if type is RBD or RGW, or left to the user's choice if type is RADOS)
- a new "object size" column (set to 4 MB if type is RBD or RGW, or left to the user's choice if type is RADOS)

If type is RADOS, the "object size" column could be recomputed when "data in TB" and/or "object count" changes; that is, changing any one of the three columns would update the other two. A rough sketch of this linkage follows below.
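
To make the linkage concrete, here is a minimal sketch in Python of how the columns could drive each other; the function and column names (pool_type, data_tb, object_size_mb) are hypothetical and not part of the existing calculator:

    # Hypothetical sketch: derive the "object count" column from the
    # "data in TB" and "object size" columns, depending on pool type.
    def derive_object_count(pool_type, data_tb, object_size_mb=4.0, user_count=None):
        if pool_type in ("RBD", "RGW"):
            # data in TB * 1,000,000 MB per TB / 4 MB per object
            return int(data_tb * 1_000_000 / object_size_mb)
        # RADOS pools: leave the object count to the user's choice
        return user_count

For example, 100 TB of RBD data would map to roughly 25 million 4 MB objects.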

A "% object count" column alone would not help, as the user wouldn't know how to fill it, and we don't need to set PGs based on the global number of objects across all pools, right? What we are trying to avoid is PG directory trees holding several hundreds of thousands of files, as this is where the pain comes from regarding scrubbing, snap trimming, recovering, etc.

So the tool should consider not only the amount of data but also the number of objects when advising on the number of PGs to set on a pool.
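
As a rough illustration of that idea, a calculator could take the larger of a data-driven and an object-driven PG target; the thresholds below (100 PGs per OSD, 500,000 objects per PG) are only assumptions for the sketch, not official guidance:

    # Hypothetical sketch: pick pg_num from both data share and object count.
    def advise_pg_num(osd_count, replica_count, data_percent, object_count,
                      target_pgs_per_osd=100, max_objects_per_pg=500_000):
        by_data = osd_count * target_pgs_per_osd * (data_percent / 100.0) / replica_count
        by_objects = object_count / max_objects_per_pg
        # take the larger target, then round up to a power of two
        target = max(by_data, by_objects, 1)
        return 1 << (int(target) - 1).bit_length()

For instance, with 144 OSDs, 3 replicas, roughly 0% of the data and 120M objects, the object-driven term dominates and this sketch lands on 256 PGs rather than 128.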

I hope this gives some clues to help redesign the tool.

Best regards,

Frédéric.


On 13/04/2017 at 19:08, David Turner wrote:
I think what fits Frédéric's need, without adding complexity for new users, would be a list of known "gotchas" in PG counts, for example:

- not having a power-of-two (base-2) PG count makes PGs variably sized (for each PG past the last power of two, you get two PGs that are half the size of the others);
- having fewer than X PGs for so much data on your number of OSDs causes balance problems;
- having more than X objects for the selected PG count causes issues;
- having more than X PGs per OSD in total (not just per pool) can cause high memory requirements (this is especially important for people setting up multiple RGW zones);
- and so on.
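
A minimal sketch of what such checks could look like in Python; every numeric threshold below is a placeholder, not an established limit:

    # Hypothetical "gotcha" checks; all thresholds here are placeholders.
    def pg_gotchas(pg_num, total_pgs_per_osd, objects_per_pg):
        warnings = []
        if pg_num & (pg_num - 1):
            warnings.append("pg_num is not a power of two: PGs will be unevenly sized")
        if total_pgs_per_osd > 300:
            warnings.append("high total PGs per OSD: expect higher memory requirements")
        if objects_per_pg > 500_000:
            warnings.append("many objects per PG: scrubbing and snap trimming may hurt")
        return warnings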

On Thu, Apr 13, 2017 at 12:58 PM Michael Kidd <linuxkidd@xxxxxxxxxx> wrote:

    Hello Frédéric,
      Thank you very much for the input.  I would like to ask for some
    feedback from you, as well as the ceph-users list at large.

    The PGCalc tool was created to help steer new Ceph users in the
    right direction, but it's certainly difficult to account for every
    possible scenario.  I'm struggling to find a way to implement
    something that would work better for the scenario that you
    (Frédéric) describe, while still being a useful starting point for
    the novice / more mainstream use cases.  I've also gotten
    complaints at the other end of the spectrum, that the tool expects
    the user to know too much already, so accounting for the number of
    objects is bound to add to this sentiment.

    As the Ceph user base expands and the use cases diverge, we are
    definitely finding more edge cases that are causing pain.  I'd
    love to make something to help prevent these types of issues, but
    again, I worry about the complexity introduced.

    With this, I see a few possible ways forward:
    * Simply re-wording the %data column to be % object count -- but this
    seems more abstract, again leading to more confusion for new users.
    * Increase complexity of the PG Calc tool, at the risk of further
    alienating novice/mainstream users
    * Add a disclaimer about the tool being a base for decision
    making, but that certain edge cases require adjustments to the
    recommended PG count and/or ceph.conf & sysctl values.
    * Add a disclaimer urging the end user to secure storage
    consulting if their use case falls into certain categories or they
    are new to Ceph to ensure the cluster will meet their needs.

    Having been on the storage consulting team and knowing the
    expertise they have, I strongly believe that newcomers to Ceph (or
    new use cases inside of established customers) should secure
    consulting before final decisions are made on hardware... let
    alone before the cluster is deployed.  I know it seems a bit self-serving
    to make this suggestion as I work at Red Hat, but there is a lot on
    the line when any establishment is storing potentially business
    critical data.

    I suspect the answer lies in a combination of the above or in
    something I've not thought of. Please do weigh in, as any and all
    suggestions are more than welcome.

    Thanks,
    Michael J. Kidd
    Principal Software Maintenance Engineer
    Red Hat Ceph Storage
    +1 919-442-8878


    On Wed, Apr 12, 2017 at 6:35 AM, Frédéric Nass
    <frederic.nass@xxxxxxxxxxxxxxxx> wrote:


        Hi,

        I wanted to share a bad experience we had due to how the PG
        calculator works.

        When we set up our production cluster months ago, we had to
        decide on the number of PGs to give to each pool in the cluster.
        As you know, the PG calc recommends giving a lot of PGs to pools
        that are heavy in size, regardless of the number of objects in
        the pools. How bad...

        We essentially had 3 pools to set up on 144 OSDs (see the rough
        per-OSD arithmetic after this list):

        1. an EC 5+4 pool for the radosgw (.rgw.buckets) that would hold
        80% of all data in the cluster. PG calc recommended 2048 PGs.
        2. an EC 5+4 pool for Zimbra's data (emails) that would hold 20%
        of all data. PG calc recommended 512 PGs.
        3. a replicated pool for Zimbra's metadata (null-size objects
        holding xattrs, used for deduplication) that would hold 0% of
        all data. PG calc recommended 128 PGs, but we decided on 256.
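
        For a rough idea of the resulting per-OSD load, here is a
        back-of-the-envelope calculation (assuming 3x replication for
        pool #3, which is not stated above):

            # PG shards landing on the 144 OSDs for the three pools
            pools = [
                (2048, 9),  # .rgw.buckets, EC 5+4 -> 9 shards per PG
                (512, 9),   # Zimbra data, EC 5+4
                (256, 3),   # Zimbra metadata, replicated x3 (assumed)
            ]
            shards = sum(pg_num * size for pg_num, size in pools)  # 23808
            print(shards / 144)  # about 165 PG shards per OSD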

        With 120M objects in pool #3, we hit the Jewel scrubbing bug
        (OSDs flapping) as soon as we upgraded to Jewel.
        Before we could upgrade to a patched Jewel and scrub the whole
        cluster again prior to increasing the number of PGs on this
        pool, we had to take more than a hundred snapshots (for
        backup/restoration purposes), with the number of objects still
        increasing in the pool. Then, when a snapshot was removed, we
        hit the current Jewel snap trimming bug affecting pools with
        too many objects for their number of PGs. The only way we could
        stop the trimming was to stop OSDs, resulting in PGs being
        degraded and not trimming anymore (snap trimming only happens
        on active+clean PGs).

        We're now just getting out of this hole, thanks to Nick's post
        regarding osd_snap_trim_sleep and RHCS support expertise.
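
        For reference, the kind of throttle discussed there is set
        through the osd_snap_trim_sleep option; the value below is
        purely illustrative, not a recommendation:

            # ceph.conf, [osd] section: pause between snap trim
            # operations, in seconds (illustrative value)
            [osd]
            osd_snap_trim_sleep = 0.1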

        If the PG calc had considered not only the pools' weight but
        also the expected number of objects in each pool (which we knew
        at that time), we wouldn't have hit these 2 bugs.
        We hope this will help improve the ceph.com and RHCS PG
        calculators.

        Regards,

        Frédéric.

--
        Frédéric Nass

        Sous-direction Infrastructures
        Direction du Numérique
        Université de Lorraine

        Tél : +33 3 72 74 11 35
