Hi Michael, David,
Actually, we did start with a lot of work (and then a lot of work :-))
and with the help of an RHCS consultant (an Inktank pioneer :-)) during a
5-day on-site Jumpstart.
With his valuable help, we deployed our production cluster and set the
right options in ceph.conf, and the right crushmap and crush rules for
our failure domains and failure tolerances.
I remember him talking about 2048 or 4096 PGs for this pool. Still, once
it was time to decide (months later), my colleague and I disagreed on
how many PGs we should set on this pool, mainly because this pool was
not supposed to hold the majority of the data in the cluster. So we
used the RHCS PG calculator, which said to go for 128 PGs. I thought we
would still be able to increase that number in the future, but we
couldn't. Not before we hit the scrubbing and snap trimming bugs.
Now what might help with the PG calculator would be:
- a new "type" column (RBD, RGW, Rados)
- a "data in TB" column
- a new "object count" column (set to "data in TB" * 1,000,000 / 4 MB
if the type is RBD or RGW, or left to the user's choice if the type is
Rados)
- a new "object size" column (4 MB if the type is RBD or RGW, or left to
the user's choice if the type is Rados)
If the type is Rados, the "object size" column could change when "data
in TB" and/or "object count" changes. I mean any of the 3 columns could
be recomputed whenever one of the 2 other columns changes.
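As a rough sketch of how these three linked columns could be computed
(the function names are hypothetical and not part of any existing PG
calculator; decimal units are assumed, i.e. 1 TB = 1,000,000 MB, as in
the formula above):

```python
def estimated_object_count(data_tb: float, object_size_mb: float = 4.0) -> int:
    """Object count = data in TB * 1,000,000 / object size in MB."""
    if object_size_mb <= 0:
        raise ValueError("object size must be positive")
    return round(data_tb * 1_000_000 / object_size_mb)


def implied_object_size_mb(data_tb: float, object_count: int) -> float:
    """Object size in MB derived from the two other columns."""
    if object_count <= 0:
        raise ValueError("object count must be positive")
    return data_tb * 1_000_000 / object_count


def implied_data_tb(object_count: int, object_size_mb: float) -> float:
    """Data in TB derived from the two other columns."""
    return object_count * object_size_mb / 1_000_000


# RBD/RGW case: 4 MB objects are assumed, so only "data in TB" is needed.
print(estimated_object_count(100))  # 100 TB of 4 MB objects
```

For the Rados type, the user would fill in any two of the three values
and the third would be derived with the matching function.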
The %object count (percentage object count) would not help, as the user
wouldn't know how to fill it in, and we don't need to set PGs based on
the global number of objects across all pools, right?
What we try to avoid is having PG directory trees with several hundred
thousand files in them, as this is where the pain comes from
regarding scrubbing, snap trimming, recovery, etc.
So the tool should consider not only the amount of data but also the
number of objects when advising on the number of PGs to set on a pool.
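To make the idea concrete, here is a minimal sketch of an object-aware
check. The function names and the 100,000 objects-per-PG cap are purely
illustrative assumptions, not a Ceph recommendation; the 120M objects
and 256 PGs are the figures from the original report quoted further
down.

```python
import math


def objects_per_pg(object_count: int, pg_num: int) -> float:
    """Average number of objects each PG of a pool has to hold."""
    if pg_num <= 0:
        raise ValueError("pg_num must be positive")
    return object_count / pg_num


def min_pg_num_for_objects(object_count: int, max_objects_per_pg: int) -> int:
    """Smallest power-of-two PG count keeping the average under the cap."""
    if max_objects_per_pg <= 0:
        raise ValueError("max_objects_per_pg must be positive")
    needed = math.ceil(object_count / max_objects_per_pg)
    pg_num = 1
    while pg_num < needed:
        pg_num *= 2
    return pg_num


# Metadata pool from the report: 120M objects on 256 PGs.
print(objects_per_pg(120_000_000, 256))
# Hypothetical cap of 100,000 objects per PG (an assumption, tune as needed).
print(min_pg_num_for_objects(120_000_000, 100_000))
```

A calculator could take the larger of this value and the size-based
recommendation, so that a pool holding almost no data but many objects
is not left with too few PGs.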
I hope this gives some clues to help redesign the tool.
Best regards,
Frédéric.
On 13/04/2017 at 19:08, David Turner wrote:
I think what fits the need of Frédéric while not impacting the
complexity of the tool for new users would be a list of known
"gotchas" in PG counts. For example: not having a Base2 (power-of-two)
count of PGs will cause PGs to be variably sized (for each PG past the
last Base2, you have 2 PGs that are half the size of the others);
having fewer than X PGs for so much data on your number of OSDs will
cause balance problems; having more than X objects for the PGs
selected will cause issues; having more than X PGs per OSD in total
(not just per pool) can cause high memory requirements (this is
especially important for people setting up multiple RGW zones); etc.
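The first gotcha can be illustrated with a small sketch. It only
applies the rule as described above (each PG past the last power of two
yields 2 half-size PGs) and is a simplified model, not Ceph's actual
placement code; the function name is hypothetical.

```python
def pg_size_split(pg_num: int) -> tuple[int, int]:
    """Return (full_size_pgs, half_size_pgs) for a given PG count."""
    if pg_num < 1:
        raise ValueError("pg_num must be at least 1")
    base = 1 << (pg_num.bit_length() - 1)  # largest power of two <= pg_num
    extra = pg_num - base                  # PGs past the last power of two
    half = 2 * extra                       # each extra PG gives 2 half-size PGs
    full = base - extra                    # remaining PGs keep the full size
    return full, half


for pg_num in (256, 300, 384):
    full, half = pg_size_split(pg_num)
    print(f"pg_num={pg_num}: {full} full-size PGs, {half} half-size PGs")
```

With a power-of-two count the second value is always 0, which is why
such counts keep PGs evenly sized.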
On Thu, Apr 13, 2017 at 12:58 PM Michael Kidd <linuxkidd@xxxxxxxxxx>
wrote:
Hello Frédéric,
Thank you very much for the input. I would like to ask for some
feedback from you, as well as the ceph-users list at large.
The PGCalc tool was created to help steer new Ceph users in the
right direction, but it's certainly difficult to account for every
possible scenario. I'm struggling to find a way to implement
something that would work better for the scenario that you
(Frédéric) describe, while still being a useful starting point for
the novice / more mainstream use cases. I've also gotten
complaints at the other end of the spectrum, that the tool expects
the user to know too much already, so accounting for the number of
objects is bound to add to this sentiment.
As the Ceph user base expands and the use cases diverge, we are
definitely finding more edge cases that are causing pain. I'd
love to make something to help prevent these types of issues, but
again, I worry about the complexity introduced.
With this, I see a few possible ways forward:
* Simply rewording the %data to be % object count -- but this
seems more abstract, again leading to more confusion for new users.
* Increase complexity of the PG Calc tool, at the risk of further
alienating novice/mainstream users
* Add a disclaimer about the tool being a base for decision
making, but that certain edge cases require adjustments to the
recommended PG count and/or ceph.conf & sysctl values.
* Add a disclaimer urging the end user to secure storage
consulting if their use case falls into certain categories or they
are new to Ceph to ensure the cluster will meet their needs.
Having been on the storage consulting team and knowing the
expertise they have, I strongly believe that newcomers to Ceph (or
new use cases inside of established customers) should secure
consulting before final decisions are made on hardware... let
alone before the cluster is deployed. I know it seems a bit
self-serving to make this suggestion as I work at Red Hat, but there
is a lot on the line when any establishment is storing potentially
business-critical data.
I suspect the answer lies in a combination of the above or in
something I've not thought of. Please do weigh in, as any and all
suggestions are more than welcome.
Thanks,
Michael J. Kidd
Principal Software Maintenance Engineer
Red Hat Ceph Storage
+1 919-442-8878
On Wed, Apr 12, 2017 at 6:35 AM, Frédéric Nass
<frederic.nass@xxxxxxxxxxxxxxxx> wrote:
Hi,
I wanted to share a bad experience we had due to how the PG
calculator works.
When we set up our production cluster months ago, we had to
decide on the number of PGs to give to each pool in the cluster.
As you know, the PG calc recommends giving a lot of PGs to pools
that are heavy in size, regardless of the number of objects
in the pools. How bad...
We essentially had 3 pools to set up on 144 OSDs:
1. an EC5+4 pool for the radosGW (.rgw.buckets) that would hold
80% of all data in the cluster. PG calc recommended 2048 PGs.
2. an EC5+4 pool for zimbra's data (emails) that would hold 20%
of all data. PG calc recommended 512 PGs.
3. a replicated pool for zimbra's metadata (zero-size objects
holding xattrs - used for deduplication) that would hold 0% of
all data. PG calc recommended 128 PGs, but we decided on 256.
With 120M objects in pool #3, as soon as we upgraded to
Jewel, we hit the Jewel scrubbing bug (OSDs flapping).
Before we could upgrade to patched Jewel and scrub the whole
cluster again prior to increasing the number of PGs on this
pool, we had to take more than a hundred snapshots (for
backup/restoration purposes), with the number of objects still
increasing in the pool. Then when a snapshot was removed, we
hit the current Jewel snap trimming bug affecting pools with
too many objects for the number of PGs. The only way we could
stop the trimming was to stop OSDs, resulting in PGs being
degraded and no longer trimming (snap trimming only happens
on active+clean PGs).
We're now just getting out of this hole, thanks to Nick's post
regarding osd_snap_trim_sleep and RHCS support expertise.
If the PG calc had considered not only the pool's weight but
also the number of expected objects in the pool (which we knew
by that time), we wouldn't have hit these 2 bugs.
We hope this will help improve the ceph.com and RHCS PG
calculators.
Regards,
Frédéric.
--
Frédéric Nass
Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine
Tel: +33 3 72 74 11 35
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html