Hi,
I wanted to share a bad experience we had due to how the PG calculator
works.
When we set up our production cluster months ago, we had to decide on the
number of PGs to give to each pool in the cluster.
As you know, the PG calc recommends giving a lot of PGs to pools that are
heavy in size, regardless of the number of objects in the pools. How bad...
We essentially had 3 pools to set up on 144 OSDs:
1. an EC 5+4 pool for the radosGW (.rgw.buckets) that would hold 80% of
all data in the cluster. PG calc recommended 2048 PGs.
2. an EC 5+4 pool for zimbra's data (emails) that would hold 20% of all
data. PG calc recommended 512 PGs.
3. a replicated pool for zimbra's metadata (zero-size objects holding
xattrs - used for deduplication) that would hold 0% of all data. PG
calc recommended 128 PGs, but we decided on 256.
With 120M objects in pool #3, as soon as we upgraded to Jewel, we hit
the Jewel scrubbing bug (OSDs flapping).
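
To put a number on that density, here is a rough sketch of the average
object count per PG in pool #3. It assumes objects spread evenly across
PGs; the helper name objects_per_pg is made up for this illustration and
is not part of any Ceph tool.

```python
def objects_per_pg(total_objects: int, pg_count: int) -> float:
    """Average objects per PG, assuming an even spread across PGs."""
    if pg_count <= 0:
        raise ValueError("pg_count must be positive")
    return total_objects / pg_count


POOL3_OBJECTS = 120_000_000  # object count quoted in this message

# 128 is what the PG calc recommended, 256 is what we actually chose.
for pgs in (128, 256):
    avg = objects_per_pg(POOL3_OBJECTS, pgs)
    print(f"{pgs} PGs -> {avg:,.0f} objects per PG")
```

Even with 256 PGs, each PG carries several hundred thousand objects on
average.
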
Before we could upgrade to a patched Jewel and scrub the whole cluster
again prior to increasing the number of PGs on this pool, we had to take
more than a hundred snapshots (for backup/restoration purposes), with the
number of objects still increasing in the pool. Then, when a snapshot was
removed, we hit the current Jewel snap trimming bug affecting pools with
too many objects for the number of PGs. The only way we could stop the
trimming was to stop OSDs, which left PGs degraded and therefore no longer
trimming (snap trimming only happens on active+clean PGs).
We're now just getting out of this hole, thanks to Nick's post regarding
osd_snap_trim_sleep and to the expertise of RHCS support.
If the PG calc had considered not only the pools' weight but also the
number of expected objects in each pool (which we knew at that time), we
wouldn't have hit these 2 bugs.
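
As a rough illustration of what we mean, the sketch below raises a
size-based recommendation whenever the expected object count would
exceed a per-PG cap. The function names and the cap of 100,000 objects
per PG are purely hypothetical values picked for the example, not
figures from Ceph or from the existing calculators. A real calculator
would also have to weigh the result against the total number of PGs per
OSD.

```python
import math


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def object_aware_pg_count(size_based_pgs: int,
                          expected_objects: int,
                          max_objects_per_pg: int) -> int:
    """Take the larger of the size-based PG count and the PG count
    needed to keep the average objects per PG under the given cap.

    max_objects_per_pg is a hypothetical tuning knob, not a Ceph option.
    """
    if max_objects_per_pg <= 0:
        raise ValueError("max_objects_per_pg must be positive")
    needed = math.ceil(expected_objects / max_objects_per_pg)
    return max(size_based_pgs, next_power_of_two(needed))


# Pool #3: 128 PGs recommended from its data share, 120M objects expected.
print(object_aware_pg_count(128, 120_000_000, 100_000))
```

With these example inputs the object count, not the data share, drives
the result for pool #3.
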
We hope this will help improve the ceph.com and RHCS PG calculators.
Regards,
Frédéric.
--
Frédéric Nass
Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine
Tél : +33 3 72 74 11 35