Re: PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"

On 25/05/18 20:26, Paul Emmerich wrote:
> Answers inline.
> 
>> 2018-05-25 17:57 GMT+02:00 Jesus Cea <jcea@xxxxxxx>:
>> recommendation. It would be nice to know too whether being "close" to a
>> power of two is better than being far away, and whether it is better to
>> be close but below or close but a little bit above. If the ideal value
>> is 128 but I can only have 120 or 130, what should I choose? 120 or
>> 130? Why?
> 
> Go for the next larger power of two under the assumption that your 
> cluster will grow.

I now know better. Check my other emails.

Not being a power of two always creates imbalance. You cannot overcome
that.

If you are close to a power of two but under it (120), most of your PGs
will be of size X and a few of them will be of size 2*X.

If you are close to a power of two but over it (130), most of your PGs
will be of size X and a few of them will be of size X/2.
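
A rough way to see this is to replay the folding Ceph does and count how
much of the hash space each PG ends up with. The sketch below is my own
Python re-implementation of the stable_mod trick as I understand it from
the source, a model only, not the actual OSDMap code path:

# Minimal sketch, assuming the stable_mod folding Ceph uses to map a raw
# hash slot onto pg_num PGs; bmask is next_power_of_two(pg_num) - 1.

def stable_mod(x, b, bmask):
    return x & bmask if (x & bmask) < b else x & (bmask >> 1)

def pg_shares(pg_num):
    bmask = 1
    while bmask < pg_num:
        bmask <<= 1
    bmask -= 1
    shares = {}
    for slot in range(bmask + 1):            # walk the whole folded hash space
        pg = stable_mod(slot, pg_num, bmask)
        shares[pg] = shares.get(pg, 0) + 1
    return shares

for pg_num in (120, 128, 130):
    shares = pg_shares(pg_num)
    histogram = {}
    for share in shares.values():
        histogram[share] = histogram.get(share, 0) + 1
    print(pg_num, histogram)

# Output:
#   120 {1: 112, 2: 8}     112 PGs of size X, 8 PGs of size 2*X
#   128 {1: 128}           perfectly balanced
#   130 {1: 4, 2: 126}     4 PGs of size X/2, 126 PGs of size X

So 128 stays perfectly even, while 120 and 130 both carry a small built-in
imbalance, just in opposite directions.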

>> 3. Is there any negative effect for CRUSH of using erasure code 8+2 
>> instead of 6+2 or 14+2 (power of two)?. I have 25 OSDs, so requiring
>> 16 for a single operation seems a bad idea, even more when my OSD 
>> capacities are very spread (from 150 GB to 1TB) and filling a small
>> OSD would block writes in the entire pool.
> 
> EC rules don't have to be powers of two. And yes, too many chunks
> for EC pools is a bad idea. It's rarely advisable to have a total of
> k + m larger than 8 or so.

I verified it. My objects are a fixed 4 MB in size and immutable (no
rewrites), so each OSD provides 512 KB per object. Seems fine. I could
even use wider EC codes in my particular environment.

If your objects are small, the requests per OSD will be tiny and
performance will suffer. You would be better off using narrower EC codes.
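
For what it is worth, this is the quick arithmetic behind that, assuming a
4 MB object that fits in a single stripe (stripe_unit details ignored):

# Quick chunk-size check for a few EC profiles and a 4 MB immutable object.

OBJECT_SIZE = 4 * 1024 * 1024  # bytes

for k, m in ((6, 2), (8, 2), (14, 2)):
    chunk_kib = OBJECT_SIZE / k / 1024
    print(f"EC {k}+{m}: touches {k + m} OSDs, ~{chunk_kib:.0f} KiB per chunk")

# EC 6+2: touches 8 OSDs, ~683 KiB per chunk
# EC 8+2: touches 10 OSDs, ~512 KiB per chunk
# EC 14+2: touches 16 OSDs, ~293 KiB per chunk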

> Also, you should have at least k + m + 1 servers, otherwise full
> server failures cannot be handled properly.

Good advice, of course. "crush-failure-domain=host" (or a bigger failure
domain) is also important, if you have enough resources.

> A large spread between the OSD capacities within one crush rule is
> also usually a bad idea, 150 GB to 1 TB is typically too big.

I know. Legacy sins. I spend my days reweighting.

> Well, you reduced the number of PGs by a factor of 64, so you'll of
> course see a large skew here. The option mon_pg_warn_max_object_skew 
> controls when this warning is shown, default is 10.

So you are advising me to increase that value to silence the warning?

What I am thinking is that mixing regular replicated pools and EC pools in
the same cluster will always generate this "warning". It is almost a
natural effect.
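
My mental model of the check (I have not re-read the code, so take this as
an assumption): a pool is flagged when its objects-per-PG exceed
mon_pg_warn_max_object_skew times the average objects-per-PG across the
cluster. With invented pool figures roughly shaped like mine:

# Back-of-the-envelope model of the skew warning. The pool numbers are
# made up, only the shape of the comparison matters.

mon_pg_warn_max_object_skew = 10   # Ceph default

pools = {
    "repl-a": {"objects": 240_000, "pg_num": 128},   # ~1,875 obj/PG
    "repl-b": {"objects": 240_000, "pg_num": 128},
    "repl-c": {"objects": 240_000, "pg_num": 128},
    "ec-8-2": {"objects": 240_000, "pg_num": 8},     # ~30,000 obj/PG
}

average = sum(p["objects"] for p in pools.values()) / \
          sum(p["pg_num"] for p in pools.values())

for name, p in pools.items():
    per_pg = p["objects"] / p["pg_num"]
    flag = "WARN" if per_pg > mon_pg_warn_max_object_skew * average else "ok"
    print(f"{name}: {per_pg:8.0f} obj/PG (average {average:.0f}) -> {flag}")

With those invented numbers the EC pool trips the default skew of 10 even
though nothing is wrong with it, which is why I suspect the warning is
unavoidable (or the threshold has to be raised) when replicated and EC
pools share a cluster.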

>> What is the actual memory hungry factor in a OSD, PGs or objects
>> per PG?.
> 
> PGs typically impose a bigger overhead. But PGs with a large number
> of objects can become annoying...

I find this difficult to believe, but you have far more experience with
Ceph than I do. Do you have any reference I can learn the details from?
Besides the source code :-).

Using EC will inevitably create PGs with a large number of objects.

My pools hold around 240,000 immutable 4 MB objects (~1 TB). A replicated
pool would be configured with 128 PGs, each PG holding 1,875 objects, 7.5 GB.

The same pool using EC 8+2 would use 13 PGs (internally that amounts to 130
"pseudo PGs", close to the original 128). Spare me the power of two rule
for now. 240,000 objects in 13 PGs is 18,461 objects per PG, 92 GB
(74 * 10/8); internally each PG would be stored on 10 OSDs, each providing
9.2 GB. I am actually using 8 PGs, so in my configuration it is more in the
range of 30,000 objects per PG, 150 GB per PG, 15 GB per OSD per PG.

This compares badly with the original 1,875 objects per PG, although each
OSD used to take care of 7.5 GB and now it has only grown to 15 GB.
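
To make the numbers reproducible, here is the same arithmetic as a small
Python sketch (decimal GB to match the figures above, 4 MB objects, EC 8+2):

# The arithmetic above, spelled out: 240,000 immutable 4 MB objects.

objects = 240_000
obj_gb = 0.004                      # 4 MB per object, decimal GB
k, m = 8, 2

# Replicated pool, 128 PGs: every OSD holding the PG stores the whole PG.
repl_obj_per_pg = objects / 128                          # 1,875
print(f"replicated: {repl_obj_per_pg:.0f} obj/PG, "
      f"{repl_obj_per_pg * obj_gb:.1f} GB per OSD per PG")

# EC 8+2 pool: total stored is data * (k+m)/k, spread over k+m OSDs.
for pg_num in (13, 8):
    obj_per_pg = objects / pg_num
    data_gb = obj_per_pg * obj_gb
    stored_gb = data_gb * (k + m) / k
    per_osd_gb = stored_gb / (k + m)                     # = data_gb / k
    print(f"EC 8+2, pg_num={pg_num}: {obj_per_pg:.0f} obj/PG, "
          f"{stored_gb:.0f} GB stored/PG, {per_osd_gb:.1f} GB per OSD per PG")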

Is 30,000 objects per PG an issue? What price am I paying here?

Can I do something to improve the situation? Increasing pg_num to 16 would
be better, but not by much, and going to 32 would push the PG count per OSD
well beyond the advised limit of <500 PGs per OSD, considering that I have
quite a few of these EC pools.
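
For the per-OSD count I use the usual rough estimate (each PG of an EC pool
lands on k+m OSDs, each PG of a replicated pool on `size` OSDs); the pool
list below is invented, only to show the shape of the calculation on 25 OSDs:

# Rough PGs-per-OSD estimate; the pool list is made up, only the formula
# (pg_num * width summed over pools, divided by the OSD count) matters.

num_osds = 25

pools = [
    {"name": "ec-a", "pg_num": 32, "width": 10},   # EC 8+2 -> 10 shards/PG
    {"name": "ec-b", "pg_num": 32, "width": 10},
    {"name": "ec-c", "pg_num": 32, "width": 10},
    {"name": "repl", "pg_num": 128, "width": 3},   # replicated, size 3
]

pg_shards = sum(p["pg_num"] * p["width"] for p in pools)
print(f"~{pg_shards / num_osds:.0f} PG shards per OSD on average")

Plug in the real pool list and OSD count to see how close a given pg_num
change gets to the advised ceiling.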

Advice?

Thanks!

-- 
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz

Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
