Re: PG explosion with erasure codes, power of two and "x pools have many more objects per pg than average"


 



Answers inline.

2018-05-25 17:57 GMT+02:00 Jesus Cea <jcea@xxxxxxx>:
Hi there.

I have configured a pool with an 8+2 erasure code. My target, by space
usage and OSD configuration, would be 128 PGs, but since each configured
PG will be using 10 actual "PGs", I have created the pool with only 8 PGs
(80 real PGs). Since I can increase the PG count but not decrease it, this
decision seems sensible.

Some questions:

1. Documentation insists everywhere that the PG count should be a power
of two. It would be nice to know the consequences of not following this
recommendation, and whether being "close" to a power of two is better than
being far away, and whether it is better to be slightly below or slightly
above. If the ideal value is 128 but I can only choose 120 or 130, which
should I pick? 120 or 130? Why?

Go for the next larger power of two under the assumption that your cluster will grow.
 

2. As I understand it, the PG count that should be a power of two is "8"
in this case (80 real PGs underneath). Good. The next step would then be
16 (160 real PGs). I would rather increase it to 12 or 13 (120/130 real
PGs). Would that be reasonable? What are the consequences of increasing
the PG count to 12 or 13 instead of choosing 16 (the next power of two)?

Data will be poorly balanced between PGs if it's not a power of two.
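
For some background on why it balances poorly: Ceph folds an object's hash
into pg_num buckets with a "stable mod" (mask to the next power of two, then
fold the overflow back down), so with a non-power-of-two pg_num some PGs cover
twice as much hash space as others. A rough Python sketch of that folding (the
simulation around ceph_stable_mod is mine, not a Ceph tool):

    # A minimal simulation of the "stable mod" folding Ceph uses to map an
    # object's hash to a PG id (a sketch; helper names are mine, only
    # ceph_stable_mod mirrors the real function).
    import random
    from collections import Counter

    def ceph_stable_mod(x, b, bmask):
        # Fold hash x into [0, b); bmask is the next power of two minus one.
        return x & bmask if (x & bmask) < b else x & (bmask >> 1)

    def objects_per_pg(pg_num, n_objects=1_000_000):
        bmask = (1 << (pg_num - 1).bit_length()) - 1   # e.g. 15 for pg_num=12
        counts = Counter(ceph_stable_mod(random.getrandbits(32), pg_num, bmask)
                         for _ in range(n_objects))
        return min(counts.values()), max(counts.values())

    for pg_num in (8, 12, 13, 16):
        low, high = objects_per_pg(pg_num)
        print(f"pg_num={pg_num:2d}: min/max objects per PG = {low}/{high}")
    # With 12 or 13 PGs, the PGs in the folded-back range end up with ~2x the
    # objects of the rest; with 8 or 16 the distribution is essentially even.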
 

3. Is there any negative effect on CRUSH from using an 8+2 erasure code
instead of 6+2 or 14+2 (power-of-two totals)? I have 25 OSDs, so requiring
16 of them for a single operation seems like a bad idea, even more so
because my OSD capacities vary widely (from 150 GB to 1 TB) and filling a
small OSD would block writes on the entire pool.

EC rules don't have to use power-of-two totals. And yes, too many chunks
in an EC pool is a bad idea; it's rarely advisable to have a total of k + m
larger than 8 or so.

Also, you should have at least k + m + 1 servers, otherwise full server
failures cannot be handled properly.

A large spread between OSD capacities within one CRUSH rule is also
usually a bad idea; 150 GB to 1 TB is typically too big a spread.
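
For a quick back-of-the-envelope comparison of the candidate profiles
mentioned above (just arithmetic, not output from any Ceph tool):

    # Space efficiency and minimum host count for a few EC profiles.
    # The k + m + 1 figure follows the rule of thumb above: one spare host so
    # that a full host failure can still be healed.
    for k, m in [(6, 2), (8, 2), (14, 2)]:
        total = k + m
        print(f"{k}+{m}: raw overhead {total / k:.2f}x, "
              f"usable {k / total:.0%}, minimum hosts {total + 1}")
    # 6+2:  raw overhead 1.33x, usable 75%, minimum hosts 9
    # 8+2:  raw overhead 1.25x, usable 80%, minimum hosts 11
    # 14+2: raw overhead 1.14x, usable 88%, minimum hosts 17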
 

4. Since I created the erasure-coded pool with 8 PGs, I am getting
warnings of "x pools have many more objects per pg than average". The
data I am copying comes from a legacy pool with 512 PGs; the new pool has
8. That results in ~30,000 objects per PG, far above the average (616
objects). What can I do? Moving to 16 or 32 PGs is not going to improve
the situation, but it will consume PGs (32*10). Any advice?

Well, you reduced the number of PGs by a factor of 64, so of course you'll
see a large skew here. The option mon_pg_warn_max_object_skew
controls when this warning is shown; the default is 10.
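
Roughly how that check works, as far as I understand it (a sketch of the
comparison, not the monitor's actual code; the pool totals below are only
illustrative):

    # A pool is flagged when its objects-per-PG exceeds the cluster-wide
    # average objects-per-PG by more than mon_pg_warn_max_object_skew.
    mon_pg_warn_max_object_skew = 10  # default

    pools = {
        # name: (total objects, pg_num) -- illustrative numbers only
        "legacy-replicated": (512 * 616, 512),
        "new-ec-8+2":        (8 * 30_000, 8),
    }

    avg_per_pg = (sum(o for o, _ in pools.values())
                  / sum(p for _, p in pools.values()))

    for name, (objects, pg_num) in pools.items():
        per_pg = objects / pg_num
        status = "WARN" if per_pg > avg_per_pg * mon_pg_warn_max_object_skew else "ok"
        print(f"{name}: {per_pg:.0f} objects/PG vs average {avg_per_pg:.0f} -> {status}")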
 

5. I understand the advice of having <300 PGs per OSD because of memory
usage, but I am wondering about the impact of the number of objects in
each PG. Memory- and resource-wise, is having 100 PGs with 10,000 objects
each far more demanding than 1,000 PGs with 50 objects each? Since I have
PGs with 300 objects and PGs with 30,000 objects, I wonder about the
memory impact of each. What is the actual memory-hungry factor in an OSD:
the number of PGs, or the objects per PG?

PGs typically impose the bigger overhead. But PGs with a very large number
of objects can also become annoying, e.g. during recovery and scrubbing.
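
On the <300 PGs per OSD guideline mentioned in the question: what each OSD
actually carries is PG instances, i.e. pg_num multiplied by the replica count
(or k+m for EC pools). A rough estimate for the pools in this thread, assuming
the legacy pool is replicated with size 3 (the mail doesn't say):

    # Rough PG-instances-per-OSD estimate (a sketch; size=3 for the legacy
    # pool is an assumption, only the 25 OSDs and the 8-PG 8+2 pool are
    # taken from the mail).
    num_osds = 25

    pools = [
        (512, 3),   # legacy pool: 512 PGs, assumed replicated size 3
        (8, 10),    # new EC pool: 8 PGs, k+m = 10 chunks each
    ]

    pg_instances = sum(pg_num * copies for pg_num, copies in pools)
    print(f"~{pg_instances / num_osds:.0f} PG instances per OSD "
          f"(guideline: fewer than ~300)")
    # ~65 PG instances per OSD with these assumptions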


Paul
 

Thanks for your time and knowledge :).

--
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz






--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
