Re: Increasing number of PGs by not a factor of two?

OK, I am writing this so you don't waste your time correcting me. I beg
your pardon.


On 25/05/18 18:28, Jesus Cea wrote:
> So, if I understand correctly, Ceph tries to do the minimum number of
> splits. If you increase the PG count from 8 to 12, it will split 4 PGs
> and leave the other 4 PGs alone, creating an imbalance.
> 
> According to that, it would be far more advisable to create the pool
> with 12 PGs from the very beginning.
> 
> If I understand correctly, then, the "power of two" advice is an
> oversimplification. The real advice would be: you had better double your
> PG count whenever you increase it. That is: 12->24->48->96... No real
> need for a power of two.

Instead of trying to be smart, I just spent a few hours building a Ceph
experiment myself, testing different scenarios, PG resizing and such.

The "power of two" rule is law.

If you don't follow it, some PGs will contain double the number of
objects of others.

The rule is something like this:

Let's say your PG_num satisfies:

2^(n-1) < PG_num <= 2^n

The name of the object you create is hashed, and "n" bits of that hash
are considered. Let's call that number x.

If x < PG_num, your object will be stored in PG number x.

If x >= PG_num, then drop a bit (use only "n-1" bits of the hash) and the
result is the PG that will store your object.

This algorithm means that if your PG_num is not a power of two, some of
your PGs will be double the size of the others.

For instance, suppose PG_num = 13 (the first number is the "x" of your
object, the second number is the PG used to store it):

0 -> 0    1 -> 1    2 -> 2    3 -> 3
4 -> 4    5 -> 5    6 -> 6    7 -> 7
8 -> 8    9 -> 9   10 -> 10  11 -> 11
12 -> 12

Good so far. But now:

13 -> 5  14 -> 6   15 -> 7

So, PGs 0-4 and 8-12 will store "COUNT" objects, but PGs 5, 6 and 7
will store "2*COUNT" objects. PGs 5, 6 and 7 have twice the probability
of storing your object.
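
Here is a minimal Python sketch of that rule, written from scratch just
to illustrate it (it is not Ceph's actual code, and the function name
pg_for_hash is made up). It reproduces the mapping above and counts how
many hash values land in each PG:

def pg_for_hash(x, pg_num):
    """Map a hash value x to a PG following the rule described above."""
    n = (pg_num - 1).bit_length()      # smallest n with PG_num <= 2^n
    x &= (1 << n) - 1                  # keep only the low n bits of the hash
    if x < pg_num:
        return x                       # fits directly: PG = x
    return x & ((1 << (n - 1)) - 1)    # drop a bit: keep only n-1 bits

pg_num = 13
counts = {}
for x in range(1 << 4):                # the 16 possible 4-bit hash values
    pg = pg_for_hash(x, pg_num)
    counts[pg] = counts.get(pg, 0) + 1

print(counts)
# -> {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2,
#     8: 1, 9: 1, 10: 1, 11: 1, 12: 1}
# PGs 5, 6 and 7 receive two hash values each: twice the lottery tickets.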

Interestingly, the maximum object count difference between the biggest
PG and the smallest PG will be a factor of TWO. Statistically.
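
A quick simulation supports that. Reusing pg_for_hash from the sketch
above and hashing made-up object names with Python's hashlib (just for
illustration; Ceph uses a different hash function, and placement then
goes through CRUSH), the fullest PG ends up with roughly twice the
objects of the emptiest one:

import hashlib

pg_num = 13
counts = [0] * pg_num
for i in range(100000):
    name = "object-%d" % i
    # Take 32 bits of an MD5 digest as a stand-in for the object name hash.
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
    counts[pg_for_hash(h, pg_num)] += 1

print(counts)                      # PGs 5, 6 and 7 hold about twice as many objects
print(max(counts) / min(counts))   # close to 2.0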

How important it is that PG sizes are the same is something I am not
sure I understand.

> Also, a bad split is not important if the pool creates/destroys objects
> constantly, because new objects will be spread evenly. This could be an
> approach to rebalance a badly expanded pool: just copy & rename your
> objects (I am thinking about cephfs).
> 
> Does what I am saying make sense?

I answer myself.

No, fool, it doesn't make sense. Ceph doesn't work that way. The PG
allocation is far simpler and more scalable, but also dumber. The
imbalance depends only on the number of PGs (which should be a power of
two), not on the process used to get there.

The described idea doesn't work because if the number of PGs is not a
power of two, some PGs simply get twice the lottery tickets and will end
up with double the number of objects. Copying, moving or replacing
objects will not change that.

> How Ceph decide what PG to split?. Per PG object count or by PG byte size?.

Following the algorithm described at the top of this post, Ceph will
simply split PGs in increasing order. If my PG_num is 13 and I increase
it to 14, Ceph will split PG 5. Fully deterministic, and unrelated to the
size of that PG or how many objects it stores.
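
Reusing the pg_for_hash sketch from above (same caveats: an illustration
of the rule, not Ceph's code), we can check which objects move when
PG_num goes from 13 to 14:

for x in range(1 << 4):            # the 16 possible 4-bit hash values
    before = pg_for_hash(x, 13)
    after = pg_for_hash(x, 14)
    if before != after:
        print("hash %d: PG %d -> PG %d" % (x, before, after))

# Prints only "hash 13: PG 5 -> PG 13": the objects of PG 5 whose hash is
# 13 move to the new PG 13, i.e. PG 5 is the PG that gets split.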

Since the association of an object with a PG is based on the hash of the
object name, we would expect every PG to have (statistically) the same
number of objects. Object size is not taken into account, so a huge
object will create a huge PG. This is a well-known problem in Ceph (a few
large objects will imbalance your cluster).

> Thanks for your post. It deserves to be a blog!

The original post was great. My reply was lame. I was just too smart for
my own good :).

Sorry for wasting your time.

-- 
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz

