OK, I am writing this so you don't waste your time correcting me. I beg your pardon.

On 25/05/18 18:28, Jesus Cea wrote:
> So, if I understand correctly, ceph tries to do the minimum splits. If
> you increase PG from 8 to 12, it will split 4 PGs and leave the other 4
> PGs alone, creating an imbalance.
>
> According to that, would be far more advisable to create the pool with
> 12 PGs from the very beginning.
>
> If I understand correctly, then, the advice of "power of two" is an
> oversimplification. The real advice would be: you better double your PG
> when you increase the PG count. That is: 12->24->48->96... Not real need
> for power of two.

Instead of trying to be smart, I just spent a few hours building a Ceph experiment myself, testing different scenarios, PG resizing and such.

The "power of two" rule is law. If you don't follow it, some PGs will contain twice as many objects as others.

The rule is something like this:

Let's say your PG_num satisfies 2^(n-1) < PG_num <= 2^n.

The name of the object you create is hashed and "n" bits of that hash are taken. Let's call that number x. If x < PG_num, your object is stored under PG number x. If x >= PG_num, drop a bit (use only "n-1" bits) and that is the PG that will store your object.

This algorithm means that if your PG_num is not a power of two, some of your PGs will be double size. For instance, suppose PG_num = 13 (the first number is the "x" of your object, the second number is the PG used to store it):

0 -> 0
1 -> 1
2 -> 2
3 -> 3
4 -> 4
5 -> 5
6 -> 6
7 -> 7
8 -> 8
9 -> 9
10 -> 10
11 -> 11
12 -> 12

Good so far. But now:

13 -> 5
14 -> 6
15 -> 7

So PGs 0-4 and 8-12 will store "COUNT" objects, but PGs 5, 6 and 7 will store "2*COUNT" objects: PGs 5, 6 and 7 have twice the probability of storing your object. Interestingly, the maximum difference in object count between the biggest PG and the smallest PG will be, statistically, a factor of TWO. (There is a small Python sketch of this mapping, and of the split behaviour discussed below, at the end of this mail.)

How important it is for PG sizes to be equal is something I am not sure I understand.

> Also, a bad split is not important if the pool creates/destroys objects
> constantly, because new objects will be spread evenly. This could be an
> approach to rebalance a badly expanded pool: just copy & rename your
> objects (I am thinking about cephfs).
>
> What am I saying makes sense?.

I answer to myself. No, fool, it doesn't make sense. Ceph doesn't work that way. The PG allocation is far simpler and more scalable, but also dumber. The imbalance depends only on the number of PGs (it should be a power of two), not on the process used to get there. The described idea doesn't work because if the PG count is not a power of two, some PGs simply get twice the lottery tickets and will end up with twice as many objects. Copying, moving or replacing objects will not change that.

> How Ceph decide what PG to split?. Per PG object count or by PG byte size?.

Following the algorithm described at the top of this post, Ceph simply splits PGs in increasing order. If my PG_num is 13 and I increase it to 14, Ceph will split PG 5. This is fully deterministic and unrelated to the size of the PG or to how many objects it stores.

Since the association of an object with a PG is based on the hash of the object name, we would expect every PG to have (statistically) the same number of objects. Object size is not used here, so a huge object will create a huge PG. This is a well known problem in Ceph (a few large objects will imbalance your cluster).

> Thank for your post. It deserves to be a blog!.

The original post was great. My reply was lame. I was just too smart for my own good :). Sorry for wasting your time.
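To make the rule above concrete, here is the small Python sketch I promised. It is only an illustration: the md5 hash, the object names and the helper name "pg_for_object" are my own choices for the example (Ceph uses its own rjenkins-based hashing internally), but the "take n bits, fold back if it doesn't fit" part is exactly the rule described above.

import hashlib
from collections import Counter

def pg_for_object(name, pg_num):
    # Smallest n such that 2^(n-1) < pg_num <= 2^n.
    n = max(1, (pg_num - 1).bit_length())
    # Hash the object name and keep the low n bits: that is "x".
    h = int.from_bytes(hashlib.md5(name.encode()).digest(), "little")
    x = h & ((1 << n) - 1)
    # If x fits below pg_num, use it; otherwise drop a bit (keep n-1 bits).
    return x if x < pg_num else x & ((1 << (n - 1)) - 1)

# Spread 130000 object names over 13 PGs and count objects per PG.
counts = Counter(pg_for_object("obj-%d" % i, 13) for i in range(130000))
for pg in sorted(counts):
    print(pg, counts[pg])

Running it, PGs 5, 6 and 7 end up with roughly twice as many objects as the rest, because the hash values 13, 14 and 15 fold back onto them.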
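And, reusing pg_for_object from the sketch above, a quick check of the split behaviour: when PG_num grows from 13 to 14, every object that changes PG comes out of PG 5, no matter how big any PG is.

moved_from = {
    pg_for_object("obj-%d" % i, 13)
    for i in range(130000)
    if pg_for_object("obj-%d" % i, 13) != pg_for_object("obj-%d" % i, 14)
}
print(moved_from)  # prints {5}: only PG 5 is split, deterministically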
--
Jesús Cea Avión
jcea@xxxxxxx - http://www.jcea.es/
Twitter: @jcea
jabber / xmpp:jcea@xxxxxxxxxx
"Things are not so easy"
"My name is Dump, Core Dump"
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz