Re: Increasing number of PGs by not a factor of two?


 



If you start your pool with 12 PGs, 4 of them will be double the size of the other 8.  It is 100% about powers of 2 and has absolutely nothing to do with the number you start with vs. the number you increase to.  If your PG count is not a power of 2, you will have 2 different sizes of PGs, with some being double the size of the others.

When increasing your PG count, Ceph chooses which PGs to split in half based on the PG ID, not on how big the PG is or how many objects it has.  The PG IDs are derived from how many PGs you have in the pool, and the object space is split perfectly evenly across them if and only if your PG count is a power of 2.
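
(To make the mechanism concrete: below is a rough Python model of that placement logic, patterned after the ceph_stable_mod() helper in the Ceph source tree.  The seed range and the function name here are illustrative, not the exact implementation.)

    from collections import Counter

    def stable_mod(ps, pg_num):
        # Fold a placement seed into pg_num buckets the way Ceph's
        # ceph_stable_mod() does: seeds that land past pg_num wrap
        # back into the lower half of the ID space.
        mask = (1 << (pg_num - 1).bit_length()) - 1   # e.g. 15 for pg_num = 12
        return ps & mask if (ps & mask) < pg_num else ps & (mask >> 1)

    counts = Counter(stable_mod(ps, 12) for ps in range(2 ** 16))
    print(sorted(counts.items()))
    # PGs 4-7 each receive twice as many seeds as PGs 0-3 and 8-11:
    # exactly the "4 PGs double the size of the other 8" described above.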

Once upon a time I started a 2-rack cluster with 12,000 PGs.  All of the data was in one pool, and I attempted to balance the cluster by making sure that every OSD was within 2 PGs of every other.  That is to say, if the average was 100 PGs per OSD, then no OSD had more than 101 PGs for that pool and none had fewer than 99.  My tooling made this possible and is how we balanced our other clusters.  The resulting balance in this cluster was AWFUL!!!  Digging in, I found that some of the PGs were twice as big as the others, and the split was perfectly predictable: of the 12,000 PGs, 4,384 were twice as big as the remaining 7,616.  We increased the PG count of the pool to 16,384, and once the backfilling finished all of the PGs were the same size.
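
(Those counts fall straight out of the arithmetic.  A minimal sketch, with the pool sizes above plugged in; the helper name is just for illustration:)

    def pg_size_split(pg_num):
        # How many PGs end up half-size vs. double-size for a given pg_num,
        # assuming the power-of-two splitting behaviour described above.
        lower = 1 << (pg_num.bit_length() - 1)   # power of two at or below pg_num
        split = pg_num - lower                   # parents split in half to reach pg_num
        return 2 * split, lower - split          # (half-size PGs, double-size PGs)

    print(pg_size_split(12000))   # (7616, 4384): 4,384 PGs twice the size of the rest
    print(pg_size_split(16384))   # (0, 16384): a power of two, so every PG is the same size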

On Fri, May 25, 2018 at 12:48 PM Jesus Cea <jcea@xxxxxxx> wrote:
On 17/05/18 20:36, David Turner wrote:
> By sticking with PG counts that are powers of 2 (1024, 16384, etc.), all of
> your PGs will be the same size and easier to balance and manage.  What
> happens when you have a non-power-of-2 count is something like this.  Say
> you have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to
> 6, then you will have 2 PGs that are 2GB and 4 PGs that are 1GB, as
> you've split 2 of the PGs into 4 to get to the 6 total.  If you increase
> pg(p)_num to 8, then all 8 PGs will be 1GB.  Depending on how you
> manage your cluster, that may not really matter, but for some methods of
> balancing your cluster, it will greatly imbalance things.
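
(As an aside, the 2GB/1GB outcome in that example can be reproduced with the same stable_mod model sketched near the top of this page; the 8GB of uniformly hashed objects and the seed range are hypothetical:)

    from collections import Counter

    def stable_mod(ps, pg_num):
        mask = (1 << (pg_num - 1).bit_length()) - 1
        return ps & mask if (ps & mask) < pg_num else ps & (mask >> 1)

    total_gb = 8                      # 4 PGs x 2GB from the example above
    seeds = range(2 ** 16)
    for pg_num in (4, 6, 8):
        counts = Counter(stable_mod(ps, pg_num) for ps in seeds)
        sizes = sorted(round(total_gb * n / len(seeds), 1) for n in counts.values())
        print(pg_num, sizes)
    # 4 [2.0, 2.0, 2.0, 2.0]
    # 6 [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]   <- two PGs stay at 2GB
    # 8 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]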

So, if I understand correctly, Ceph tries to do the minimum number of
splits. If you increase the PG count from 8 to 12, it will split 4 PGs and
leave the other 4 alone, creating an imbalance.

According to that, it would be far more advisable to create the pool with
12 PGs from the very beginning.

If I understand correctly, then, the "power of two" advice is an
oversimplification. The real advice would be: you had better double your PG
count whenever you increase it. That is, 12->24->48->96... with no real need
for a power of two.

Also, a bad split would not matter much if the pool creates/destroys objects
constantly, because new objects would be spread evenly. This could be an
approach to rebalancing a badly expanded pool: just copy & rename your
objects (I am thinking about CephFS).

Does what I am saying make sense?

How does Ceph decide which PGs to split? By per-PG object count or by PG byte size?

Thanks for your post. It deserves to be a blog post!

--
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
