Re: Increasing number of PGs by not a factor of two?

By sticking with a power-of-two PG count (1024, 16384, etc.), all of your PGs will be the same size, which makes them easier to balance and manage.  With a count that isn't a power of two you get something like this: say you have 4 PGs that are each 2GB in size.  If you increase pg(p)_num to 6, you end up with 2 PGs at 2GB and 4 PGs at 1GB, because 2 of the PGs are split into 4 to reach the total of 6.  If you increase pg(p)_num to 8, all 8 PGs will be 1GB.  Depending on how you manage your cluster that may not matter, but for some methods of balancing a cluster it will greatly imbalance things.
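
If you want to eyeball how uneven the PGs in a pool already are, something like the snippet below works for me.  It's a rough sketch rather than anything official: the column layout of "ceph pg ls-by-pool" varies between releases (BYTES is the 7th column on my Luminous cluster), so check the header line and adjust the field number for your version.

pool=rbd
# Print the PG count and min/avg/max PG size for the pool.
# $7 is the BYTES column on Luminous; adjust if your release orders the columns differently.
ceph pg ls-by-pool $pool | awk 'NR>1 && $7 ~ /^[0-9]+$/ {
  n++; sum+=$7; if (min=="" || $7<min) min=$7; if ($7>max) max=$7
} END { if (n) printf "PGs: %d  min: %.2f GB  avg: %.2f GB  max: %.2f GB\n",
                      n, min/2^30, sum/n/2^30, max/2^30 }'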

This would be a good time to go to a power of two.  I think you're thinking of Gluster, where if you have 4 bricks and want to increase capacity, going to anything other than a multiple of 4 (8, 12, 16) kills performance (even more than expanding storage already does) and takes longer, because the data has to be redivided awkwardly instead of each existing brick simply being split across multiple new bricks.

As you increase your PGs, do it slowly and in a loop.  I like to increase my PGs by 256 at a time, wait for all PGs to create, activate, and peer, and rinse/repeat until I reach my target.  [1] is an example of a script that should accomplish this without interference.  Notice the use of flags while increasing the PGs: without them, an OSD OOMing or dying for any reason mid-change adds to the peering that has to happen and makes everything take much longer.  It is also wasted IO to start backfilling while you're still making changes; it's best to wait until you've finished increasing your PGs and everything has peered before you let data start moving.

Another thing to keep in mind is how long your cluster will be moving data around.  Increasing the PG count on a pool full of data is one of the most intensive operations you can ask a cluster to do.  The last time I had to do this, I increased pg(p)_num by 4k PGs at a time, going from 16k to 32k: bump, let it backfill, rinse/repeat until the desired PG count was reached.  For me, each 4k of PGs took 3-5 days to backfill, depending on other cluster load and how full the cluster was.  If you decide to increase your PGs 4k at a time instead of doing the full increase in one go, change the 16384 in [1] to the number you're going to for that round, let it backfill, and continue ([2] below sketches the wait-for-backfill step).


[1]
# Make sure to set pool variable as well as the number ranges to the appropriate values.
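# nodown/nobackfill/norecover keep OSDs from being marked down and stop data movement from starting while the PG changes are still in flight.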
flags="nodown nobackfill norecover"
for flag in $flags; do
  ceph osd set $flag
done
pool=rbd
echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
# Brace expansion: start at the first multiple of 256 above the pool's current pg_num (7200 here), step by 256 each pass, and end exactly on the 16384 target.
# Adjust these values for your pool; the sequence has to land on your target or the loop will stop short of it.
for num in {7424..16384..256}; do
  ceph osd pool set $pool pg_num $num
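  # Wait for the newly created PGs to finish creating, activating, and peering before touching pgp_num.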
  while sleep 10; do
    ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
  ceph osd pool set $pool pgp_num $num
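  # Raising pgp_num is what actually remaps data to the new PGs; wait for peering to settle again before the next increment.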
  while sleep 10; do
    ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
done
for flag in $flags; do
  ceph osd unset $flag
done
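
[2] If you go the 4k-at-a-time route, the flow is: run [1] with the end of the range set to current+4k, unset the flags, let backfill finish, then repeat with the next 4k.  Below is a rough sketch of the "wait for backfill" step; the grep terms assume Luminous-era "ceph health" wording, so treat it as a starting point rather than a drop-in.

# Block until the backfill/recovery triggered by the previous pg(p)_num bump has finished.
while sleep 60; do
  ceph health | grep -qi 'backfill\|recover\|degraded\|misplaced\|peering' || break
done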

On Thu, May 17, 2018 at 9:27 AM Kai Wagner <kwagner@xxxxxxxx> wrote:
Hi Oliver,

a good value is 100-150 PGs per OSD. So in your case between 20k and 30k.

You can increase your PGs, but keep in mind that this will keep the
cluster quite busy for a while. That said, I would rather increase in
smaller steps than in one large move.

Kai


On 17.05.2018 01:29, Oliver Schulz wrote:
> Dear all,
>
> we have a Ceph cluster that has slowly evolved over several
> years and Ceph versions (started with 18 OSDs and 54 TB
> in 2013, now about 200 OSDs and 1.5 PB, still the same
> cluster, with data continuity). So there are some
> "early sins" in the cluster configuration, left over from
> the early days.
>
> One of these sins is the number of PGs in our CephFS "data"
> pool, which is 7200 and therefore not (as recommended)
> a power of two. Pretty much all of our data is in the
> "data" pool, the only other pools are "rbd" and "metadata",
> both contain little data (and they have way too many PGs
> already, another early sin).
>
> Is it possible - and safe - to change the number of "data"
> pool PGs from 7200 to 8192 or 16384? As we recently added
> more OSDs, I guess it would be time to increase the number
> of PGs anyhow. Or would we have to go to 14400 instead of
> 16384?
>
>
> Thanks for any advice,
>
> Oliver
>

--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
