By sticking with a power of 2 for your PG count
(1024, 16384, etc.), all of your PGs will be the same size,
which makes them easier to balance and manage. Here's what
happens with a non-power-of-2 count. Say you have 4 PGs that
are all 2GB in size. If you increase pg(p)_num to 6, you will
have 2 PGs that are 2GB and 4 PGs that are 1GB, because 2 of
the original PGs were each split in half to get to the 6
total. If you increase pg(p)_num to 8 instead, all 8 PGs will
be 1GB. Depending on how you manage your cluster, that may
not matter, but for some methods of balancing your cluster,
it will greatly imbalance things.
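The arithmetic above can be sketched as a toy shell snippet
(the 4/6/2GB numbers are just the example from the text, not
anything queried from a cluster):

```shell
# Toy arithmetic (not a Ceph command) showing the split described above:
# raising pg_num from 4 to 6 splits 2 of the 4 old PGs in half.
old=4; new=6; size_gb=2
split=$((new - old))       # old PGs that get split in half
unsplit=$((old - split))   # old PGs left at their original size
echo "$unsplit PGs at ${size_gb}GB, $((2 * split)) PGs at $((size_gb / 2))GB"
# prints: 2 PGs at 2GB, 4 PGs at 1GB
```

Going all the way to 8 makes split equal to old, so every PG
ends up at the same 1GB size again.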
This would be a good time to move to a power of 2. I
think you're thinking of Gluster, where if you have 4 bricks
and you want to increase your capacity, going to anything
other than a multiple of 4 (8, 12, 16) kills performance
(even worse than expanding storage already does) and takes
longer, because the data has to be redistributed awkwardly
across bricks instead of a single brick being split out to
multiple new bricks.
As you increase your PGs, do it slowly and in a loop. I
like to increase my PGs by 256, wait for all PGs to create,
activate, and peer, then rinse/repeat until I reach my target.
[1] is an example of a script that should accomplish this
without interference. Notice the use of flags while
increasing the PGs: if an OSD OOMs or dies for any reason
mid-change, it adds to the peering that needs to happen and
makes everything take much longer. It would also be wasted IO
to start backfilling while you're still making changes; it's
best to wait until you've finished increasing your PGs and
everything has peered before you let data start moving.
Another thing to keep in mind is how long your cluster will
be moving data around. Increasing the PG count on a pool
full of data is one of the most intensive operations you can
ask a cluster to perform. The last time I had to do this, I
increased pg(p)_num by 4k PGs at a time, from 16k to 32k:
bump by 4k, let it backfill, rinse/repeat until the desired
PG count was achieved. For me, each 4k step took 3-5 days
depending on other cluster load and how full the cluster was.
If you do decide to increase your PGs by 4k at a time instead
of doing the full increase at once, change the 16384 in the
script to the number you decide to step to, let it backfill,
and continue.
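The stepped approach can be sketched like this. It's a dry
run: the actual ceph commands are left commented out, and the
rbd pool name plus the 16k/32k/4k numbers are placeholders
for your own current count, target, and step size:

```shell
#!/bin/sh
# Sketch of the stepped increase described above: raise pgp_num by 4096
# at a time and let backfill drain between steps.  The numbers and pool
# name are placeholders; the real cluster commands are commented out so
# this prints the plan without touching anything.
pool=rbd
num=20480      # first step above the current 16384
target=32768
step=4096
while [ "$num" -le "$target" ]; do
    echo "would set $pool pgp_num to $num, then wait for backfill"
    # ceph osd pool set $pool pgp_num $num
    # Wait until health no longer reports data movement, e.g.:
    # while ceph health | grep -q 'backfill\|recover'; do sleep 60; done
    num=$((num + step))
done
```

Unlike a bash brace expansion with a step that doesn't divide
the range evenly, this arithmetic loop always lands exactly on
the target before stopping.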
[1]
# Make sure to set the pool variable as well as the number
# ranges to the appropriate values.
flags="nodown nobackfill norecover"
for flag in $flags; do
    ceph osd set $flag
done

pool=rbd
echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"

# The first number is your current PG count for the pool, the second
# is the target PG count, and the third is how much to increase it by
# each time through the loop.  The target is repeated after the brace
# expansion because the expansion stops short of it when the step
# doesn't divide the range evenly ({7700..16384..256} ends at 16148).
for num in {7700..16384..256} 16384; do
    ceph osd pool set $pool pg_num $num
    while sleep 10; do
        ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
    done
    ceph osd pool set $pool pgp_num $num
    while sleep 10; do
        ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
    done
done

for flag in $flags; do
    ceph osd unset $flag
done