Re: increasing PG count - limiting disruption

There are a few factors to consider. I've gone from 16k PGs to 32k PGs before and learned some lessons.

The first and most immediate is the peering that happens when you increase the PG count. I like to increase the pg_num and pgp_num values slowly to mitigate this. Something like [1] should do the trick: it raises the PG count in small steps and waits for all peering and related activity to finish before continuing. It also waits out a few other states during which you shouldn't be doing maintenance like this.
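
For reference, you can check where the pool currently stands before you start. This is just a minimal sketch, and "your_pool" is a placeholder for your pool name:

# show the pool's current PG settings
ceph osd pool get your_pool pg_num
ceph osd pool get your_pool pgp_num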

The second is that mons do not compact their databases while any PG is in a non-"clean" state. That means that while your cluster is creating these new PGs and moving data around, your mon stores will grow with new maps until everything is healthy again. This is desired behavior to keep everything healthy in Ceph in the face of failures, BUT it means you need to be aware of how much space you have on your mons for the mon store to grow. When I was increasing from 16k to 32k PGs, we found we could only create 4k PGs at a time; in that cluster each batch took about two weeks to finish. When we tried to do more than that, our mons ran out of space and we had to add disks to the mons to move the mon stores onto so that the mons could continue to run.
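
A rough way to keep an eye on mon store growth while the PGs are splitting (the path below assumes the default mon data location; adjust for your deployment):

# size of each mon's store.db on that mon host
du -sh /var/lib/ceph/mon/*/store.db
# ceph will also raise a health warning once a store grows past mon_data_size_warn (15GB by default)
ceph health detail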

Finally, know that this is just going to take a while (depending on how much data is in your cluster and how full it is). Be patient. Either you increase max_backfills, lower the backfill sleep, and so on to make the backfilling go faster (at the cost of IOPS that clients then can't use), or you keep these throttled so they don't impact clients as much. Keep a good balance, though: putting off finishing the recovery for too long leaves your cluster in a riskier position for that much longer.
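
If you do decide to speed things up, the usual knobs are osd_max_backfills and the recovery/backfill sleep settings. The values here are purely illustrative, not recommendations:

# illustrative values only - tune for your own cluster and client load
ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 2 --osd-recovery-sleep 0.1'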

Good luck.



[1] *Note that I typed this in gmail and not copied from a script. Please test before using.
# hold off data movement until all the splits are done, then let backfill run once
ceph osd set nobackfill
ceph osd set norebalance

# wait until no PGs are peering, creating, inactive, etc.
function healthy_wait() {
  while ceph health | grep -q 'peering\|inactive\|activating\|creating\|down\|inconsistent\|stale'; do
    echo waiting for ceph to be healthier
    sleep 10
  done
}

pool=your_pool  # set this to the pool you are resizing

# step pg_num/pgp_num up in increments of 256
# (the first step is a no-op if the pool is already at 2048)
for count in {2048..4096..256}; do
  healthy_wait
  ceph osd pool set $pool pg_num $count
  healthy_wait
  ceph osd pool set $pool pgp_num $count
done
healthy_wait

ceph osd unset nobackfill
ceph osd unset norebalance

On Thu, Nov 14, 2019 at 11:19 AM Frank R <frankaritchie@xxxxxxxxx> wrote:
Hi all,

When increasing the number of placement groups for a pool by a large amount (say 2048 to 4096), is it better to go in small steps or all at once?

This is a filestore cluster.

Thanks,
Frank
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx