By sticking with a power of 2 for your PG count
(1024, 16384, etc.), all of your PGs will be the same size,
which makes them easier to balance and manage. Here's what
happens with a non-power-of-2 count. Say you have 4 PGs that
are all 2GB in size. If you increase pg(p)_num to 6, you will
have 2 PGs that are 2GB and 4 PGs that are 1GB, because 2 of
the original PGs were each split in half to get to the 6
total. If you increase pg(p)_num to 8 instead, all 8 PGs will
be 1GB. Depending on how you manage your cluster, that may
not matter, but for some methods of balancing your cluster,
it will greatly imbalance things.
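The arithmetic above can be sketched as a toy shell snippet
(the 4/6/2GB numbers are just the example from the text, not
anything queried from a cluster):

```shell
# Toy arithmetic (not a Ceph command) showing the split described above:
# raising pg_num from 4 to 6 splits 2 of the 4 old PGs in half.
old=4; new=6; size_gb=2
split=$((new - old))       # old PGs that get split in half
unsplit=$((old - split))   # old PGs left at their original size
echo "$unsplit PGs at ${size_gb}GB, $((2 * split)) PGs at $((size_gb / 2))GB"
# prints: 2 PGs at 2GB, 4 PGs at 1GB
```

Going all the way to 8 makes split equal to old, so every PG
ends up at the same 1GB size again.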
This would be a good time to move to a power of 2. I
think you're thinking of Gluster, where if you have 4 bricks
and you want to increase your capacity, going to anything
other than a multiple of 4 (8, 12, 16) kills performance
(even worse than expanding storage already does) and takes
longer, because the data has to be redistributed awkwardly
across bricks instead of a single brick being split out to
multiple new bricks.
As you increase your PGs, do it slowly and in a loop. I
like to increase my PGs by 256, wait for all PGs to create,
activate, and peer, then rinse/repeat until I reach my target.
[1] is an example of a script that should accomplish this
without interference. Notice the use of flags while
increasing the PGs: if an OSD OOMs or dies for any reason
mid-change, it adds to the peering that needs to happen and
makes everything take much longer. It would also be wasted IO
to start backfilling while you're still making changes; it's
best to wait until you've finished increasing your PGs and
everything has peered before you let data start moving.
Another thing to keep in mind is how long your cluster will
be moving data around. Increasing the PG count on a pool
full of data is one of the most intensive operations you can
ask a cluster to perform. The last time I had to do this, I
increased pg(p)_num by 4k PGs at a time, from 16k to 32k:
bump by 4k, let it backfill, rinse/repeat until the desired
PG count was achieved. For me, each 4k step took 3-5 days
depending on other cluster load and how full the cluster was.
If you do decide to increase your PGs by 4k at a time instead
of doing the full increase at once, change the 16384 in the
script to the number you decide to step to, let it backfill,
and continue.
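The stepped approach can be sketched like this. It's a dry
run: the actual ceph commands are left commented out, and the
rbd pool name plus the 16k/32k/4k numbers are placeholders
for your own current count, target, and step size:

```shell
#!/bin/sh
# Sketch of the stepped increase described above: raise pgp_num by 4096
# at a time and let backfill drain between steps.  The numbers and pool
# name are placeholders; the real cluster commands are commented out so
# this prints the plan without touching anything.
pool=rbd
num=20480      # first step above the current 16384
target=32768
step=4096
while [ "$num" -le "$target" ]; do
    echo "would set $pool pgp_num to $num, then wait for backfill"
    # ceph osd pool set $pool pgp_num $num
    # Wait until health no longer reports data movement, e.g.:
    # while ceph health | grep -q 'backfill\|recover'; do sleep 60; done
    num=$((num + step))
done
```

Unlike a bash brace expansion with a step that doesn't divide
the range evenly, this arithmetic loop always lands exactly on
the target before stopping.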
[1]
# Make sure to set the pool variable as well as the number
# ranges to the appropriate values.
flags="nodown nobackfill norecover"
for flag in $flags; do
    ceph osd set $flag
done

pool=rbd
echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"

# The first number is your current PG count for the pool, the second
# is the target PG count, and the third is how much to increase it by
# each time through the loop.  The target is repeated after the brace
# expansion because the expansion stops short of it when the step
# doesn't divide the range evenly ({7700..16384..256} ends at 16148).
for num in {7700..16384..256} 16384; do
    ceph osd pool set $pool pg_num $num
    while sleep 10; do
        ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
    done
    ceph osd pool set $pool pgp_num $num
    while sleep 10; do
        ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
    done
done

for flag in $flags; do
    ceph osd unset $flag
done