Re: [EXTERNAL] Re: Increase PG number

David Turner <david.turner@xxxxxxxxxxxxxxxx> · Mon, 19 Sep 2016 15:45:53 +0000

We regrettably have to increase PGs in a Ceph cluster this way more often than anyone should ever need to.  As such, we have scripted it out.  A basic version of the script that should work for you is below.

First, create a function that checks for any PG states you don't want to proceed through (better than duplicating code).  Second, set the flags so your cluster doesn't die while you do this.  Third, set your current and destination PG counts in the for loop.  The loop ignores any number not divisible by 256.  As you've found, 256 is a good increment.  More than that and you'll run into issues of your cluster curling into a fetal position and crying.  The loop increases your pg_num, waits until everything has settled, then increases your pgp_num.  The seemingly excessive sleeps help the cluster resolve the blocked requests that will still happen during this.  Lastly, unset the flags to let the cluster start moving the data around.

One thing to note: in a cluster with 800-1000 HDD OSDs with SSD journals, going from 16k to 32k PGs, we set max backfills (osd_max_backfills) to 1 during busy times and 2 during idle times.  A value of more than 2 does not help us increase our PG count any faster.  We have tested values of 2 and 5; both took the entire weekend to add 4k PGs.  We also do not add all of the PGs at once.  We do 4k each weekend and 2k during the week, waiting for the cluster to finish each time so that our mon stores get a chance to compact before we continue.
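As a rough illustration of switching the backfill limit between busy and idle periods, here is a minimal sketch. The busy window (08:00 to 19:59) and the function names are assumptions made up for this example, not part of our tooling, and the command is only printed, not executed.

```shell
# Hypothetical busy window in local time: 08:00-19:59. Adjust to your own traffic.
BUSY_START=8
BUSY_END=20

# Print the osd_max_backfills value to use for a given hour (0-23).
backfills_for_hour() {
    local hour
    hour=${1#0}        # drop one leading zero ("08" -> "8")
    hour=${hour:-0}    # "0" becomes empty after the strip; restore it
    if [ "$hour" -ge "$BUSY_START" ] && [ "$hour" -lt "$BUSY_END" ]; then
        echo 1         # busy: one backfill per OSD
    else
        echo 2         # idle: two backfills per OSD
    fi
}

# Print (do not run) the command that would apply the value to every OSD.
backfill_command() {
    echo "ceph tell osd.* injectargs '--osd-max-backfills $(backfills_for_hour "$1")'"
}

backfill_command "$(date +%H)"
```

Run from cron once an hour and piped to a shell, something like this would flip the limit automatically; printing first makes it easy to check what would be sent.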

check_health(){
    # Returns 0 if any of the strings in the grep are found, otherwise 1 (or whatever the grep return code is).
    ceph health | grep 'peering\|stale\|activating\|creating\|down' > /dev/null
    return $?
}

# Hold off backfill/recovery and keep OSDs from being marked out or down while PGs are created.
for flag in nobackfill norecover noout nodown
do
    ceph osd set "$flag"
done

# Set your current and destination pg counts here.
for num in {2048..16384}
do
    # Only act on multiples of 256.
    [ $(( num % 256 )) -eq 0 ] || continue

    # Wait until no PGs are in the states above, then raise pg_num.
    while sleep 10
    do
        if ! check_health
        then
            # This assumes your pool is named rbd
            ceph osd pool set rbd pg_num "$num"
            break
        fi
    done
    sleep 60

    # Wait again until everything has settled, then raise pgp_num to match.
    while sleep 10
    do
        if ! check_health
        then
            # This assumes your pool is named rbd
            ceph osd pool set rbd pgp_num "$num"
            break
        fi
    done
    sleep 60
done

# Unset the flags to let the cluster start moving the data around.
for flag in nobackfill norecover noout nodown
do
    ceph osd unset "$flag"
done
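The brace expansion in the loop walks through all 14,337 integers between 2048 and 16384 and skips most of them. A possible alternative, sketched here with a hypothetical helper named pg_steps, is to generate only the values that will actually be applied:

```shell
# Print every intermediate PG count between the current and target values.
# Arguments: current target [step]; step defaults to 256 as in the script.
pg_steps() {
    local current=$1 target=$2 step=${3:-256}
    local next=$(( current + step ))
    while [ "$next" -lt "$target" ]; do
        echo "$next"
        next=$(( next + step ))
    done
    # Always finish exactly on the target, even if it is not a multiple of step.
    if [ "$target" -gt "$current" ]; then
        echo "$target"
    fi
}

# Example: the values the loop would step through for 1024 -> 2048.
pg_steps 1024 2048
```

The outer loop could then read "for num in $(pg_steps 2048 16384)". Unlike the brace expansion, this starts at the first increase rather than at the current value.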

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.

From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Matteo Dacrema [mdacrema@xxxxxxxx]
Sent: Monday, September 19, 2016 2:51 AM
To: Will.Boege; ceph-users@xxxxxxxxxxxxxx
Subject: Re: [ceph-users] [EXTERNAL] Re: Increase PG number

Hi,

I have 3 different clusters.
On the first, I was able to go from 1024 to 2048 PGs with 10 minutes of "I/O freeze".
On the second, I was able to go from 368 to 512 in a second without any performance issue, but going from 512 to 1024 took over 20 minutes to create the PGs.
The third, which I still have to upgrade, is now at 2048 PGs and I have to take it to 16384. So what I'm wondering is how to do it with minimum performance impact.

Maybe the best way is to increase pg_num and pgp_num by 256 at a time, letting the cluster rebalance after each step.

Thanks
Matteo

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.

On 19 Sep 2016, at 05:22, Will.Boege <Will.Boege@xxxxxxxxxx> wrote:

How many PGs do you have - and how many are you increasing it to? 

Increasing PG counts can be disruptive if you are increasing by a large proportion of the initial count, because of all the PG peering involved.  If you are doubling the number of PGs, it might be good to do it in stages to minimize peering.  For example, if you are going from 1024 to 2048, consider 4 increases of 256, allowing the cluster to stabilize in between, rather than one event that doubles the number of PGs.

If you expect this cluster to grow, overshoot the recommended PG count by 50% or so.  This will let you minimize the number of PG increase events, and thus the impact on your users.
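The arithmetic behind both suggestions is simple enough to sketch in shell. The function names below are made up for illustration, and the 50% default just mirrors the "50% or so" above.

```shell
# Number of increases needed to go from current to target in fixed-size steps
# (rounded up, so a final partial step is counted).
pg_step_count() {
    local current=$1 target=$2 step=${3:-256}
    echo $(( (target - current + step - 1) / step ))
}

# PG count after overshooting a recommended value by a percentage (default 50).
overshoot_pgs() {
    local recommended=$1 percent=${2:-50}
    echo $(( recommended * (100 + percent) / 100 ))
}

pg_step_count 1024 2048    # the 1024 -> 2048 example: 4 increases of 256
pg_step_count 2048 16384   # a 2048 -> 16384 change: 56 increases of 256
overshoot_pgs 2048         # a recommended 2048 overshot by 50%: 3072
```

The step count is a quick way to estimate how long a staged increase will take once you know how long one step needs to settle on your hardware.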

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Matteo Dacrema <mdacrema@xxxxxxxx>
Date: Sunday, September 18, 2016 at 3:29 PM
To: Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: [EXTERNAL] Re: [ceph-users] Increase PG number

Hi, thanks for your reply.

Yes, I don't have any near-full OSDs.

The problem is not the rebalancing process but the creation of the new PGs.

I have only 2 hosts running Ceph Firefly, each with 3 SSDs for journaling.
During the creation of new PGs, all attached volumes stop reading and writing and show high iowait.
ceph -s tells me that there are thousands of slow requests.

Once all the PGs are created, the slow requests begin to decrease and the cluster starts rebalancing.
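To watch the slow requests rise and fall during PG creation, a small filter over the health output can help. This is only a sketch: it assumes the health summary contains a phrase of the form "N requests are blocked", which may differ between Ceph versions, and blocked_requests is a hypothetical name.

```shell
# Read 'ceph health' text on stdin and print the number of blocked requests.
# Prints 0 if no "N requests are ..." phrase is found.
# ASSUMPTION: the message format is "<N> requests are blocked > 32 sec".
blocked_requests() {
    awk '{
        for (i = 1; i + 2 <= NF; i++)
            if ($i ~ /^[0-9]+$/ && $(i+1) == "requests" && $(i+2) == "are") {
                n = $i
                exit
            }
    }
    END { print n + 0 }'
}

# Example with a canned health line instead of a live cluster.
echo "HEALTH_WARN 1234 requests are blocked > 32 sec" | blocked_requests
```

On a live cluster this would be fed with "ceph health | blocked_requests", for example inside a loop that logs the count every few seconds.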

Matteo


On 18 Sep 2016, at 13:08, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:

Hi

I am assuming that you do not have any near-full OSDs (either before or during the PG splitting process) and that your cluster is healthy.

To minimize the impact on clients during recovery or operations like PG splitting, it is good to set the following configs. Obviously the whole operation will take longer to complete, but the impact on clients will be minimized.

#  ceph daemon mon.rccephmon1 config show | egrep "(osd_max_backfills|osd_recovery_threads|osd_recovery_op_priority|osd_client_op_priority|osd_recovery_max_active)"
   "osd_max_backfills": "1",
   "osd_recovery_threads": "1",
   "osd_recovery_max_active": "1"
   "osd_client_op_priority": "63",
   "osd_recovery_op_priority": "1"

Cheers

G.
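To confirm that a daemon is really running with the values listed above, the "config show" output can be filtered for one setting at a time. The helper below is a minimal sketch with a made-up name (config_value); it only parses lines in the quoted "name": "value" form shown in the output above.

```shell
# Read 'config show' output on stdin and print the value of one setting.
# Argument: the setting name, e.g. osd_max_backfills.
config_value() {
    sed -n 's/^[[:space:]]*"'"$1"'":[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example with canned output instead of a live daemon.
printf '   "osd_max_backfills": "1",\n   "osd_client_op_priority": "63",\n' \
    | config_value osd_client_op_priority
```

Against a live daemon this could be used as "ceph daemon osd.0 config show | config_value osd_max_backfills", where osd.0 is a placeholder and the command has to run on the host that holds that daemon's admin socket.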

________________________________________

From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Matteo Dacrema [mdacrema@xxxxxxxx]
Sent: 18 September 2016 03:42
To: ceph-users@xxxxxxxxxxxxxx
Subject: [ceph-users] Increase PG number

Hi All,

I need to expand my Ceph cluster and I also need to increase the PG number.

In a test environment I see that all read and write operations stop during PG creation.

Is that normal behavior?

Thanks

Matteo


--

This message has been scanned by Libra ESVA and was found to be clean.

Follow the link below to report it as spam:

http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=D6CF2401EE.A1426


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com