Re: Advice on increasing pgs

When you increase your PGs you're going to be moving around all of your data anyway. Doing a full doubling of your PGs from 64 -> 128 -> 256 -> ... -> 2048 and letting the cluster backfill back to healthy after every step is a lot of extra data movement that isn't needed.

I would recommend setting osd_max_backfills to something that won't cripple your cluster (5 works decently for us), setting the norecover, nobackfill, nodown, and noout flags, and then increasing pg_num and pgp_num slowly until you reach your target. How much you increase pg_num by at a time depends on how much extra RAM you have in each of your storage nodes; we don't do more than ~200 at a time. When you reach your target and there is no more peering happening, unset norecover, nobackfill, and nodown. After all of the backfilling finishes, unset noout. A rough sketch of the whole sequence is below.
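
In case it's useful, here is that sequence written out as a small Python script driving the ceph CLI. The pool name, starting pg_num, step size, and sleep interval are placeholders for your own values, so treat it as a sketch rather than something to run as-is:

#!/usr/bin/env python
# Rough sketch of the procedure above; POOL, START, TARGET, STEP and the
# sleep interval are placeholders -- substitute your own values.
import subprocess
import time

POOL = "data"      # your pool name (placeholder)
START = 64         # current pg_num
TARGET = 2048      # pg_num you are aiming for
STEP = 200         # how much to raise pg_num per increment (RAM dependent)

def ceph(*args):
    """Run a ceph CLI command and return its output."""
    return subprocess.check_output(("ceph",) + args)

# Throttle backfill and keep the cluster from reacting while PGs are created.
ceph("tell", "osd.*", "injectargs", "--osd-max-backfills 5")
for flag in ("norecover", "nobackfill", "nodown", "noout"):
    ceph("osd", "set", flag)

# Walk pg_num and pgp_num up to the target in small steps.
pgs = START
while pgs < TARGET:
    pgs = min(pgs + STEP, TARGET)
    ceph("osd", "pool", "set", POOL, "pg_num", str(pgs))
    ceph("osd", "pool", "set", POOL, "pgp_num", str(pgs))
    time.sleep(30)  # give the mons and osds a moment between steps

# Once peering has finished (watch 'ceph -s'), let data start moving:
for flag in ("norecover", "nobackfill", "nodown"):
    ceph("osd", "unset", flag)
# ...and only after backfilling completes:  ceph osd unset noout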

You are likely to see slow/blocked requests in your cluster throughout this process; the best thing you can do is get through to the other side of the increase. The official recommendation is to plan ahead for the eventual size of your cluster and create your pools with that many PGs from the start, because this process is painful and will slow down your cluster until it's done.
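
For what it's worth, the usual starting point from the Ceph docs is roughly 100 PGs per OSD divided by the replica count, rounded to a power of two. The helper below is just that rule of thumb; the per-OSD target and the assumption that one pool holds most of the data are mine:

import math

def suggested_pg_num(num_osds, replicas, pgs_per_osd=100):
    """Rule-of-thumb pool pg_num: (OSDs * ~100) / replicas, rounded to the
    nearest power of two.  Split the per-OSD budget across pools if several
    pools hold significant data."""
    raw = num_osds * pgs_per_osd / float(replicas)
    lower = 2 ** int(math.floor(math.log(raw, 2)))
    upper = lower * 2
    return lower if (raw - lower) <= (upper - raw) else upper

print(suggested_pg_num(126, 3))   # -> 4096 with ~100 PGs/OSD and 3 replicas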

Note: if you were increasing PGs from 2048 to 4096, then doing it in smaller chunks of 512 at a time could make sense because of how Ceph treats pools with a non-power-of-two number of PGs. If you have 8 PGs that are 4GB each and increase the number to 10 (not a power of 2), you will end up with 6 PGs that are 4GB and 4 PGs that are 2GB; Ceph splits just enough PGs in half to make up the difference. If you went to 14 PGs, you would have 2 PGs that are 4GB and 12 PGs that are 2GB. Finally, when you set it to 16 PGs you would have 16 PGs that are all 2GB. The toy model below walks through those same numbers.
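
To put numbers on that, here is a toy model of the splitting behaviour described above. It is only meant to reproduce the 4GB example, and it only covers an increase of at most one doubling:

def split_pgs(start, new_pg_num, pg_size_gb):
    """Toy model of a pg_num increase of at most one doubling: just enough
    PGs are split in half to reach the new count, the rest keep their
    original size."""
    assert start < new_pg_num <= 2 * start
    split = new_pg_num - start        # PGs that get split in two
    untouched = start - split         # PGs left at their original size
    return [pg_size_gb] * untouched + [pg_size_gb / 2.0] * (split * 2)

print(split_pgs(8, 10, 4))   # 6 x 4GB and 4 x 2GB
print(split_pgs(8, 14, 4))   # 2 x 4GB and 12 x 2GB
print(split_pgs(8, 16, 4))   # 16 x 2GB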

So if you increase your PGs by less than a full doubling, Ceph only splits that many PGs and leaves the rest alone. However, in your scenario of going from 64 PGs to 2048 you are going to be touching all of the PGs on every split anyway, so you buy yourself nothing by doing it in smaller chunks. The reason not to jump pg_num straight to 2048 in a single step is that every new PG has to peer as it is created, and you can peer your OSDs into oblivion and lose access to all of your data for a while. That's why I recommend adding them bit by bit with nodown, noout, nobackfill, and norecover set: you get to the number you want first, and only then tell the cluster to start moving data. The polling sketch below is one way to tell when the peering has settled and it's safe to unset the first three flags.
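
If you want to script the "wait until there is no more peering" part, polling 'ceph pg stat' is crude but works; the poll interval and the exact state strings to look for are my guesses, so adjust them to what your cluster actually reports:

import subprocess
import time

def pgs_unsettled():
    """Return True while 'ceph pg stat' still reports peering/creating PGs."""
    out = subprocess.check_output(["ceph", "pg", "stat"])
    return b"peering" in out or b"creating" in out

while pgs_unsettled():
    time.sleep(10)

for flag in ("norecover", "nobackfill", "nodown"):
    subprocess.check_call(["ceph", "osd", "unset", flag])
# leave noout set until all of the backfilling has finished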

From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Robin Percy [rpercy@xxxxxxxxx]
Sent: Monday, July 11, 2016 2:53 PM
To: ceph-users@xxxxxxxx
Subject: Advice on increasing pgs

Hello,

I'm looking for some advice on how to most safely increase the pgs in our primary ceph pool.

A bit of background: We're running ceph 0.80.9 and have a cluster of 126 OSDs with only 64 pgs allocated to the pool. As a result, 2 OSDs are now 88% full, while the pool is only showing as 6% used.

Based on my understanding, this is clearly a placement problem, so the plan is to increase to 2048 pgs. In order to avoid significant performance degradation, we'll be incrementing pg_num and pgp_num one power of two at a time and waiting for the cluster to rebalance before making the next increment.

My question is: are there any other steps we can take to minimize potential performance impact? And/or is there a way to model or predict the level of impact, based on cluster configuration, data placement, etc?

Thanks in advance for any answers,
Robin
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
