On Thu, Feb 14, 2013 at 9:59 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>
> On Thu, Feb 14, 2013 at 12:21 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> On Thu, 14 Feb 2013, Travis Rhoden wrote:
>>> Hi folks,
>>>
>>> Looking at the docs at [1], I see the following advice:
>>>
>>> "When using multiple data pools for storing objects, you need to ensure
>>> that you balance the number of placement groups per pool with the number
>>> of placement groups per OSD so that you arrive at a reasonable total
>>> number of placement groups that provides reasonably low variance per OSD
>>> without taxing system resources or making the peering process too slow."
>>>
>>> Can someone expound on this a little bit more for me? Does it mean that
>>> if I am going to create 3 or 4 pools, all being used heavily, that
>>> perhaps I should *not* go with the recommended value of
>>> PG = (#OSDs * 100)/replicas? For example, I have 60 OSDs. With two
>>> replicas, that gives me 3000 PGs. I have read that there may be some
>>> benefit to using a power of two, so I was considering making this 4096.
>>> If I do this for 3 or 4 pools, is that too much? That's what I'm really
>>> missing -- how to know when my balance is off and I've really set up too
>>> many PGs, or too many PGs per OSD.
>>
>> That "PG" should probably read "total PGs". So, divide by 3 or 4.
>>
>> Unfortunately, though, there is a <facepalm> in the placement code that
>> makes the placement of PGs for different pools overlap heavily; that will
>> get fixed in cuttlefish. So if the cluster is large, the data
>> distribution will degrade somewhat if there are lots of overlapping
>> pools. For now, I would recommend splitting the difference.
>>
> Ah, interesting! I definitely did not pick up that that formula was giving
> you a target number for total PGs in the system, not per pool. If that is
> the case, though, I have to question how the default sizes get picked when
> using mkcephfs.
> In my 60 OSD example, the recommended number of PGs per the docs would be
> 3000, and indeed mkcephfs made the 3 default pools (2 copies) fairly close
> to that -- 3904 each. But that is per-pool, and the overall number of PGs
> out of the box was 11712. Based on your feedback above, isn't that a
> little high?
>
> I had already added two more pools with 3904 PGs each, and just added
> another with 4096. That brings my total PG count to 23616 (almost 400 per
> OSD). Hearing that the "total" PG count should be more like 3000 makes me
> worried that I have a lot of unnecessary overhead. Thoughts? Am I
> interpreting all this correctly?

Unfortunately, the number of PGs isn't really this cut and dried. The
recommendation of 100 per OSD is based on statistical tests of the evenness
of the data distribution across the cluster, but those tests were all run
using only one pool in the cluster. If your pools all see roughly the same
amount of usage /and they had uncorrelated PG placements/, then this
distribution would roughly maintain itself if you split that 100 PGs/OSD
across multiple pools. Unfortunately, as Sage mentioned, the PG placements
are currently correlated (whoops!).

Now, your OSDs should be able to handle quite a lot more than 100 PGs/OSD
-- Sam guesstimates that (modulo weird hardware configs) you don't really
run into trouble until each OSD is hosting in the neighborhood of 5000 PGs
(so 1600-2500 PGs/OSD with 3x or 2x replication). So I'd bias toward a
per-pool count that is close to 100 per OSD, and then reduce it if
necessary to keep your total from getting ridiculous.

Of course, the long-term vision, once PG merging is written and PG
splitting is a bit more baked, is that the cluster will auto-scale your PG
counts based on the quality of the data distribution and the amount of data
in the PGs.

Hope this clarifies the tradeoffs you're making a bit more!
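For anyone following along, the arithmetic in this thread can be sanity-checked with a small sketch. The helper names are mine, not anything from Ceph, and the constants are just the rules of thumb quoted above (~100 PGs per OSD from the docs, power-of-two rounding as an optional tweak):

```python
def target_total_pgs(num_osds, replicas, pgs_per_osd=100):
    """Docs rule of thumb: total PGs ~= (#OSDs * 100) / replicas."""
    return num_osds * pgs_per_osd // replicas

def next_power_of_two(n):
    """Round up to a power of two, as sometimes suggested for pg_num."""
    p = 1
    while p < n:
        p *= 2
    return p

def pg_copies_per_osd(total_pgs, replicas, num_osds):
    """Average number of PG replicas each OSD ends up hosting."""
    return total_pgs * replicas / num_osds

# Travis's cluster: 60 OSDs, 2x replication.
total = target_total_pgs(60, 2)       # 3000 total PGs, per the docs formula
rounded = next_power_of_two(total)    # 4096, the power-of-two variant

# Six pools totalling 23616 PGs (3 x 3904 default + 2 x 3904 + 1 x 4096):
load = pg_copies_per_osd(23616, 2, 60)  # ~787 PG copies per OSD, well under
                                        # Sam's ~5000 guesstimate
```

Note this computes average PG copies per OSD, which is why 23616 PGs at 2x comes out near 800 per OSD rather than the ~400 primary-count figure mentioned above.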
:)
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com