Re: PG num calculator live on Ceph.com

Excellent, thanks for the detailed breakdown.

Take care,
Bill

From: Michael J. Kidd [michael.kidd@xxxxxxxxxxx]
Sent: Wednesday, January 07, 2015 4:50 PM
To: Sanders, Bill
Cc: Loic Dachary; ceph-users@xxxxxxxx
Subject: Re: PG num calculator live on Ceph.com

Hello Bill,
  Either 2048 or 4096 should be acceptable.  4096 gives about a 300 PG per OSD ratio, which leaves room to triple the OSD count without needing to increase the PG number, while 2048 gives about 150 PGs per OSD, leaving room for only about a 50% OSD count expansion.
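
For a quick sanity check, here's the rough arithmetic behind those ratios (a minimal sketch only, assuming PGs per OSD is approximately pg_num * replica count / OSD count, with the 40 OSDs and replica 3 from your setup below):

'''
def pgs_per_osd(pg_num: int, replicas: int, osds: int) -> float:
    """Approximate PGs hosted per OSD: each PG keeps `replicas` copies
    spread across `osds` OSDs."""
    return pg_num * replicas / osds

# The 40-OSD, 3-replica setup discussed in this thread:
for pg_num in (2048, 4096):
    print(pg_num, "->", round(pgs_per_osd(pg_num, replicas=3, osds=40)), "PGs per OSD")
# 2048 -> 154 PGs per OSD, 4096 -> 307 PGs per OSD
'''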

The high PG count per OSD issue really doesn't manifest aggressively until you get to around 1000 PGs per OSD and beyond.  At those levels, steady-state operation continues without issue, but recovery within the cluster will see the memory utilization of the OSDs climb, and could push into out-of-memory conditions on the OSD host (or, at a minimum, heavy swap usage if enabled).  Whether you actually experience issues still depends, of course, on the number of OSDs per node and the amount of memory on the node.

As an example, though, I worked on a cluster which was running about 5500 PGs per OSD.  The cluster experienced a network config issue in the switchgear which isolated 2/3 of the OSD nodes from each other and from the other 1/3 of the cluster.  When the network issue was cleared, the OSDs started dropping like flies... they'd start up, spool up the memory they needed for map update parsing, and get killed before making any real headway.  We were finally able to get the cluster online by limiting what the OSDs were doing to a small slice of the normal start-up, waiting for the OSDs to calm down, then opening up a bit more for them to do (noup, noin, norecover, nobackfill, pause, noscrub, and nodeep-scrub were all set, and then unset one at a time until all OSDs were up/in and able to handle the recovery).
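
For illustration only (not the exact commands from that incident), that flag sequence maps onto the standard `ceph osd set` / `ceph osd unset` CLI; a rough Python sketch that just shells out to the ceph binary:

'''
import subprocess

# Flags named above; while set, restarted OSDs are not marked up/in and the
# cluster holds off on client IO, recovery, backfill, and scrubs.
FLAGS = ["noup", "noin", "norecover", "nobackfill", "pause", "noscrub", "nodeep-scrub"]

def ceph(*args: str) -> None:
    """Thin wrapper around the real `ceph` CLI (assumed to be on the PATH)."""
    subprocess.run(["ceph", *args], check=True)

def set_all_flags() -> None:
    for flag in FLAGS:
        ceph("osd", "set", flag)

def unset_flag(flag: str) -> None:
    # Clear one flag at a time, waiting for the OSDs to settle before the next.
    ceph("osd", "unset", flag)
'''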

Six weeks later, that same cluster lost about 40% of its OSDs during a power outage, due to corruption from an HBA bug (it didn't flush the write cache to disk).  This pushed the PG per OSD count over 9000!  It simply couldn't recover with the available memory at that PG count.  Each OSD, started by itself, would consume > 60 GB of RAM and get killed (the nodes only had 64 GB total).

While this is an extreme example, we see cases with > 1000 PGs per OSD on a regular basis.  This is the type of thing we're trying to head off.

It should be noted that you can increase the pg_num of a pool, but you cannot decrease it!  The only way to reduce your cluster's PG count is to create new pools with a smaller pg_num, migrate the data, and then delete the old, high PG count pools.  You could also simply add more OSDs to reduce the PG per OSD ratio.
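
For reference, the increase path looks roughly like the following (the pool name and target here are placeholders only; pgp_num also has to be raised to match pg_num before the data actually rebalances onto the new PGs):

'''
import subprocess

def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)

# Hypothetical pool name and target; pg_num can only be raised, never lowered.
pool, new_pg_num = "volumes", 2048
ceph("osd", "pool", "set", pool, "pg_num", str(new_pg_num))
# pgp_num must follow pg_num for the new PGs to actually be used for placement.
ceph("osd", "pool", "set", pool, "pgp_num", str(new_pg_num))
'''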

The issue with too few PGs is poor data distribution.  So it's all about having enough PGs to get good data distribution without going so high that you hit resource exhaustion during recovery.

Hope this helps put things into perspective.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 4:34 PM, Sanders, Bill <Bill.Sanders@xxxxxxxxxxxx> wrote:
This is interesting.  Kudos to you guys for getting the calculator up; I think this'll help some folks.

I have 1 pool, 40 OSDs, and a replica count of 3.  I based my PG count on: http://ceph.com/docs/master/rados/operations/placement-groups/

'''
Less than 5 OSDs set pg_num to 128
Between 5 and 10 OSDs set pg_num to 512
Between 10 and 50 OSDs set pg_num to 4096
'''

But the calculator gives a different result: 2048.  Out of curiosity, what sorts of issues might one encounter by having too many placement groups?  I understand there's some resource overhead.  I don't suppose it would manifest itself in a recognizable way?
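
For what it's worth, here's the rule of thumb I assume the calculator is applying (a sketch only: target of roughly 100 PGs per OSD, rounded up to the next power of two; the exact logic is in the pgcalc page itself):

'''
import math

def suggested_pg_num(osds: int, replicas: int, target_per_osd: int = 100) -> int:
    """Aim for roughly target_per_osd PGs per OSD, then round up to a power of two."""
    raw = osds * target_per_osd / replicas
    return 2 ** math.ceil(math.log2(raw))

print(suggested_pg_num(osds=40, replicas=3))  # 40 * 100 / 3 = 1333.3 -> 2048
'''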

Bill


From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Michael J. Kidd [michael.kidd@xxxxxxxxxxx]
Sent: Wednesday, January 07, 2015 3:51 PM
To: Loic Dachary
Cc: ceph-users@xxxxxxxx
Subject: Re: PG num calculator live on Ceph.com

> Where is the source?
On the page :)  It does link out to jQuery and jQuery UI, but all the custom bits are embedded in the HTML.

Glad it's helpful :)

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:


On 07/01/2015 23:08, Michael J. Kidd wrote:
> Hello all,
>   Just a quick heads up that we now have a PG calculator to help determine the proper PG per pool numbers to achieve a target PG per OSD ratio.
>
> http://ceph.com/pgcalc
>
> Please check it out!  Happy to answer any questions, and always welcome any feedback on the tool / verbiage, etc...

Great work! That will be immensely useful :-)

Where is the source?

Cheers

>
> As an aside, we're also working to update the documentation to reflect the best practices.  See the Ceph.com tracker for this at:
> http://tracker.ceph.com/issues/9867
>
> Thanks!
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Loïc Dachary, Artisan Logiciel Libre



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
