Excellent, thanks for the detailed breakdown.
Take care,
Bill
From: Michael J. Kidd [michael.kidd@xxxxxxxxxxx]
Sent: Wednesday, January 07, 2015 4:50 PM
To: Sanders, Bill
Cc: Loic Dachary; ceph-users@xxxxxxxx
Subject: Re: PG num calculator live on Ceph.com
Hello Bill,
Either 2048 or 4096 should be acceptable. 4096 gives about a 300 PG per OSD ratio, which would leave room for tripling the OSD count without needing to increase the PG number, while 2048 gives about 150 PGs per OSD, which only leaves room for about a 50% OSD count expansion.
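(For reference, the ratio is just (pg_num * replica count) / OSD count. Assuming a replica size of 3 and roughly 40 OSDs -- I'm guessing at your exact numbers here -- that works out to:
    4096 * 3 / 40 ~= 307 PGs per OSD
    2048 * 3 / 40 ~= 154 PGs per OSD
)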
The high PG count per OSD issue really doesn't manifest aggressively until you get to around 1000 PGs per OSD and beyond. At those levels, steady-state operation continues without issue, but recovery within the cluster will see the memory utilization of the OSDs climb and can push into out-of-memory conditions on the OSD host (or at a minimum, heavy swap usage if enabled). Whether you actually experience issues depends, of course, on the number of OSDs per node and the amount of memory on each node.
As an example though, I worked on a cluster that was running about 5500 PGs per OSD. The cluster experienced a network config issue in the switchgear which isolated 2/3 of the OSD nodes from each other and from the other 1/3 of the cluster. When the network issue was cleared, the OSDs started dropping like flies... They'd start up, spool up the memory they needed for map update parsing, and get killed before making any real headway. We were finally able to get the cluster online by limiting what the OSDs were doing to a small slice of the normal start-up, waiting for the OSDs to calm down, then opening up a bit more for them to do (noup, noin, norecover, nobackfill, pause, noscrub and nodeep-scrub were all set, and then unset one at a time until all OSDs were up/in and able to handle the recovery).
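For reference, the flag juggling was done with the usual cluster flag commands, roughly along these lines (the order and pacing will vary by cluster):
    ceph osd set noup
    ceph osd set noin
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set pause
    ceph osd set noscrub
    ceph osd set nodeep-scrub
and then, once things settled, 'ceph osd unset <flag>' for each one in turn, watching the OSDs stabilize between each step.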
Six weeks later, that same cluster lost about 40% of its OSDs during a power outage due to corruption from an HBA bug (it didn't flush the write cache to disk). This pushed the PG per OSD count over 9000!! It simply couldn't recover with the available memory at that PG count: each OSD, started by itself, would consume > 60 GB of RAM and get killed (the nodes only had 64 GB total).
While this is an extreme example, we see cases with > 1000 PGs per OSD on a regular basis. This is the type of thing we're trying to head off.
It should be noted that you can increase the PG num of a pool, but you cannot decrease it! The only way to reduce your cluster's PG count is to create new pools with a smaller PG num, migrate the data, and then delete the old, high PG count pools. You could also simply add more OSDs to reduce the PG per OSD ratio.
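If you do end up bumping a pool later, it's the usual two-step ('data' here is just a placeholder pool name):
    ceph osd pool set data pg_num 4096
    ceph osd pool set data pgp_num 4096
pgp_num is what actually triggers the data movement, so it's worth raising it in steps if the cluster is busy.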
The issue with too few PGs is poor data distribution. So it's all about having enough PGs to get good data distribution without going so high that you hit resource exhaustion during recovery.
Hope this helps put things into perspective.