On 06-08-15 10:16, Hector Martin wrote:
> We have 48 OSDs (on 12 boxes, 4T per OSD) and 4 pools:
> - 3 replicated pools (3x)
> - 1 RS pool (5+2, size 7)
>
> The docs say:
> http://ceph.com/docs/master/rados/operations/placement-groups/
> "Between 10 and 50 OSDs set pg_num to 4096"
>
> Which is what we did when creating those pools. This yields 16384 PGs
> over 48 OSDs, which sounded reasonable at the time: 341 per OSD.
>

The amount of PGs is cluster-wide and not per pool. So if you have 48 OSDs
the rule of thumb is: 48 * 100 / 3 = 1600 PGs cluster-wide.

Now, with enough memory you can easily have 100 PGs per OSD, but keep in
mind that the PG count is cluster-wide and not per pool.

Wido

> However, upon upgrade to Hammer, it started complaining:
>      health HEALTH_WARN
>             too many PGs per OSD (1365 > max 300)
>
> It seems the actual math multiplies everything by the size of the pools
> (which in retrospect makes sense): (3*4096*3 + 1*4096*7) / 48 = 1365
>
> And Hammer by default sets:
> mon_pg_warn_max_per_osd = 300
>
> For now I'm just going to bump up the setting to make the warning go
> away, but I'm concerned about the implications of this. Two of the 3x
> pools are not production and I can nuke and re-create them (with 512 PGs
> instead? Does that sound reasonable?), but the RS pool and the other rep
> pool are, and there's no simple way for us to re-create them at this
> point (though that might be a good excuse to develop something that
> would enable that - which might be doable-ish for the RS pool at least,
> which is the biggest offender).
>
> Questions:
> - Does this mean that the docs are wrong and need fixing? It seems that
>   blindly following the docs can easily yield per-OSD PG counts that are
>   off by a factor of 5 from the max, without doing anything too weird
>   (just 4 reasonably simple pools).
> - Should I be concerned about the performance impact? How was the value
>   300 arrived at?
> - We're going to be using this cluster for more things (services), which
>   means creating more pools. Should I plan ahead for, say, a time when we
>   have 12 pools on it, and divide everything by 12? The cluster is
>   currently very overprovisioned for space, so we're probably not going
>   to be adding OSDs for quite a while, but we'll be adding pools.
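
PS: in case it helps anyone checking their own numbers, here is a quick
back-of-the-envelope Python sketch (my own, not an official Ceph tool) that
redoes the per-OSD PG math the Hammer warning is based on. The OSD count and
pool list below are just the ones from Hector's mail; nothing here talks to a
live cluster.

#!/usr/bin/env python
# Back-of-the-envelope check of PGs per OSD (not an official Ceph tool).
# Cluster from the mail above: 48 OSDs, three 3x replicated pools and one
# 5+2 erasure-coded pool (size 7), all created with pg_num = 4096.

num_osds = 48
pools = [
    (4096, 3),  # replicated pool, size 3
    (4096, 3),  # replicated pool, size 3
    (4096, 3),  # replicated pool, size 3
    (4096, 7),  # EC pool 5+2, so 7 shards per PG
]

# Every PG places 'size' copies/shards on OSDs, so the per-OSD load is
# sum(pg_num * size) / number of OSDs -- the figure the warning compares
# against mon_pg_warn_max_per_osd.
pg_copies = sum(pg_num * size for pg_num, size in pools)
print("PG copies cluster-wide: %d" % pg_copies)              # 65536
print("PGs per OSD: %.0f" % (pg_copies / float(num_osds)))   # ~1365

# Rule of thumb (~100 PGs per OSD), solved for total pg_num at size 3:
print("Suggested total pg_num at 3x: %d" % (num_osds * 100 // 3))  # 1600

If you add pools later, just extend the list; as long as the "PGs per OSD"
figure stays under mon_pg_warn_max_per_osd, the warning should stay away.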