Re: too many PGs per OSD (307 > max 300)

Chengwei Yang <chengwei.yang.cn@xxxxxxxxx> · Mon, 1 Aug 2016 14:07:21 +0800

On Mon, Aug 01, 2016 at 10:37:27AM +0900, Christian Balzer wrote:
> On Fri, 29 Jul 2016 16:20:03 +0800 Chengwei Yang wrote:
> 
> > On Fri, Jul 29, 2016 at 11:47:59AM +0900, Christian Balzer wrote:
> > > On Fri, 29 Jul 2016 09:59:38 +0800 Chengwei Yang wrote:
> > > 
> > > > Hi list,
> > > > 
> > > > I just followed the placement group guide to set pg_num for the rbd pool.
> > > > 
> > > How many other pools do you have, or is that the only pool?
> > 
> > Yes, this is the only one.
> > 
> > > 
> > > The numbers mentioned are for all pools, not per pool, something that
> > > isn't abundantly clear from the documentation either.
> > 
> > Exactly, especially for a newbie like me. :-)
> > 
> Given how often and for how LONG this issue has come up, the documentation
> really needs a rewrite and lots of BOLD sentences.
> 
> > > 
> > > >   "
> > > >   Less than 5 OSDs set pg_num to 128
> > > >   Between 5 and 10 OSDs set pg_num to 512
> > > >   Between 10 and 50 OSDs set pg_num to 4096
> > > >   If you have more than 50 OSDs, you need to understand the tradeoffs and how to
> > > >   calculate the pg_num value by yourself
> > > >   For calculating pg_num value by yourself please take help of pgcalc tool
> > > >   "
> > > > 
> > > You should have heeded the hint about pgcalc, which is by far the best
> > > thing to do.
> > > The above numbers are an (imprecise) attempt to give a quick answer to a
> > > complex question.
> > > 
> > > > Since I have 40 OSDs, I set pg_num to 4096 according to the above
> > > > recommendation.
> > > > 
> > > > However, after setting both pg_num and pgp_num to 4096, I found that my
> > > > ceph cluster is in **HEALTH_WARN** status, which surprised me and still
> > > > confuses me.
> > > > 
> > > PGcalc would recommend 2048 PGs at most (for a single pool) with 40 OSDs.
> > 
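
For reference, the arithmetic behind that figure can be sketched as follows. This is only a rough approximation of the calculation described on the PGcalc page, not its actual code: the helper name `suggest_pg_num`, the replica size of 3, and the targets of 100 and 200 PGs per OSD are assumptions, and the per-pool minimum is not modelled.

```python
def suggest_pg_num(osd_count, size, target_pgs_per_osd=100, data_fraction=1.0):
    """Rough PGcalc-style estimate for a single pool (hypothetical helper).

    osd_count          -- number of OSDs the pool is spread over
    size               -- replica count of the pool (assumed, e.g. 3)
    target_pgs_per_osd -- desired PGs per OSD across all pools (assumed)
    data_fraction      -- share of the cluster's data held by this pool
    """
    raw = target_pgs_per_osd * osd_count * data_fraction / size
    raw = max(raw, 1.0)  # guard so the power-of-two search below is defined

    # Powers of two directly below and above the raw value.
    lower = 1 << (int(raw).bit_length() - 1)
    upper = lower * 2

    # Pick the nearest power of two ...
    nearest = lower if (raw - lower) <= (upper - raw) else upper
    # ... but step up if that would land more than 25% below the raw value.
    if nearest < raw * 0.75:
        nearest = upper
    return nearest


print(suggest_pg_num(40, 3))                          # target 100 per OSD
print(suggest_pg_num(40, 3, target_pgs_per_osd=200))  # target 200 per OSD
```

With 40 OSDs and size 3, this sketch yields 1024 for a target of 100 PGs per OSD and 2048 for a target of 200, which is consistent with the "2048 at most" above.
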
> > BTW, I read PGcalc and found that it may also have a flaw, as it says:
> > 
> > "
> > If the value of the above calculation is less than the value of (OSD#) / (Size),
> > then the value is updated to the value of ((OSD#) / (Size)). This is to ensure
> > even load / data distribution by allocating at least one Primary or Secondary PG
> > to every OSD for every Pool.
> > "
> > 
> > However, in PGcalc's **OpenStack w RGW** use case, there are a lot of small
> > pools with 32 PGs, which is apparently smaller than OSD# / Size
> > (100/3 ~= 33.33).
> > 
> > I do mean it, even though it's not smaller by much. :-)
> >
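
The comparison in question can be checked with a few lines. This is a minimal sketch of the quoted rule only; the function name `pgcalc_floor` is hypothetical, and the values 100 OSDs, size 3 and 32 PGs are the ones from the example above.

```python
import math

def pgcalc_floor(osd_count, size):
    """Smallest whole pg_num that is not below OSD# / Size (quoted rule)."""
    return math.ceil(osd_count / size)

osd_count, size, pg_num = 100, 3, 32
floor = pgcalc_floor(osd_count, size)

# 100 / 3 is about 33.33, so 34 PGs would be needed and 32 falls short by 2.
print(f"floor={floor}, pg_num={pg_num}, short by {max(floor - pg_num, 0)}")
```
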
> 
> Well, there are always trade-offs to "automatic" solutions like this when
> operating either small or large clusters.
> 
> While the goal of distributing pools amongst all OSDs is commendable, it
> is also not going to be realistic in all cases.
> 
> Nor is it typically necessary, since a small (data size) pool is
> supposedly going to see less activity than a larger one, so the amount of
> IOPS (# of OSDs) is going to be lower, too.
> 
> In cases where that might not be true (CephFS metadata comes to mind),
> putting such a pool on SSD based OSDs might be the better choice than
> increasing PGs on HDD based OSDs.
> 
> Or if you have a large (data size) pool that is being used for something
> like backups and sees very little activity, give that one fewer PGs than
> you'd normally do and give those PGs to more active ones.

Thanks, it's much clearer to me now.

> 
> It boils down to the "understanding" part.
> 
> > > 
> > > I assume the above high number of 4096 stems from the wisdom that with
> > > small clusters more PGs than normally recommended (100 per OSD) can be
> > > helpful. 
> > > It was also probably written before those WARN calculations were added to
> > > Ceph.
> > > 
> > > The above would better read:
> > > ---
> > > Use PGcalc!
> > > [...]
> > > Between 10 and 20 OSDs set pg_num to 1024
> > > Between 20 and 40 OSDs set pg_num to 2048
> > > 
> > > Over 40 definitely use and understand PGcalc.
> > > ---
> > > 
> > > >   cluster bf6fa9e4-56db-481e-8585-29f0c8317773
> > > >      health HEALTH_WARN
> > > >             too many PGs per OSD (307 > max 300)
> > > > 
> > > > I see the cluster also says "4096 active+clean" so it's safe, but I do not like
> > > > the HEALTH_WARN in any way.
> > > >
> > > You can ignore it, but yes, it is annoying.
> > >  
> > > > As far as I know (from ceph -s output), the recommended number of PGs per
> > > > OSD is in [30, 300]; any value outside this range will bring the cluster
> > > > to HEALTH_WARN.
> > > > 
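
The 307 in the warning can be reproduced with simple arithmetic. The sketch below assumes a replica size of 3 (not stated in this thread, but it matches the reported number) and counts every replica of every PG; the function names are hypothetical and the way Ceph itself derives the figure may differ in detail.

```python
def pgs_per_osd(pools, osd_count):
    """Average PG replicas per OSD.

    pools -- list of (pg_num, size) tuples, one per pool
    """
    total_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_replicas // osd_count


def pg_health(per_osd, low=30, high=300):
    """Classify a per-OSD PG count against the [30, 300] range."""
    if per_osd > high:
        return f"too many PGs per OSD ({per_osd} > max {high})"
    if per_osd < low:
        return f"too few PGs per OSD ({per_osd} < min {low})"
    return "ok"


# One pool with 4096 PGs, assumed size 3, on 40 OSDs: 4096 * 3 / 40 = 307.2
print(pg_health(pgs_per_osd([(4096, 3)], 40)))
# The same pool with 2048 PGs: 2048 * 3 / 40 = 153.6
print(pg_health(pgs_per_osd([(2048, 3)], 40)))
```

Under these assumptions, 4096 PGs lands just above the upper bound, while 2048 PGs sits comfortably inside the range.
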
> > > > So what I would like to ask is: is the documentation misleading? Should we fix it?
> > > > 
> > > Definitely.
> > 
> > OK, I'd like to submit a PR.
> > 
> Go right ahead and don't look at me, I'm not working for Red Hat, Ceph. ^o^
> 
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

-- 
Thanks,
Chengwei
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com