Re: too many PGs per OSD (307 > max 300)

On Fri, 29 Jul 2016 16:20:03 +0800 Chengwei Yang wrote:

> On Fri, Jul 29, 2016 at 11:47:59AM +0900, Christian Balzer wrote:
> > On Fri, 29 Jul 2016 09:59:38 +0800 Chengwei Yang wrote:
> > 
> > > Hi list,
> > > 
> > > I just followed the placement group guide to set pg_num for the rbd pool.
> > > 
> > How many other pools do you have, or is that the only pool?
> 
> Yes, this is the only one.
> 
> > 
> > The numbers mentioned are for all pools, not per pool, something that
> > isn't abundantly clear from the documentation either.
> 
> Exactly, especially for a newbie like me. :-)
> 
Given how often and for how LONG this issue has come up, it really needs a
rewrite and lots of BOLD sentences. 

> > 
> > >   "
> > >   Less than 5 OSDs set pg_num to 128
> > >   Between 5 and 10 OSDs set pg_num to 512
> > >   Between 10 and 50 OSDs set pg_num to 4096
> > >   If you have more than 50 OSDs, you need to understand the tradeoffs and how to
> > >   calculate the pg_num value by yourself
> > >   For calculating pg_num value by yourself please take help of pgcalc tool
> > >   "
> > > 
> > You should have heeded the hint about PGcalc, which is by far the best
> > thing to do.
> > The above numbers are an (imprecise) attempt to give a quick answer to a
> > complex question.
> > 
> > > Since I have 40 OSDs, so I set pg_num to 4096 according to the above
> > > recommendation.
> > > 
> > > However, after configuring both pg_num and pgp_num to 4096, I found that my
> > > ceph cluster is in **HEALTH_WARN** status, which did surprise me and is still
> > > confusing me.
> > > 
> > PGcalc would recommend 2048 PGs at most (for a single pool) with 40 OSDs.
> 
> BTW, I read PGcalc and found that it may also have a flaw, as it says:
> 
> "
> If the value of the above calculation is less than the value of (OSD#) / (Size),
> then the value is updated to the value of ((OSD#) / (Size)). This is to ensure
> even load / data distribution by allocating at least one Primary or Secondary PG
> to every OSD for every Pool.
> "
> 
> However, in the above **OpenStack w RGW** use case, there are a lot of small
> pools with 32 PGs, which is apparently smaller than OSD / Size (100/3 ~= 33.33).
> 
> I do mean it, even though it's not smaller by much. :-)
>

Well, there are always trade-offs to "automatic" solutions like this when
operating either small or large clusters.

While the goal of distributing pools amongst all OSDs is commendable, it
is also not going to be realistic in all cases.

Nor is it typically necessary, since a small (data size) pool is
supposedly going to see less activity than a larger one, so the amount of
IOPS it needs (and thus the number of OSDs it needs to span) is going to be
lower, too.

In cases where that might not be true (CephFS metadata comes to mind),
putting such a pool on SSD-based OSDs might be a better choice than
increasing PGs on HDD-based OSDs.

Or if you have a large (data size) pool that is being used for something
like backups and sees very little activity, give that one fewer PGs than
you normally would and give those PGs to more active pools.

It boils down to the "understanding" part.
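
For what it's worth, the arithmetic behind that warning is simple enough to
write down. Below is a minimal sketch (plain Python, nothing from the Ceph
tree) of what the monitor is effectively counting, assuming replicated pools
and the numbers from this thread: one pool, the default replica size of 3,
40 OSDs.
---
# Rough sketch of the "PGs per OSD" arithmetic, not actual Ceph code.

def pgs_per_osd(pools, num_osds):
    """pools is a list of (pg_num, replica_size) tuples.

    Each PG places replica_size copies across the OSDs, so the average
    number of PG copies per OSD is the total, summed over ALL pools,
    divided by the OSD count -- which is what the warning is about.
    """
    total_pg_copies = sum(pg_num * size for pg_num, size in pools)
    return total_pg_copies / float(num_osds)

def min_pg_num(num_osds, size):
    """The PGcalc floor quoted above: at least one PG copy per OSD."""
    return num_osds / float(size)

if __name__ == "__main__":
    osds = 40
    print(pgs_per_osd([(4096, 3)], osds))  # 307.2 -> "307 > max 300"
    print(pgs_per_osd([(2048, 3)], osds))  # 153.6 -> well inside [30, 300]
    print(min_pg_num(100, 3))              # ~33.3, cf. the 32 PG pools in PGcalc
---
Run it and you get 307.2 for the 4096 PG pool, i.e. the "307 > max 300"
above, versus 153.6 for the 2048 PGs that PGcalc would suggest.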

> > 
> > I assume the above high number of 4096 stems from the wisdom that with
> > small clusters more PGs than normally recommended (100 per OSD) can be
> > helpful. 
> > It was also probably written before those WARN calculations were added to
> > Ceph.
> > 
> > The above would better read:
> > ---
> > Use PGcalc!
> > [...]
> > Between 10 and 20 OSDs set pg_num to 1024
> > Between 20 and 40 OSDs set pg_num to 2048
> > 
> > Over 40 definitely use and understand PGcalc.
> > ---
> > 
> > >   cluster bf6fa9e4-56db-481e-8585-29f0c8317773
> > >      health HEALTH_WARN
> > >             too many PGs per OSD (307 > max 300)
> > > 
> > > I see the cluster also says "4096 active+clean" so it's safe, but I do not like
> > > the HEALTH_WARN in any way.
> > >
> > You can ignore it, but yes, it is annoying.
> >  
> > > As I know (from the ceph -s output), the recommended pg_num per OSD is [30, 300]; any
> > > value outside this range will bring the cluster to HEALTH_WARN.
> > > 
> > > So what I would like to say: is the document misleading? Should we fix it?
> > > 
> > Definitely.
> 
> OK, I'd like to submit a PR.
> 
Go right ahead and don't look at me; I'm not working for Red Hat or Ceph. ^o^


Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



