Re: "too many PGs per OSD" in Hammer

Here's a little more information on our use case: https://github.com/deis/deis/issues/3638

On Wed, May 6, 2015 at 2:53 PM, Chris Armstrong <carmstrong@xxxxxxxxxxxxxx> wrote:
Thanks for the feedback. That language is confusing to me, then, since the first paragraph seems to suggest using a pg_num of 128 when there are fewer than 5 OSDs, as we have here.

The warning below that reads: "As the number of OSDs increases, choosing the right value for pg_num becomes more important because it has a significant influence on the behavior of the cluster as well as the durability of the data when something goes wrong (i.e. the probability that a catastrophic event leads to data loss)." That suggests the choice becomes an issue with more OSDs, which doesn't apply here.

Do we know if this warning threshold is calculated based on the resources of the host? If I try with larger machines, will the warning change?
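
My guess (untested) is that the threshold is just a monitor config option rather than anything derived from host resources; the "max 300" in the warning looks like the default of mon_pg_warn_max_per_osd. If that's right, something like the following against one of our mons' admin sockets should confirm it (assuming the mon id matches the hostname, which may not hold in our setup):

root@ca-deis-1:/# ceph daemon mon.ca-deis-1 config get mon_pg_warn_max_per_osd

If that comes back as a flat 300, then larger machines presumably wouldn't change the warning at all.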

On Wed, May 6, 2015 at 2:41 PM, <ceph@xxxxxxxxxxxxxx> wrote:
Hi,

You have too many PGs for too few OSDs.
As the docs you linked say:

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the number
of placement groups per OSD so that you arrive at a reasonable total
number of placement groups that provides reasonably low variance per OSD
without taxing system resources or making the peering process too slow.

For instance a cluster of 10 pools each with 512 placement groups on ten
OSDs is a total of 5,120 placement groups spread over ten OSDs, that is
512 placement groups per OSD. That does not use too many resources.
However, if 1,000 pools were created with 512 placement groups each, the
OSDs will handle ~50,000 placement groups each and it would require
significantly more resources and time for peering.

So: remove unused pools, or add OSDs.
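
For your cluster the arithmetic works out roughly like this (assuming, per your osd dump, 12 pools each with pg_num 128 and size 3, spread over 3 OSDs):

# 12 pools x 128 PGs x 3 replicas = 4608 PG copies,
# spread over 3 OSDs = 1536 PGs per OSD, versus the warning threshold of 300
echo $(( 12 * 128 * 3 / 3 ))

Dropping the rgw pools you don't use, or recreating pools with a much smaller pg_num, would bring that number down (pg_num cannot be decreased on an existing pool).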

On 06/05/2015 23:32, Chris Armstrong wrote:
> Hi folks,
>
> Calling on the collective Ceph knowledge here. Since upgrading to
> Hammer, we're now seeing:
>
>      health HEALTH_WARN
>             too many PGs per OSD (1536 > max 300)
>
> We have 3 OSDs, so we used a pg_num of 128 based on the
> suggestion here:
> http://ceph.com/docs/master/rados/operations/placement-groups/
>
> We're also using the 12 default pools:
> root@ca-deis-1:/# ceph osd lspools
> 0 rbd,1 data,2 metadata,3 .rgw.root,4 .rgw.control,5 .rgw,6 .rgw.gc,7
> .users.uid,8 .users,9 .rgw.buckets.index,10 .rgw.buckets,11
> .rgw.buckets.extra,
>
>
> Here's the output of ceph osd dump:
>
> root@ca-deis-1:/# ceph osd dump
> epoch 46
> fsid 7bd27c76-f5f8-4eea-819b-379177929653
> created 2015-05-06 20:40:01.658764
> modified 2015-05-06 21:05:18.391730
> flags
> pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 18 flags hashpspool
> stripe_width 0
> pool 1 'data' replicated size 3 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 11 flags hashpspool
> crash_replay_interval 45 stripe_width 0
> pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 10 flags
> hashpspool stripe_width 0
> pool 3 '.rgw.root' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 20 flags
> hashpspool stripe_width 0
> pool 4 '.rgw.control' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 22 flags
> hashpspool stripe_width 0
> pool 5 '.rgw' replicated size 3 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 24 flags hashpspool
> stripe_width 0
> pool 6 '.rgw.gc' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 25 flags
> hashpspool stripe_width 0
> pool 7 '.users.uid' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 26 flags
> hashpspool stripe_width 0
> pool 8 '.users' replicated size 3 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 28 flags hashpspool
> stripe_width 0
> pool 9 '.rgw.buckets.index' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 30 flags
> hashpspool stripe_width 0
> pool 10 '.rgw.buckets' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 35 flags
> hashpspool stripe_width 0
> pool 11 '.rgw.buckets.extra' replicated size 3 min_size 1 crush_ruleset
> 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 40 flags
> hashpspool stripe_width 0
> max_osd 3
> osd.0 up   in  weight 1 up_from 4 up_thru 45 down_at 0 last_clean_interval [0,0) 10.132.162.16:6800/1 10.132.162.16:6801/1 10.132.162.16:6802/1 10.132.162.16:6803/1 exists,up d996b242-7fce-475f-a889-fa14038de180
> osd.1 up   in  weight 1 up_from 7 up_thru 45 down_at 0 last_clean_interval [0,0) 10.132.253.121:6800/1 10.132.253.121:6801/1 10.132.253.121:6802/1 10.132.253.121:6803/1 exists,up 8ef7080d-ca37-4003-ae54-b76ddd13f752
> osd.2 up   in  weight 1 up_from 45 up_thru 45 down_at 43 last_clean_interval [38,44) 10.132.253.118:6801/1 10.132.253.118:6805/1000001 10.132.253.118:6806/1000001 10.132.253.118:6807/1000001 exists,up 7b30f8aa-732b-4dca-bfbd-2dca9fb3c5ec
>
> Note that we have 3 replicas of our data (size 3) so that we can operate
> with just one host up.
>
> We've seen performance issues before (especially during platform start),
> which has me thinking - are we using too many placement groups given the
> small number of OSDs and the fact that we're forcing each OSD to have a
> full set of the data with size=3? Maybe the performance issues are to be
> expected since we're pushing around so many PGs on startup.
>
> This logic hasn't changed since we were running Firefly and Giant, so
> I'm not sure what's different now. Any guidance is appreciated.
>
> Thanks!
>
> Chris
>
>




--
Chris Armstrong | Deis Team Lead | Engine Yard t: @carmstrong_afk | gh: carmstrong


Deis is now part of Engine Yard! http://deis.io/deis-meet-engine-yard/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
