Re: Missing hashpspool on some pools

On Mon, 9 Jan 2017, Stillwell, Bryan J wrote:
> Recently I noticed that we're missing the 'hashpspool' flag on some of our
> production pools which is causing the acting set of OSDs to be the same
> across PGs in different pools:
> 
>   3.17 [1089,17,447]
>   4.16 [1089,17,447]
>   6.14 [1089,17,447]
>   ^-- Notice how if you add the pool number and pg number together you get
>   20 for the PGs with the same acting set
> 
> 
>   3.18 [34,146,387]
>   4.17 [34,146,387]
>   6.15 [34,146,387]
>   ^-- Notice how if you add the pool number and pg number together you get
>   21 for the PGs with the same acting set
> 
> 
> Here's what those pools look like:
> 
>   pool 3 'images' replicated size 3 min_size 2 crush_ruleset 2 object_hash
> rjenkins pg_num 4096 pgp_num 4096 last_change 842505
> min_read_recency_for_promote 1 stripe_width 0
>   pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 2
> object_hash rjenkins pg_num 32768 pgp_num 32768 last_change 842506
> min_read_recency_for_promote 1 stripe_width 0
>   pool 6 'instances' replicated size 3 min_size 2 crush_ruleset 2
> object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 842337
> min_read_recency_for_promote 1 stripe_width 0
> 
> 
> From what I've read it seems like setting the 'hashpspool' flag would not
> only help with data distribution, but also improve recovery times, since
> there would be more diversity in the PGs' acting sets.
> 
> The problem with setting this flag on existing pools is that every PG in
> that pool will get remapped to a new set of OSDs.  The data distribution
> on those pools is ~8% for images, ~9% for instances, and ~83% for volumes.
>  So I'm thinking that I could set the 'hashpspool' flag on only the images
> and instances pools and get the same benefit as setting it for all three,
> but my question for everyone is whether or not it is worth it?

You're right that setting it on just the smaller pools will get all the 
benefit, and only move ~17% of your data.  I can't really tell you if 
it's worth it or not, though... it basically means that 1/8th of your PGs 
are 3x bigger than the others, which throws off the data balance 
somewhat, but not *that* much.  It probably depends on how close to 
capacity you run (or plan to run) things.
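To make the collision concrete, here is a hedged sketch (not Ceph's actual code; modeled loosely on the placement-seed logic in OSDMap, with a stand-in hash instead of rjenkins) of why pools without hashpspool line up on the same acting sets:

```python
import hashlib

def mix(a, b):
    # Stand-in for Ceph's crush_hash32_2(CRUSH_HASH_RJENKINS1, a, b);
    # any well-mixing two-input integer hash shows the same effect.
    h = hashlib.sha256(f"{a}:{b}".encode()).digest()
    return int.from_bytes(h[:4], "little")

def placement_seed(pool_id, ps, pgp_num, hashpspool):
    seed = ps % pgp_num  # simplified stand-in for ceph_stable_mod()
    if hashpspool:
        # hashpspool mixes the pool id into the seed, so each pool
        # places its PGs independently of the others
        return mix(seed, pool_id)
    # legacy behavior: seeds collide whenever ps + pool_id matches
    return seed + pool_id

# Without hashpspool, PGs 3.17, 4.16, and 6.14 (ps in hex) share a seed:
legacy = {placement_seed(p, ps, n, False)
          for p, ps, n in [(3, 0x17, 4096), (4, 0x16, 32768), (6, 0x14, 4096)]}
print(len(legacy))  # 1 -- all three collapse to one seed, hence the same OSDs
```

For reference, the flag itself is toggled with `ceph osd pool set <pool> hashpspool true`, which immediately remaps every PG in that pool, as discussed above.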

> Since this will be a lot of data movement, I'm also concerned with the
> monitor store growing too large.  With over 1400 OSDs I'm seeing messages
> like "store is getting too big! 18310 MB >= 15360 MB" rather often these
> days...

Eh, just increase the warning threshold.  There is nothing wrong with a 
big store as long as you have space.
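For example (a sketch, assuming the option name `mon_data_size_warn`, which takes a byte count and defaults to 15 GiB), the threshold could be raised to ~20 GiB:

```shell
# Persistent: in ceph.conf on each monitor host, under [mon]:
#   mon data size warn = 21474836480

# Or injected at runtime on a running cluster:
ceph tell mon.* injectargs '--mon-data-size-warn=21474836480'
```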

Also, in the not-too-distant future, 90% of this data will disappear from 
the mons entirely (and become ephemeral ceph-mgr state).

sage
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com