Hi,
Thanks for your help!
We've run 'ceph osd reweight-by-utilization 105' again.
The cluster is stuck at 10387 active+clean, 237 active+remapped;
more info in the attachments.

--
Regards
Dominik

2014-02-04 Sage Weil <sage@xxxxxxxxxxx>:
> Hi,
>
> I spent a couple hours looking at your map because it did look like there
> was something wrong. After some experimentation and adding a bunch of
> improvements to osdmaptool to test the distribution, though, I think
> everything is working as expected. For pool 3, your map has a standard
> deviation in utilizations of ~8%, and we should expect ~9% for this number
> of PGs. For all pools, it is slightly higher (~9% vs expected ~8%).
> This is either just in the noise, or slightly confounded by the lack of
> the hashpspool flag on the pools (which slightly amplifies placement
> nonuniformity with multiple pools... not enough that it is worth changing
> anything, though).
>
> The bad news is that that order of standard deviation results in a pretty
> wide min/max range of 118 to 202 PGs. That seems a *bit* higher than what
> a perfectly random placement generates (I'm seeing a spread that is
> usually 50-70 PGs), but I think *that* is where the pool overlap (no
> hashpspool) is rearing its head; for just pool 3 the spread of 50 is
> about what is expected.
>
> Long story short: you have two options. One is increasing the number of
> PGs. Note that this helps but has diminishing returns (doubling PGs
> only takes you from ~8% to ~6% standard deviation, quadrupling to ~4%).
>
> The other is to use reweight-by-utilization. That is the best approach,
> IMO. I'm not sure why you were seeing PGs stuck in the remapped state
> after you did that, though, but I'm happy to dig into that too.
>
> BTW, the osdmaptool addition I was using to play with is here:
> https://github.com/ceph/ceph/pull/1178
>
> sage
>
>
> On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
>
>> In other words:
>> 1. we've got 3 racks (1 replica per rack)
>> 2. in every rack we have 3 hosts
>> 3. every host has 22 OSDs
>> 4. pg_num is a power of 2 for every pool
>> 5. we enabled "crush tunables optimal"
>> 6. on every machine we disabled 4 unused disks (osd out, osd reweight
>> 0 and osd rm)
>>
>> In pool ".rgw.buckets", one OSD has 105 PGs and another one (on the same
>> machine) has 144 PGs (37% more!).
>> Other pools have this problem too. This is not an efficient placement.
>>
>> --
>> Regards
>> Dominik
>>
>>
>> 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>> > Hi,
>> > For more info:
>> > crush: http://dysk.onet.pl/link/r4wGK
>> > osd_dump: http://dysk.onet.pl/link/I3YMZ
>> > pg_dump: http://dysk.onet.pl/link/4jkqM
>> >
>> > --
>> > Regards
>> > Dominik
>> >
>> > 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>> >> Hi,
>> >> Hmm, I think you mean summing up PGs from different pools on one OSD.
>> >> But for the single pool (.rgw.buckets) that holds almost all of my data,
>> >> the PG count per OSD also differs.
>> >> For example, 105 vs 144 PGs from pool .rgw.buckets; the first
>> >> corresponds to 52% disk usage, the second to 74%.
>> >>
>> >> --
>> >> Regards
>> >> Dominik
>> >>
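Sage's ~8% figure and the diminishing returns from adding PGs can be sanity-checked with a simple binomial model of purely random placement. The sketch below is only an approximation: it assumes 162 active OSDs (3 racks x 3 hosts x 18 in-use disks, derived from Dominik's description above) and ignores CRUSH's rack/host constraints.

    import math

    # Back-of-the-envelope model: treat each PG replica as landing on one
    # of N OSDs uniformly at random. This ignores CRUSH's rack/host
    # constraints, so it gives a rough expectation, not an exact prediction.

    def expected_pg_spread(pg_num, replicas, osds):
        placements = pg_num * replicas       # total PG replicas to place
        mean = placements / osds             # average PGs per OSD
        std = math.sqrt(placements * (1 / osds) * (1 - 1 / osds))
        return mean, std, std / mean

    # Assumed cluster size from the thread: 3 racks x 3 hosts x (22 - 4
    # disabled) OSDs = 162; pool 3 (.rgw.buckets) has pg_num 8192, size 3.
    for pg_num in (8192, 16384, 32768):
        mean, std, rel = expected_pg_spread(pg_num, 3, 162)
        print(f"pg_num {pg_num}: ~{mean:.0f} PGs/OSD, "
              f"stddev ~{std:.1f} ({rel:.1%})")

Under these assumptions the model gives roughly an 8% relative standard deviation at pg_num 8192, falling to about 6% when doubled and 4% when quadrupled, in line with the numbers quoted above.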
>> >> 2014-02-02 Sage Weil <sage@xxxxxxxxxxx>:
>> >>> It occurs to me that this (and other unexplained variance reports) could
>> >>> easily be the 'hashpspool' flag not being set. The old behavior had the
>> >>> misfeature where consecutive pools' PGs would 'line up' on the same OSDs,
>> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc. would map to the same nodes. This
>> >>> tends to 'amplify' any variance in the placement. The default is still to
>> >>> use the old behavior for compatibility (this will finally change in
>> >>> firefly).
>> >>>
>> >>> You can do
>> >>>
>> >>>   ceph osd pool set <poolname> hashpspool true
>> >>>
>> >>> to enable the new placement logic on an existing pool, but be warned that
>> >>> this will rebalance *all* of the data in the pool, which can be a very
>> >>> heavyweight operation...
>> >>>
>> >>> sage
>> >>>
>> >>>
>> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
>> >>>
>> >>>> Hi,
>> >>>> After scrubbing, almost all PGs have a (roughly) equal number of objects.
>> >>>> I found something else.
>> >>>> On one host, the PG count per OSD is:
>> >>>>
>> >>>> OSD with small (52%) disk usage:
>> >>>> count  pool
>> >>>>   105  3
>> >>>>    18  4
>> >>>>     3  5
>> >>>>
>> >>>> OSD with larger (74%) disk usage:
>> >>>>   144  3
>> >>>>    31  4
>> >>>>     2  5
>> >>>>
>> >>>> Pool 3 is .rgw.buckets (where almost all of the data is).
>> >>>> Pool 4 is .log, which holds no data.
>> >>>>
>> >>>> Shouldn't the PG count per OSD be roughly the same?
>> >>>> Or maybe the PG hashing is disrupted by the wrong PG count for pool
>> >>>> '4'? It has 1440 PGs (not a power of 2).
>> >>>>
>> >>>> ceph osd dump:
>> >>>> pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0 crash_replay_interval 45
>> >>>> pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
>> >>>> pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
>> >>>> pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner 0
>> >>>> pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
>> >>>> pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
>> >>>> pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
>> >>>> pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
>> >>>> pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28467 owner 18446744073709551615
>> >>>> pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28468 owner 18446744073709551615
>> >>>> pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner 18446744073709551615
>> >>>> pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 33487 owner 18446744073709551615
>> >>>> pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
>> >>>> pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 46912 owner 0
>> >>>>
>> >>>> --
>> >>>> Regards
>> >>>> Dominik
>> >>>>
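The per-OSD, per-pool counts shown above (105 vs 144 PGs from pool 3 on two OSDs of the same host) can be tallied straight from the pg dump rather than by hand. This is a minimal sketch that assumes the JSON layout of 'ceph pg dump' from this era (a top-level 'pg_stats' list with 'pgid' and 'acting' fields); other releases may nest or name these differently.

    import json
    import subprocess
    from collections import Counter

    # Count how many PGs from each pool land on each OSD. Field names
    # ('pg_stats', 'pgid', 'acting') are assumed from the JSON dump of
    # this era and may differ in other Ceph releases.
    dump = json.loads(subprocess.check_output(
        ["ceph", "pg", "dump", "--format", "json"]))

    per_osd = Counter()                      # (osd, pool) -> number of PGs
    for pg in dump["pg_stats"]:
        pool = pg["pgid"].split(".")[0]      # pgid looks like "3.1a2f"
        for osd in pg["acting"]:
            per_osd[(osd, pool)] += 1

    # Show the spread for pool 3 (.rgw.buckets), busiest OSDs first.
    pool3 = sorted(((count, osd) for (osd, pool), count in per_osd.items()
                    if pool == "3"), reverse=True)
    for count, osd in pool3:
        print(f"osd.{osd}: {count} PGs from pool 3")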
>> >>>> 2014-02-01 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>> >>>> > Hi,
>> >>>> >> Did you bump pgp_num as well?
>> >>>> > Yes.
>> >>>> >
>> >>>> > See: http://dysk.onet.pl/link/BZ968
>> >>>> >
>> >>>> >> 25% of the pools are two times smaller than the others.
>> >>>> > This changes after scrubbing.
>> >>>> >
>> >>>> > --
>> >>>> > Regards
>> >>>> > Dominik
>> >>>> >
>> >>>> > 2014-02-01 Kyle Bader <kyle.bader@xxxxxxxxx>:
>> >>>> >>
>> >>>> >>> Changing pg_num for .rgw.buckets to a power of 2 and 'crush tunables
>> >>>> >>> optimal' didn't help :(
>> >>>> >>
>> >>>> >> Did you bump pgp_num as well? The split PGs will stay in place until
>> >>>> >> pgp_num is bumped too; if you do this, be prepared for (potentially a
>> >>>> >> lot of) data movement.
>> >>>> >
>> >>>> >
>> >>>> > --
>> >>>> > Regards
>> >>>> > Dominik
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Regards
>> >>>> Dominik
>> >>
>> >>
>> >> --
>> >> Regards
>> >> Dominik
>> >
>> >
>> > --
>> > Regards
>> > Dominik
>>
>>
>> --
>> Regards
>> Dominik

--
Regards
Dominik

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
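On the pg_num/pgp_num point raised in the thread: the plain-text 'ceph osd dump' output quoted earlier makes it easy to scan every pool for the two suspects discussed here, a pg_num that is not a power of two (like pool 4 with 1440 PGs) and a pgp_num that lags behind pg_num after a split. A small sketch along those lines, parsing that same plain-text format:

    import re
    import subprocess

    def is_power_of_two(n):
        return n > 0 and (n & (n - 1)) == 0

    # Parse the plain-text 'ceph osd dump' pool lines, e.g.
    # "pool 3 '.rgw.buckets' ... pg_num 8192 pgp_num 8192 ..."
    dump = subprocess.check_output(["ceph", "osd", "dump"], text=True)
    pool_re = re.compile(
        r"^pool (\d+) '([^']*)' .*?pg_num (\d+) pgp_num (\d+)", re.M)

    for pool_id, name, pg_num, pgp_num in pool_re.findall(dump):
        pg_num, pgp_num = int(pg_num), int(pgp_num)
        if pgp_num != pg_num:
            print(f"pool {pool_id} '{name}': pgp_num {pgp_num} "
                  f"lags pg_num {pg_num}")
        if not is_power_of_two(pg_num):
            print(f"pool {pool_id} '{name}': pg_num {pg_num} "
                  f"is not a power of two")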