Re: poor data distribution

Hi,
Does this problem (PGs stuck in active+remapped after
reweight-by-utilization) affect all Ceph configurations, or only
specific ones?
If specific: what is the cause in my case? Is it the CRUSH
configuration (cluster architecture, CRUSH tunables, ...), the cluster
size, mistakes in the architecture design, or something else?

Second question:
The distribution of PGs across OSDs is better for large clusters (where
pg_num is higher). Is it possible, for small clusters, to change the
CRUSH distribution algorithm to something more uniform ("more linear")?
(I realize that it would be less efficient.)
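A quick way to measure how uneven the current placement is would be to
count PGs per OSD for the data pool, e.g. (a sketch only; it assumes jq
is available and that `ceph pg dump --format json` exposes the per-PG
records as pg_stats[] with pgid/acting fields -- the exact JSON layout
varies between releases):

  # count how many PGs of pool 3 (.rgw.buckets) each OSD is acting for
  ceph pg dump --format json 2>/dev/null \
    | jq -r '.pg_stats[] | select(.pgid | startswith("3.")) | .acting[]' \
    | sort -n | uniq -c | sort -n

The last lines of the output are the most heavily loaded OSDs.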

--
Regards
Dominik

2014-02-06 21:31 GMT+01:00 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> Great!
> Thanks for Your help.
>
> --
> Regards
> Dominik
>
> 2014-02-06 21:10 GMT+01:00 Sage Weil <sage@xxxxxxxxxxx>:
>> On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
>>> Hi,
>>> Thanks !!
>>> Can You suggest any workaround for now?
>>
>> You can adjust the crush weights on the overfull nodes slightly.  You'd
>> need to do it by hand, but that will do the trick.  For example,
>>
>>   ceph osd crush reweight osd.123 .96
>>
>> (if the current weight is 1.0).
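>> If several OSDs are overfull, the same thing can be scripted, e.g. (a
>> sketch only; the osd ids and the 0.96 target weight are placeholders,
>> and `ceph osd tree` is only used to read the current weights first):
>>
>>   ceph osd tree                    # check current crush weights
>>   for osd in osd.12 osd.87; do     # placeholder ids for overfull OSDs
>>       ceph osd crush reweight $osd 0.96
>>   done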
>>
>> sage
>>
>>>
>>> --
>>> Regards
>>> Dominik
>>>
>>>
>>> 2014-02-06 18:39 GMT+01:00 Sage Weil <sage@xxxxxxxxxxx>:
>>> > Hi,
>>> >
>>> > Just an update here.  Another user saw this and after playing with it I
>>> > identified a problem with CRUSH.  There is a branch outstanding
>>> > (wip-crush) that is pending review, but it's not a quick fix because of
>>> > compatibility issues.
>>> >
>>> > sage
>>> >
>>> >
>>> > On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
>>> >
>>> >> Hi,
>>> >> Maybe this info can help find what is wrong.
>>> >> For one PG (3.1e4a) which is active+remapped:
>>> >> { "state": "active+remapped",
>>> >>   "epoch": 96050,
>>> >>   "up": [
>>> >>         119,
>>> >>         69],
>>> >>   "acting": [
>>> >>         119,
>>> >>         69,
>>> >>         7],
>>> >> Logs:
>>> >> On osd.7:
>>> >> 2014-02-04 09:45:54.966913 7fa618afe700  1 osd.7 pg_epoch: 94460
>>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>>> >> n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=-1
>>> >> lpr=94460 pi=92546-94459/5 lcod 94459'207003 inactive NOTIFY]
>>> >> state<Start>: transitioning to Stray
>>> >> 2014-02-04 09:45:55.781278 7fa6172fb700  1 osd.7 pg_epoch: 94461
>>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>>> >> n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
>>> >> [119,69]/[119,69,7,142] r=2 lpr=94461 pi=92546-94460/6 lcod
>>> >> 94459'207003 remapped NOTIFY] state<Start>: transitioning to Stray
>>> >> 2014-02-04 09:49:01.124510 7fa618afe700  1 osd.7 pg_epoch: 94495
>>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
>>> >> n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
>>> >> r=2 lpr=94495 pi=92546-94494/7 lcod 94459'207003 remapped]
>>> >> state<Start>: transitioning to Stray
>>> >>
>>> >> On osd.119:
>>> >> 2014-02-04 09:45:54.981707 7f37f07c5700  1 osd.119 pg_epoch: 94460
>>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>>> >> n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=0
>>> >> lpr=94460 pi=93485-94459/1 mlcod 0'0 inactive] state<Start>:
>>> >> transitioning to Primary
>>> >> 2014-02-04 09:45:55.805712 7f37ecfbe700  1 osd.119 pg_epoch: 94461
>>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>>> >> n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
>>> >> [119,69]/[119,69,7,142] r=0 lpr=94461 pi=93485-94460/2 mlcod 0'0
>>> >> remapped] state<Start>: transitioning to Primary
>>> >> 2014-02-04 09:45:56.794015 7f37edfc0700  0 log [INF] : 3.1e4a
>>> >> restarting backfill on osd.69 from (0'0,0'0] MAX to 94459'207004
>>> >> 2014-02-04 09:49:01.156627 7f37ef7c3700  1 osd.119 pg_epoch: 94495
>>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
>>> >> n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
>>> >> r=0 lpr=94495 pi=94461-94494/1 mlcod 0'0 remapped] state<Start>:
>>> >> transitioning to Primary
>>> >>
>>> >> On osd.69:
>>> >> 2014-02-04 09:45:56.845695 7f2231372700  1 osd.69 pg_epoch: 94462
>>> >> pg[3.1e4a( empty local-les=0 n=0 ec=4 les/c 93486/93486
>>> >> 94460/94461/92233) [119,69]/[119,69,7,142] r=1 lpr=94462
>>> >> pi=93485-94460/2 inactive] state<Start>: transitioning to Stray
>>> >> 2014-02-04 09:49:01.153695 7f2229b63700  1 osd.69 pg_epoch: 94495
>>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
>>> >> n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
>>> >> r=1 lpr=94495 pi=93485-94494/3 remapped] state<Start>: transitioning
>>> >> to Stray
>>> >>
>>> >> pg query recovery state:
>>> >>   "recovery_state": [
>>> >>         { "name": "Started\/Primary\/Active",
>>> >>           "enter_time": "2014-02-04 09:49:02.070724",
>>> >>           "might_have_unfound": [],
>>> >>           "recovery_progress": { "backfill_target": -1,
>>> >>               "waiting_on_backfill": 0,
>>> >>               "backfill_pos": "0\/\/0\/\/-1",
>>> >>               "backfill_info": { "begin": "0\/\/0\/\/-1",
>>> >>                   "end": "0\/\/0\/\/-1",
>>> >>                   "objects": []},
>>> >>               "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>>> >>                   "end": "0\/\/0\/\/-1",
>>> >>                   "objects": []},
>>> >>               "backfills_in_flight": [],
>>> >>               "pull_from_peer": [],
>>> >>               "pushing": []},
>>> >>           "scrub": { "scrubber.epoch_start": "77502",
>>> >>               "scrubber.active": 0,
>>> >>               "scrubber.block_writes": 0,
>>> >>               "scrubber.finalizing": 0,
>>> >>               "scrubber.waiting_on": 0,
>>> >>               "scrubber.waiting_on_whom": []}},
>>> >>         { "name": "Started",
>>> >>           "enter_time": "2014-02-04 09:49:01.156626"}]}
>>> >>
>>> >> ---
>>> >> Regards
>>> >> Dominik
>>> >>
>>> >> 2014-02-04 12:09 GMT+01:00 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>>> >> > Hi,
>>> >> > Thanks for Your help !!
>>> >> > We've run 'ceph osd reweight-by-utilization 105' again.
>>> >> > The cluster is stuck at 10387 active+clean, 237 active+remapped;
>>> >> > More info in attachments.
>>> >> >
>>> >> > --
>>> >> > Regards
>>> >> > Dominik
>>> >> >
>>> >> >
>>> >> > 2014-02-04 Sage Weil <sage@xxxxxxxxxxx>:
>>> >> >> Hi,
>>> >> >>
>>> >> >> I spent a couple hours looking at your map because it did look like there
>>> >> >> was something wrong.  After some experimentation and adding a bunch of
>>> >> >> improvements to osdmaptool to test the distribution, though, I think
>>> >> >> everything is working as expected.  For pool 3, your map has a standard
>>> >> >> deviation in utilizations of ~8%, and we should expect ~9% for this number
>>> >> >> of PGs.  For all pools, it is slightly higher (~9% vs expected ~8%).
>>> >> >> This is either just in the noise, or slightly confounded by the lack of
>>> >> >> the hashpspool flag on the pools (which slightly amplifies placement
>>> >> >> nonuniformity with multiple pools... not enough that it is worth changing
>>> >> >> anything though).
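>>> >> >> (For a rough sense of where figures like ~8-9% come from: with random
>>> >> >> placement the relative spread scales like 1/sqrt(mean PGs per OSD).
>>> >> >> Assuming the ~162 in-use OSDs described later in this thread (9 hosts x
>>> >> >> 18 active OSDs) and 8192 PGs x 3 replicas for pool 3:
>>> >> >>
>>> >> >>   awk 'BEGIN { pgs = 8192 * 3 / 162; print 1 / sqrt(pgs) }'   # ~0.08
>>> >> >>
>>> >> >> i.e. roughly an 8% standard deviation, in line with the numbers above.)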
>>> >> >>
>>> >> >> The bad news is that that order of standard deviation results in a pretty
>>> >> >> wide min/max range of 118 to 202 PGs.  That seems a *bit* higher than what
>>> >> >> a perfectly random placement generates (I'm seeing a spread that is
>>> >> >> usually 50-70 PGs), but I think *that* is where the pool overlap (no
>>> >> >> hashpspool) is rearing its head; for just pool 3 the spread of 50 is
>>> >> >> about what is expected.
>>> >> >>
>>> >> >> Long story short: you have two options.  One is increasing the number of
>>> >> >> PGs.  Note that this helps but has diminishing returns (doubling PGs
>>> >> >> only takes you from ~8% to ~6% standard deviation, quadrupling to ~4%).
>>> >> >>
>>> >> >> The other is to use reweight-by-utilization.  That is the best approach,
>>> >> >> IMO.  I'm not sure why you were seeing PGs stuck in the remapped state
>>> >> >> after you did that, though, but I'm happy to dig into that too.
>>> >> >>
>>> >> >> BTW, the osdmaptool addition I was using to play with is here:
>>> >> >>         https://github.com/ceph/ceph/pull/1178
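>>> >> >> The kind of check it enables looks roughly like this (the --test-map-pgs
>>> >> >> and --pool flags are what that pull request is assumed to add; confirm
>>> >> >> against `osdmaptool --help` in the build you are running):
>>> >> >>
>>> >> >>   ceph osd getmap -o /tmp/osdmap                   # export the current osdmap
>>> >> >>   osdmaptool /tmp/osdmap --test-map-pgs --pool 3   # simulate placement for pool 3
>>> >> >>
>>> >> >> which should report the min/max/avg PGs per OSD and the standard
>>> >> >> deviation quoted above.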
>>> >> >>
>>> >> >> sage
>>> >> >>
>>> >> >>
>>> >> >> On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
>>> >> >>
>>> >> >>> In other words,
>>> >> >>> 1. we've got 3 racks ( 1 replica per rack )
>>> >> >>> 2. in every rack we have 3 hosts
>>> >> >>> 3. every host has 22 OSD's
>>> >> >>> 4. all pg_num's are 2^n for every pool
>>> >> >>> 5. we enabled "crush tunables optimal".
>>> >> >>> 6. on every machine we disabled 4 unused disks (osd out, osd reweight
>>> >> >>> 0 and osd rm)
>>> >> >>>
>>> >> >>> Pool ".rgw.buckets": one OSD has 105 PGs and another one (on the same
>>> >> >>> machine) has 144 PGs (37% more!).
>>> >> >>> Other pools also have this problem. It's not an efficient placement.
>>> >> >>>
>>> >> >>> --
>>> >> >>> Regards
>>> >> >>> Dominik
>>> >> >>>
>>> >> >>>
>>> >> >>> 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>>> >> >>> > Hi,
>>> >> >>> > For more info:
>>> >> >>> >   crush: http://dysk.onet.pl/link/r4wGK
>>> >> >>> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
>>> >> >>> >   pg_dump: http://dysk.onet.pl/link/4jkqM
>>> >> >>> >
>>> >> >>> > --
>>> >> >>> > Regards
>>> >> >>> > Dominik
>>> >> >>> >
>>> >> >>> > 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>>> >> >>> >> Hi,
>>> >> >>> >> Hmm,
>>> >> >>> >> I think you mean summing PGs from different pools on one OSD.
>>> >> >>> >> But for a single pool (.rgw.buckets), where I have almost all of my data,
>>> >> >>> >> the PG count on OSDs also differs.
>>> >> >>> >> For example, 105 vs 144 PGs from pool .rgw.buckets. In the first case
>>> >> >>> >> disk usage is 52%, in the second 74%.
>>> >> >>> >>
>>> >> >>> >> --
>>> >> >>> >> Regards
>>> >> >>> >> Dominik
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> 2014-02-02 Sage Weil <sage@xxxxxxxxxxx>:
>>> >> >>> >>> It occurs to me that this (and other unexplained variance reports) could
>>> >> >>> >>> easily be the 'hashpspool' flag not being set.  The old behavior had the
>>> >> >>> >>> misfeature where consecutive pools' PGs would 'line up' on the same OSDs,
>>> >> >>> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  This
>>> >> >>> >>> tends to 'amplify' any variance in the placement.  The default is still to
>>> >> >>> >>> use the old behavior for compatibility (this will finally change in
>>> >> >>> >>> firefly).
>>> >> >>> >>>
>>> >> >>> >>> You can do
>>> >> >>> >>>
>>> >> >>> >>>  ceph osd pool set <poolname> hashpspool true
>>> >> >>> >>>
>>> >> >>> >>> to enable the new placement logic on an existing pool, but be warned that
>>> >> >>> >>> this will rebalance *all* of the data in the pool, which can be a very
>>> >> >>> >>> heavyweight operation...
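>>> >> >>> >>> To check which pools already have the flag before deciding, something
>>> >> >>> >>> like this should work (it assumes the release prints a flags field on
>>> >> >>> >>> the pool lines of osd dump; no match means the old behavior is in use):
>>> >> >>> >>>
>>> >> >>> >>>  ceph osd dump | grep '^pool' | grep hashpspool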
>>> >> >>> >>>
>>> >> >>> >>> sage
>>> >> >>> >>>
>>> >> >>> >>>
>>> >> >>> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
>>> >> >>> >>>
>>> >> >>> >>>> Hi,
>>> >> >>> >>>> After scrubbing, almost all PGs have roughly equal numbers of objects.
>>> >> >>> >>>> I found something else.
>>> >> >>> >>>> On one host, the PG count on OSDs:
>>> >> >>> >>>> OSD with small (52%) disk usage:
>>> >> >>> >>>> count, pool
>>> >> >>> >>>>     105 3
>>> >> >>> >>>>      18 4
>>> >> >>> >>>>       3 5
>>> >> >>> >>>>
>>> >> >>> >>>> OSD with larger (74%) disk usage:
>>> >> >>> >>>>     144 3
>>> >> >>> >>>>      31 4
>>> >> >>> >>>>       2 5
>>> >> >>> >>>>
>>> >> >>> >>>> Pool 3 is .rgw.buckets (which holds almost all of the data).
>>> >> >>> >>>> Pool 4 is .log, which has no data.
>>> >> >>> >>>>
>>> >> >>> >>>> Shouldn't the count of PGs be the same per OSD?
>>> >> >>> >>>> Or maybe the PG hash algorithm is disrupted by the wrong count of PGs
>>> >> >>> >>>> for pool '4': it has 1440 PGs (which is not a power of 2).
>>> >> >>> >>>>
>>> >> >>> >>>> ceph osd dump:
>>> >> >>> >>>> pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
>>> >> >>> >>>> crash_replay_interval 45
>>> >> >>> >>>> pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
>>> >> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
>>> >> >>> >>>> pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
>>> >> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
>>> >> >>> >>>> pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
>>> >> >>> >>>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
>>> >> >>> >>>> 0
>>> >> >>> >>>> pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
>>> >> >>> >>>> pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
>>> >> >>> >>>> pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
>>> >> >>> >>>> pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
>>> >> >>> >>>> pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28467 owner
>>> >> >>> >>>> 18446744073709551615
>>> >> >>> >>>> pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28468 owner
>>> >> >>> >>>> 18446744073709551615
>>> >> >>> >>>> pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0
>>> >> >>> >>>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner
>>> >> >>> >>>> 18446744073709551615
>>> >> >>> >>>> pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 33487 owner
>>> >> >>> >>>> 18446744073709551615
>>> >> >>> >>>> pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash
>>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
>>> >> >>> >>>> pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
>>> >> >>> >>>> pg_num 8 pgp_num 8 last_change 46912 owner 0
>>> >> >>> >>>>
>>> >> >>> >>>> --
>>> >> >>> >>>> Regards
>>> >> >>> >>>> Dominik
>>> >> >>> >>>>
>>> >> >>> >>>> 2014-02-01 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>>> >> >>> >>>> > Hi,
>>> >> >>> >>>> >> Did you bump pgp_num as well?
>>> >> >>> >>>> > Yes.
>>> >> >>> >>>> >
>>> >> >>> >>>> > See: http://dysk.onet.pl/link/BZ968
>>> >> >>> >>>> >
>>> >> >>> >>>> >> 25% of the pools are two times smaller than the others.
>>> >> >>> >>>> > This changes after scrubbing.
>>> >> >>> >>>> >
>>> >> >>> >>>> > --
>>> >> >>> >>>> > Regards
>>> >> >>> >>>> > Dominik
>>> >> >>> >>>> >
>>> >> >>> >>>> > 2014-02-01 Kyle Bader <kyle.bader@xxxxxxxxx>:
>>> >> >>> >>>> >>
>>> >> >>> >>>> >>> Changing pg_num for .rgw.buckets to a power of 2 and 'crush tunables
>>> >> >>> >>>> >>> optimal' didn't help :(
>>> >> >>> >>>> >>
>>> >> >>> >>>> >> Did you bump pgp_num as well? The split pgs will stay in place until pgp_num
>>> >> >>> >>>> >> is bumped as well; if you do this, be prepared for (potentially a lot
>>> >> >>> >>>> >> of) data movement.
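>>> >> >>> >>>> >> For example (a sketch only; 8192 matches the pg_num shown for
>>> >> >>> >>>> >> .rgw.buckets elsewhere in this thread):
>>> >> >>> >>>> >>
>>> >> >>> >>>> >>   ceph osd pool set .rgw.buckets pgp_num 8192
>>> >> >>> >>>> >>
>>> >> >>> >>>> >> and then watch `ceph -s` while the resulting data movement settles.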
>>> >> >>> >>>> >
>>> >> >>> >>>> >
>>> >> >>> >>>> >
>>> >> >>> >>>> > --
>>> >> >>> >>>> > Regards
>>> >> >>> >>>> > Dominik
>>> >> >>> >>>>
>>> >> >>> >>>>
>>> >> >>> >>>>
>>> >> >>> >>>> --
>>> >> >>> >>>> Regards
>>> >> >>> >>>> Dominik
>>> >> >>> >>>>
>>> >> >>> >>>>
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> --
>>> >> >>> >> Regards
>>> >> >>> >> Dominik
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > --
>>> >> >>> > Regards
>>> >> >>> > Dominik
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> --
>>> >> >>> Regards
>>> >> >>> Dominik
>>> >> >>>
>>> >> >>>
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Regards
>>> >> > Dominik
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Regards
>>> >> Dominik
>>> >>
>>> >>
>>>
>>>
>>>
>>> --
>>> Regards
>>> Dominik
>>>
>>>
>
>
>
> --
> Regards
> Dominik



-- 
Regards
Dominik
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



