Re: poor data distribution

Sage Weil <sage@xxxxxxxxxxx> · Mon, 3 Feb 2014 10:34:12 -0800 (PST)

On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
> Sorry, I forgot to tell you something that may be important.
> We ran:
> ceph osd reweight-by-utilization 105 (as I wrote in my second mail),
> and after the cluster got stuck on 'active+remapped' PGs we had to reweight
> all the reweighted OSDs back to 1.0.
> This osdmap is not from an active+clean cluster; rebalancing is in progress.
> If you need it, I'll send you an osdmap from a clean cluster. Let me know.

A clean osdmap would be helpful.

Thanks!
sage

> 
> --
> Regards
> Dominik
> 
> 
> 
> 
> 2014-02-03 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> > Hi,
> > Thanks.
> > It is in the attachment.
> >
> >
> > --
> > Regards
> > Dominik
> >
> >
> > 2014-02-03 Sage Weil <sage@xxxxxxxxxxx>:
> >> Hi Dominik,
> >>
> >> Can you send a copy of your osdmap?
> >>
> >>  ceph osd getmap -o /tmp/osdmap
> >>
> >> (Can send it off list if the IP addresses are sensitive.)  I'm tweaking
> >> osdmaptool to have a --test-map-pgs option to look at this offline.
> >>
> >> Thanks!
> >> sage
> >>
> >>
> >> On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
> >>
> >>> In other words:
> >>> 1. we have 3 racks (1 replica per rack)
> >>> 2. every rack has 3 hosts
> >>> 3. every host has 22 OSDs
> >>> 4. pg_num is 2^n for every pool
> >>> 5. we enabled "crush tunables optimal"
> >>> 6. on every machine we disabled 4 unused disks (osd out, osd reweight
> >>> 0 and osd rm)
> >>>
> >>> Pool ".rgw.buckets": one OSD has 105 PGs and another one (on the same
> >>> machine) has 144 PGs (37% more!).
> >>> Other pools have this problem too. This is not efficient placement.
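The 37% figure, and how closely it tracks the disk usage reported elsewhere in this thread (52% vs 74%), can be checked with a few lines of arithmetic. This is only a sanity check on the quoted numbers; the variable names are illustrative.

```python
# PG counts for pool .rgw.buckets on two OSDs of the same host (from the thread).
pgs_low, pgs_high = 105, 144
# Disk usage reported for the same two OSDs, in percent (from the thread).
used_low, used_high = 52, 74

pg_excess = (pgs_high / pgs_low - 1) * 100      # relative PG surplus, percent
used_excess = (used_high / used_low - 1) * 100  # relative usage surplus, percent

print(f"PG count:   {pg_excess:.0f}% more")
print(f"Disk usage: {used_excess:.0f}% more")
```

The two ratios come out close to each other (about 37% and 42%), which is at least consistent with the PG imbalance in this one pool accounting for most of the usage gap.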
> >>>
> >>> --
> >>> Regards
> >>> Dominik
> >>>
> >>>
> >>> 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> >>> > Hi,
> >>> > For more info:
> >>> >   crush: http://dysk.onet.pl/link/r4wGK
> >>> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
> >>> >   pg_dump: http://dysk.onet.pl/link/4jkqM
> >>> >
> >>> > --
> >>> > Regards
> >>> > Dominik
> >>> >
> >>> > 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> >>> >> Hi,
> >>> >> Hmm,
> >>> >> I think you are talking about the sum of PGs from different pools on one OSD.
> >>> >> But even for a single pool (.rgw.buckets), which holds almost all of my data,
> >>> >> the PG count per OSD also differs.
> >>> >> For example, 105 vs 144 PGs from pool .rgw.buckets. The first OSD is at
> >>> >> 52% disk usage, the second at 74%.
> >>> >>
> >>> >> --
> >>> >> Regards
> >>> >> Dominik
> >>> >>
> >>> >>
> >>> >> 2014-02-02 Sage Weil <sage@xxxxxxxxxxx>:
> >>> >>> It occurs to me that this (and other unexplained variance reports) could
> >>> >>> easily be the 'hashpspool' flag not being set.  The old behavior had the
> >>> >>> misfeature where consecutive pools' PGs would 'line up' on the same OSDs,
> >>> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  This
> >>> >>> tends to 'amplify' any variance in the placement.  The default is still to
> >>> >>> use the old behavior for compatibility (this will finally change in
> >>> >>> firefly).
> >>> >>>
> >>> >>> You can do
> >>> >>>
> >>> >>>  ceph osd pool set <poolname> hashpspool true
> >>> >>>
> >>> >>> to enable the new placement logic on an existing pool, but be warned that
> >>> >>> this will rebalance *all* of the data in the pool, which can be a very
> >>> >>> heavyweight operation...
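To make the 'line up' effect concrete, here is a rough sketch of the two seed calculations, assuming the legacy placement seed is the PG number plus the pool id and the hashpspool seed is a hash of the two together. The mix() function is a stand-in built on hashlib, not Ceph's rjenkins hash, so the new-style values will not match a real cluster; only the collision pattern matters.

```python
import hashlib

def old_seed(pool: int, ps: int) -> int:
    # Legacy behaviour (assumed): seed is simply PG number plus pool id.
    return ps + pool

def mix(ps: int, pool: int) -> int:
    # Stand-in 32-bit hash (NOT Ceph's rjenkins); illustrative only.
    digest = hashlib.sha256(f"{ps}:{pool}".encode()).digest()
    return int.from_bytes(digest[:4], "little")

def new_seed(pool: int, ps: int) -> int:
    # hashpspool behaviour (assumed): pool id and PG number are hashed together.
    return mix(ps, pool)

# PG ids are written pool.ps with ps in hex: 1.7, 2.6, 3.5, 4.4
pgs = [(1, 0x7), (2, 0x6), (3, 0x5), (4, 0x4)]
old = [old_seed(pool, ps) for pool, ps in pgs]
new = [new_seed(pool, ps) for pool, ps in pgs]
print(old)  # every entry is 8, so CRUSH sees the same input for all four PGs
print(new)  # four unrelated values
```

With identical seeds, pools that share a CRUSH rule place those PGs on the same OSDs, so an OSD that is unlucky in one pool tends to be unlucky in all of them.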
> >>> >>>
> >>> >>> sage
> >>> >>>
> >>> >>>
> >>> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
> >>> >>>
> >>> >>>> Hi,
> >>> >>>> After scrubbing, almost all PGs have roughly the same number of objects.
> >>> >>>> I found something else.
> >>> >>>> PG count per OSD on one host:
> >>> >>>> OSD with lower (52%) disk usage:
> >>> >>>> count, pool
> >>> >>>>     105 3
> >>> >>>>      18 4
> >>> >>>>       3 5
> >>> >>>>
> >>> >>>> OSD with higher (74%) disk usage:
> >>> >>>> count, pool
> >>> >>>>     144 3
> >>> >>>>      31 4
> >>> >>>>       2 5
> >>> >>>>
> >>> >>>> Pool 3 is .rgw.buckets (which holds almost all of the data).
> >>> >>>> Pool 4 is .log, which holds no data.
> >>> >>>>
> >>> >>>> Shouldn't the PG count be the same on every OSD?
> >>> >>>> Or maybe the PG hash algorithm is disrupted by the wrong PG count for pool
> >>> >>>> '4'. It has 1440 PGs (this is not a power of 2).
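On the power-of-2 question: for the mapping of objects to PGs, a non-power-of-2 pg_num mainly makes some PGs cover twice as much of the hash space as others. The sketch below is a Python transcription of the ceph_stable_mod helper, under the assumption that object hashes are spread uniformly. It says nothing about how many PGs land on each OSD, which is decided by CRUSH.

```python
from collections import Counter

def stable_mod(x: int, b: int, bmask: int) -> int:
    # Python rendering of ceph_stable_mod: fold x into [0, b) so that
    # most mappings stay put as b grows towards the next power of two.
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)

pg_num = 1440                                   # pool 4 ('.log') in this thread
bmask = (1 << (pg_num - 1).bit_length()) - 1    # 2047: next power of two, minus 1

# Walk one full period of hash values and count how many land in each PG.
hits = Counter(stable_mod(x, pg_num, bmask) for x in range(bmask + 1))
share = Counter(hits.values())
print(share)  # hash slots per PG -> number of PGs with that many slots
```

For pg_num 1440 this gives 608 PGs with two slots and 832 with one, so about 42% of the PGs would be expected to hold roughly twice as many objects as the rest. With a power-of-2 pg_num every PG gets exactly one slot.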
> >>> >>>>
> >>> >>>> ceph osd dump:
> >>> >>>> pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
> >>> >>>> crash_replay_interval 45
> >>> >>>> pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
> >>> >>>> pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
> >>> >>>> pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
> >>> >>>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
> >>> >>>> 0
> >>> >>>> pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
> >>> >>>> pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
> >>> >>>> pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
> >>> >>>> pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
> >>> >>>> pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28467 owner
> >>> >>>> 18446744073709551615
> >>> >>>> pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28468 owner
> >>> >>>> 18446744073709551615
> >>> >>>> pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0
> >>> >>>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner
> >>> >>>> 18446744073709551615
> >>> >>>> pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 33487 owner
> >>> >>>> 18446744073709551615
> >>> >>>> pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash
> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
> >>> >>>> pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
> >>> >>>> pg_num 8 pgp_num 8 last_change 46912 owner 0
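A quick way to spot the odd pools in a dump like the one above is to parse each pool line and check that pg_num is a power of two and equals pgp_num. This is a small sketch that assumes the line format shown here, with the lines joined back together after mail wrapping. The regular expression and the check() helper are hypothetical, and other Ceph releases may print the pool lines differently.

```python
import re

# Three pool lines copied from the dump in this thread, unwrapped.
DUMP = """\
pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner 0
pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
"""

# Hypothetical pattern for the pool lines as they appear in this thread.
POOL_RE = re.compile(r"^pool (\d+) '([^']*)' .* pg_num (\d+) pgp_num (\d+)")

def check(dump: str) -> list[str]:
    """Return one warning per suspicious PG setting found in the dump."""
    warnings = []
    for line in dump.splitlines():
        m = POOL_RE.match(line)
        if not m:
            continue
        pool_id, name = int(m.group(1)), m.group(2)
        pg_num, pgp_num = int(m.group(3)), int(m.group(4))
        if pg_num & (pg_num - 1):          # non-zero when not a power of two
            warnings.append(f"pool {pool_id} '{name}': pg_num {pg_num} is not a power of 2")
        if pg_num != pgp_num:              # split PGs not yet re-placed
            warnings.append(f"pool {pool_id} '{name}': pgp_num {pgp_num} != pg_num {pg_num}")
    return warnings

for w in check(DUMP):
    print(w)
```

On the three sample lines only pool 4 ('.log') is flagged, for its pg_num of 1440.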
> >>> >>>>
> >>> >>>> --
> >>> >>>> Regards
> >>> >>>> Dominik
> >>> >>>>
> >>> >>>> 2014-02-01 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> >>> >>>> > Hi,
> >>> >>>> >> Did you bump pgp_num as well?
> >>> >>>> > Yes.
> >>> >>>> >
> >>> >>>> > See: http://dysk.onet.pl/link/BZ968
> >>> >>>> >
> >>> >>>> >> 25% pools is two times smaller from other.
> >>> >>>> > This changes after scrubbing.
> >>> >>>> >
> >>> >>>> > --
> >>> >>>> > Regards
> >>> >>>> > Dominik
> >>> >>>> >
> >>> >>>> > 2014-02-01 Kyle Bader <kyle.bader@xxxxxxxxx>:
> >>> >>>> >>
> >>> >>>> >>> Changing pg_num for .rgw.buckets to a power of 2 and setting 'crush
> >>> >>>> >>> tunables optimal' didn't help :(
> >>> >>>> >>
> >>> >>>> >> Did you bump pgp_num as well? The split PGs will stay in place until pgp_num
> >>> >>>> >> is bumped as well; if you do this, be prepared for (potentially lots of) data
> >>> >>>> >> movement.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com