Hi Dominik,

Can you send a copy of your osdmap?

  ceph osd getmap -o /tmp/osdmap

(You can send it off list if the IP addresses are sensitive.)

I'm tweaking osdmaptool to have a --test-map-pgs option to look at this
offline.

Thanks!
sage

On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
> In other words:
> 1. we have 3 racks (1 replica per rack)
> 2. every rack holds 3 hosts
> 3. every host has 22 OSDs
> 4. all pg_num values are 2^n for every pool
> 5. we enabled "crush tunables optimal"
> 6. on every machine we disabled 4 unused disks (osd out, osd reweight 0 and osd rm)
>
> In pool ".rgw.buckets" one OSD has 105 PGs while another one (on the same
> machine) has 144 PGs (37% more!).
> Other pools have the same problem. The placement is not efficient.
>
> --
> Regards
> Dominik
>
> 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> > Hi,
> > For more info:
> > crush: http://dysk.onet.pl/link/r4wGK
> > osd_dump: http://dysk.onet.pl/link/I3YMZ
> > pg_dump: http://dysk.onet.pl/link/4jkqM
> >
> > --
> > Regards
> > Dominik
> >
> > 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> >> Hi,
> >> Hmm, I think you mean summing up PGs from different pools on one OSD.
> >> But even for a single pool (.rgw.buckets), which holds almost all of my data,
> >> the PG count per OSD differs.
> >> For example, 105 vs 144 PGs from pool .rgw.buckets; in the first case disk
> >> usage is 52%, in the second 74%.
> >>
> >> --
> >> Regards
> >> Dominik
> >>
> >> 2014-02-02 Sage Weil <sage@xxxxxxxxxxx>:
> >>> It occurs to me that this (and other unexplained variance reports) could
> >>> easily be the 'hashpspool' flag not being set. The old behavior had the
> >>> misfeature where consecutive pools' PGs would 'line up' on the same OSDs,
> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc. would map to the same nodes. This
> >>> tends to 'amplify' any variance in the placement. The default is still to
> >>> use the old behavior for compatibility (this will finally change in
> >>> firefly).
> >>>
> >>> You can do
> >>>
> >>>   ceph osd pool set <poolname> hashpspool true
> >>>
> >>> to enable the new placement logic on an existing pool, but be warned that
> >>> this will rebalance *all* of the data in the pool, which can be a very
> >>> heavyweight operation...
> >>>
> >>> sage
> >>>
> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
> >>>> Hi,
> >>>> After scrubbing, almost all PGs have a roughly equal number of objects.
> >>>> I found something else.
> >>>> On one host the PG count per OSD is:
> >>>>
> >>>> OSD with small (52%) disk usage:
> >>>>   count  pool
> >>>>     105  3
> >>>>      18  4
> >>>>       3  5
> >>>>
> >>>> OSD with larger (74%) disk usage:
> >>>>   count  pool
> >>>>     144  3
> >>>>      31  4
> >>>>       2  5
> >>>>
> >>>> Pool 3 is .rgw.buckets (which holds almost all of the data).
> >>>> Pool 4 is .log, which holds no data.
> >>>>
> >>>> Shouldn't the PG count be the same on every OSD?
> >>>> Or maybe the PG hashing is disrupted by the PG count of pool '4',
> >>>> which has 1440 PGs (not a power of 2).
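For anyone trying to reproduce the per-pool, per-OSD counts quoted above, a
rough tally can be pulled out of 'ceph pg dump'. This is only a sketch: the
column layout of 'pg dump' differs between Ceph releases, so the assumptions
noted in the comments (PG id in the first column, the first bracketed field
being the 'up' OSD list) may need adjusting.

  # Tally PGs per (pool, OSD) from plain 'ceph pg dump' output.
  # Assumes the PG id ("<pool>.<seed>") is the first field and the first
  # bracketed field on each line (e.g. [12,40,61]) is the 'up' OSD list.
  ceph pg dump 2>/dev/null | awk '
    $1 ~ /^[0-9]+\.[0-9a-f]+$/ {
      pool = $1; sub(/\..*/, "", pool)         # pool number before the dot
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^\[[0-9][0-9,]*\]$/) {       # first OSD list on the line
          gsub(/\[|\]/, "", $i)
          n = split($i, osd, ",")
          for (j = 1; j <= n; j++)
            pgs["pool " pool " osd." osd[j]]++
          break
        }
      }
    }
    END { for (k in pgs) print k ": " pgs[k] " PGs" }' | sort

Output lines such as "pool 3 osd.17: 144 PGs" then make imbalances like the
105 vs 144 case above easy to spot per pool.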
> >>>>
> >>>> ceph osd dump:
> >>>> pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0 crash_replay_interval 45
> >>>> pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
> >>>> pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
> >>>> pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner 0
> >>>> pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
> >>>> pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
> >>>> pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
> >>>> pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
> >>>> pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28467 owner 18446744073709551615
> >>>> pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28468 owner 18446744073709551615
> >>>> pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner 18446744073709551615
> >>>> pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 33487 owner 18446744073709551615
> >>>> pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
> >>>> pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 46912 owner 0
> >>>>
> >>>> --
> >>>> Regards
> >>>> Dominik
> >>>>
> >>>> 2014-02-01 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
> >>>> > Hi,
> >>>> >> Did you bump pgp_num as well?
> >>>> > Yes.
> >>>> >
> >>>> > See: http://dysk.onet.pl/link/BZ968
> >>>> >
> >>>> >> 25% of the pools are two times smaller than the others.
> >>>> > This changes after scrubbing.
> >>>> >
> >>>> > --
> >>>> > Regards
> >>>> > Dominik
> >>>> >
> >>>> > 2014-02-01 Kyle Bader <kyle.bader@xxxxxxxxx>:
> >>>> >>
> >>>> >>> Changing pg_num for .rgw.buckets to a power of 2 and 'crush tunables
> >>>> >>> optimal' didn't help :(
> >>>> >>
> >>>> >> Did you bump pgp_num as well? The split PGs will stay in place until
> >>>> >> pgp_num is bumped too; if you do this, be prepared for (potentially a
> >>>> >> lot of) data movement.
> >>>> >
> >>>> > --
> >>>> > Regards
> >>>> > Dominik
> >>>>
> >>>> --
> >>>> Regards
> >>>> Dominik
> >>>>
> >>
> >> --
> >> Regards
> >> Dominik
> >
> > --
> > Regards
> > Dominik
>
> --
> Regards
> Dominik
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
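Putting the thread's suggestions together, an offline check might look
roughly like the sketch below. The osdmaptool invocation is an assumption:
Sage describes --test-map-pgs as work in progress at the top of the thread,
so the exact flag and output may differ on any given version. The awk part
only flags pools whose pg_num is not a power of two (such as pool 4 '.log'
with pg_num 1440), following the 'ceph osd dump' format shown above.

  # Fetch the cluster's osdmap, as suggested at the top of the thread.
  ceph osd getmap -o /tmp/osdmap

  # Hypothetical offline PG-mapping check; --test-map-pgs was still being
  # added to osdmaptool when this was written, so verify with --help first.
  osdmaptool /tmp/osdmap --test-map-pgs

  # Flag pools whose pg_num is not a power of two, e.g. pool 4 '.log'
  # (pg_num 1440). Field layout follows the 'ceph osd dump' lines above.
  ceph osd dump | awk '/^pool / {
    n = 0
    for (i = 1; i < NF; i++) if ($i == "pg_num") n = $(i + 1)
    p = n; while (p > 1 && p % 2 == 0) p /= 2  # a power of two reduces to 1
    if (n > 0 && p != 1)
      print $1, $2, $3, "has pg_num", n, "(not a power of two)"
  }'

As Kyle points out above, changing pg_num only takes full effect once
pgp_num is bumped as well, and doing so (like setting hashpspool) can move
a large amount of data.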