Hi,
> FWIW the tunable that fixes this was just merged today but won't
> appear in a release for another 3 weeks or so.
Is this the "vary_r" tunable?

Can I use this in production?

--
Regards
Dominik

2014-02-12 3:24 GMT+01:00 Sage Weil <sage@xxxxxxxxxxx>:
> On Wed, 12 Feb 2014, Dominik Mostowiec wrote:
>> Hi,
>> Does this problem (PGs stuck in active+remapped after
>> reweight-by-utilization) affect all Ceph configurations or only
>> specific ones?
>> If only specific ones: what is the reason in my case? Is it caused by the
>> CRUSH configuration (cluster architecture, CRUSH tunables, ...), cluster
>> size, architecture design mistakes, or something else?
>
> It seems to just be the particular structure of your map. In your case
> you have a few different racks (or hosts? I forget) at the upper level of
> the hierarchy and then a handful of devices in the leaves that are marked
> out or reweighted down. With that combination CRUSH runs out of placement
> choices at the upper level and keeps trying the same values in the lower
> level. FWIW the tunable that fixes this was just merged today but won't
> appear in a release for another 3 weeks or so.
>
>> Second question.
>> The distribution of PGs across OSDs is better for large clusters (where
>> pg_num is higher). Is it possible (for small clusters) to change the CRUSH
>> distribution algorithm to a more linear one? (I realize that it will be
>> less efficient.)
>
> It is really related to the ratio of pg_num to total OSDs, not the absolute
> number. For small clusters it is probably more tolerable to have a larger
> pg_num count though because many of the costs normally associated with
> that (e.g., more peers) run up against the total host count before they
> start to matter.
>
> Again, I think the right answer here is picking a good pg to osd ratio and
> using reweight-by-utilization (which will be fixed soon).
>
> sage
>
>
>>
>> --
>> Regards
>> Dominik
>>
>> 2014-02-06 21:31 GMT+01:00 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>> > Great!
>> > Thanks for your help.
>> >
>> > --
>> > Regards
>> > Dominik
>> >
>> > 2014-02-06 21:10 GMT+01:00 Sage Weil <sage@xxxxxxxxxxx>:
>> >> On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
>> >>> Hi,
>> >>> Thanks !!
>> >>> Can you suggest any workaround for now?
>> >>
>> >> You can adjust the crush weights on the overfull nodes slightly. You'd
>> >> need to do it by hand, but that will do the trick. For example,
>> >>
>> >>   ceph osd crush reweight osd.123 .96
>> >>
>> >> (if the current weight is 1.0).
>> >>
>> >> sage
>> >>
>> >>>
>> >>> --
>> >>> Regards
>> >>> Dominik
>> >>>
>> >>>
>> >>> 2014-02-06 18:39 GMT+01:00 Sage Weil <sage@xxxxxxxxxxx>:
>> >>> > Hi,
>> >>> >
>> >>> > Just an update here. Another user saw this and after playing with it I
>> >>> > identified a problem with CRUSH. There is a branch outstanding
>> >>> > (wip-crush) that is pending review, but it's not a quick fix because of
>> >>> > compatibility issues.
>> >>> >
>> >>> > sage
>> >>> >
>> >>> >
>> >>> > On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
>> >>> >
>> >>> >> Hi,
>> >>> >> Maybe this info can help to find what is wrong.
>> >>> >> For one PG (3.1e4a) which is active+remapped:
>> >>> >> { "state": "active+remapped",
>> >>> >>   "epoch": 96050,
>> >>> >>   "up": [
>> >>> >>         119,
>> >>> >>         69],
>> >>> >>   "acting": [
>> >>> >>         119,
>> >>> >>         69,
>> >>> >>         7],
>> >>> >> Logs:
>> >>> >> On osd.7:
>> >>> >> 2014-02-04 09:45:54.966913 7fa618afe700 1 osd.7 pg_epoch: 94460
>> >>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>> >>> >> n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=-1
>> >>> >> lpr=94460 pi=92546-94459/5 lcod 94459'207003 inactive NOTIFY]
>> >>> >> state<Start>: transitioning to Stray
>> >>> >> 2014-02-04 09:45:55.781278 7fa6172fb700 1 osd.7 pg_epoch: 94461
>> >>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>> >>> >> n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
>> >>> >> [119,69]/[119,69,7,142] r=2 lpr=94461 pi=92546-94460/6 lcod
>> >>> >> 94459'207003 remapped NOTIFY] state<Start>: transitioning to Stray
>> >>> >> 2014-02-04 09:49:01.124510 7fa618afe700 1 osd.7 pg_epoch: 94495
>> >>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
>> >>> >> n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
>> >>> >> r=2 lpr=94495 pi=92546-94494/7 lcod 94459'207003 remapped]
>> >>> >> state<Start>: transitioning to Stray
>> >>> >>
>> >>> >> On osd.119:
>> >>> >> 2014-02-04 09:45:54.981707 7f37f07c5700 1 osd.119 pg_epoch: 94460
>> >>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>> >>> >> n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=0
>> >>> >> lpr=94460 pi=93485-94459/1 mlcod 0'0 inactive] state<Start>:
>> >>> >> transitioning to Primary
>> >>> >> 2014-02-04 09:45:55.805712 7f37ecfbe700 1 osd.119 pg_epoch: 94461
>> >>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
>> >>> >> n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
>> >>> >> [119,69]/[119,69,7,142] r=0 lpr=94461 pi=93485-94460/2 mlcod 0'0
>> >>> >> remapped] state<Start>: transitioning to Primary
>> >>> >> 2014-02-04 09:45:56.794015 7f37edfc0700 0 log [INF] : 3.1e4a
>> >>> >> restarting backfill on osd.69 from (0'0,0'0] MAX to 94459'207004
>> >>> >> 2014-02-04 09:49:01.156627 7f37ef7c3700 1 osd.119 pg_epoch: 94495
>> >>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
>> >>> >> n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
>> >>> >> r=0 lpr=94495 pi=94461-94494/1 mlcod 0'0 remapped] state<Start>:
>> >>> >> transitioning to Primary
>> >>> >>
>> >>> >> On osd.69:
>> >>> >> 2014-02-04 09:45:56.845695 7f2231372700 1 osd.69 pg_epoch: 94462
>> >>> >> pg[3.1e4a( empty local-les=0 n=0 ec=4 les/c 93486/93486
>> >>> >> 94460/94461/92233) [119,69]/[119,69,7,142] r=1 lpr=94462
>> >>> >> pi=93485-94460/2 inactive] state<Start>: transitioning to Stray
>> >>> >> 2014-02-04 09:49:01.153695 7f2229b63700 1 osd.69 pg_epoch: 94495
>> >>> >> pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
>> >>> >> n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
>> >>> >> r=1 lpr=94495 pi=93485-94494/3 remapped] state<Start>: transitioning
>> >>> >> to Stray
>> >>> >>
>> >>> >> pg query recovery state:
>> >>> >>   "recovery_state": [
>> >>> >>         { "name": "Started\/Primary\/Active",
>> >>> >>           "enter_time": "2014-02-04 09:49:02.070724",
>> >>> >>           "might_have_unfound": [],
>> >>> >>           "recovery_progress": { "backfill_target": -1,
>> >>> >>               "waiting_on_backfill": 0,
>> >>> >>               "backfill_pos": "0\/\/0\/\/-1",
>> >>> >>               "backfill_info": { "begin": "0\/\/0\/\/-1",
>> >>> >>                   "end": "0\/\/0\/\/-1",
>> >>> >>                   "objects": []},
>> >>> >>               "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>> >>> >>                   "end": "0\/\/0\/\/-1",
>> >>> >>                   "objects": []},
>> >>> >>               "backfills_in_flight": [],
>> >>> >>               "pull_from_peer": [],
>> >>> >>               "pushing": []},
>> >>> >>           "scrub": { "scrubber.epoch_start": "77502",
>> >>> >>               "scrubber.active": 0,
>> >>> >>               "scrubber.block_writes": 0,
"scrubber.finalizing": 0, >> >>> >> "scrubber.waiting_on": 0, >> >>> >> "scrubber.waiting_on_whom": []}}, >> >>> >> { "name": "Started", >> >>> >> "enter_time": "2014-02-04 09:49:01.156626"}]} >> >>> >> >> >>> >> --- >> >>> >> Regards >> >>> >> Dominik >> >>> >> >> >>> >> 2014-02-04 12:09 GMT+01:00 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>: >> >>> >> > Hi, >> >>> >> > Thanks for Your help !! >> >>> >> > We've done again 'ceph osd reweight-by-utilization 105' >> >>> >> > Cluster stack on 10387 active+clean, 237 active+remapped; >> >>> >> > More info in attachments. >> >>> >> > >> >>> >> > -- >> >>> >> > Regards >> >>> >> > Dominik >> >>> >> > >> >>> >> > >> >>> >> > 2014-02-04 Sage Weil <sage@xxxxxxxxxxx>: >> >>> >> >> Hi, >> >>> >> >> >> >>> >> >> I spent a couple hours looking at your map because it did look like there >> >>> >> >> was something wrong. After some experimentation and adding a bucnh of >> >>> >> >> improvements to osdmaptool to test the distribution, though, I think >> >>> >> >> everything is working as expected. For pool 3, your map has a standard >> >>> >> >> deviation in utilizations of ~8%, and we should expect ~9% for this number >> >>> >> >> of PGs. For all pools, it is slightly higher (~9% vs expected ~8%). >> >>> >> >> This is either just in the noise, or slightly confounded by the lack of >> >>> >> >> the hashpspool flag on the pools (which slightly amplifies placement >> >>> >> >> nonuniformity with multiple pools... not enough that it is worth changing >> >>> >> >> anything though). >> >>> >> >> >> >>> >> >> The bad news is that that order of standard deviation results in pretty >> >>> >> >> wide min/max range of 118 to 202 pgs. 
>> >>> >> >> That seems a *bit* higher than what a perfectly random placement
>> >>> >> >> generates (I'm seeing a spread that is usually 50-70 pgs), but I
>> >>> >> >> think *that* is where the pool overlap (no hashpspool) is rearing
>> >>> >> >> its head; for just pool three the spread of 50 is about what is
>> >>> >> >> expected.
>> >>> >> >>
>> >>> >> >> Long story short: you have two options. One is increasing the number of
>> >>> >> >> PGs. Note that this helps but has diminishing returns (doubling PGs
>> >>> >> >> only takes you from ~8% to ~6% standard deviation, quadrupling to ~4%).
>> >>> >> >>
>> >>> >> >> The other is to use reweight-by-utilization. That is the best approach,
>> >>> >> >> IMO. I'm not sure why you were seeing PGs stuck in the remapped state
>> >>> >> >> after you did that, but I'm happy to dig into that too.
>> >>> >> >>
>> >>> >> >> BTW, the osdmaptool addition I was using to play with is here:
>> >>> >> >> https://github.com/ceph/ceph/pull/1178
>> >>> >> >>
>> >>> >> >> sage
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
>> >>> >> >>
>> >>> >> >>> In other words,
>> >>> >> >>> 1. we've got 3 racks (1 replica per rack)
>> >>> >> >>> 2. in every rack we have 3 hosts
>> >>> >> >>> 3. every host has 22 OSDs
>> >>> >> >>> 4. pg_num is 2^n for every pool
>> >>> >> >>> 5. we enabled "crush tunables optimal".
>> >>> >> >>> 6. on every machine we disabled 4 unused disks (osd out, osd reweight
>> >>> >> >>> 0 and osd rm)
>> >>> >> >>>
>> >>> >> >>> Pool ".rgw.buckets": one OSD has 105 PGs and another one (on the same
>> >>> >> >>> machine) has 144 PGs (37% more!).
>> >>> >> >>> Other pools have this problem too. It's not efficient placement.
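The ~8% / ~6% / ~4% figures quoted above can be roughly reproduced with a back-of-the-envelope model. This is only a sketch: it assumes every replica lands on an equally weighted OSD independently of the others (so the per-OSD PG count is binomial), which ignores the rack constraint in the real CRUSH rule. The function name is made up for illustration, and the OSD count of 3 x 3 x (22 - 4) = 162 is an assumption about how the "22 OSDs, 4 disabled" description should be read.

```python
import math


def expected_pg_spread(pg_num, replicas, num_osds):
    """Return (mean PGs per OSD, relative std dev) for uniform random placement.

    Assumes each replica picks an OSD independently with probability
    1/num_osds, so the per-OSD count is binomial(pg_num * replicas, 1/num_osds).
    """
    placements = pg_num * replicas
    p = 1.0 / num_osds
    mean = placements * p
    std = math.sqrt(placements * p * (1.0 - p))
    return mean, std / mean


# Assumption: 3 racks x 3 hosts x (22 - 4) in-service OSDs per host.
NUM_OSDS = 3 * 3 * (22 - 4)

for factor in (1, 2, 4):
    pg_num = 8192 * factor  # pool 3 (.rgw.buckets), then doubled and quadrupled
    mean, rel = expected_pg_spread(pg_num, 3, NUM_OSDS)
    print(f"pg_num={pg_num}: mean={mean:.1f} PGs/OSD, rel std={rel * 100:.1f}%")
```

Under these assumptions the relative standard deviation comes out at roughly 8%, 6% and 4%, i.e. it shrinks only with the square root of the PG count, which is why doubling pg_num has diminishing returns.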
>> >>> >> >>>
>> >>> >> >>> --
>> >>> >> >>> Regards
>> >>> >> >>> Dominik
>> >>> >> >>>
>> >>> >> >>>
>> >>> >> >>> 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>> >>> >> >>> > Hi,
>> >>> >> >>> > For more info:
>> >>> >> >>> >   crush: http://dysk.onet.pl/link/r4wGK
>> >>> >> >>> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
>> >>> >> >>> >   pg_dump: http://dysk.onet.pl/link/4jkqM
>> >>> >> >>> >
>> >>> >> >>> > --
>> >>> >> >>> > Regards
>> >>> >> >>> > Dominik
>> >>> >> >>> >
>> >>> >> >>> > 2014-02-02 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>> >>> >> >>> >> Hi,
>> >>> >> >>> >> Hmm,
>> >>> >> >>> >> I think you mean the sum of PGs from different pools on one OSD.
>> >>> >> >>> >> But for one pool (.rgw.buckets), where I have almost all of my data, the
>> >>> >> >>> >> PG count on OSDs is also different.
>> >>> >> >>> >> For example 105 vs 144 PGs from pool .rgw.buckets. In the first case it
>> >>> >> >>> >> is 52% disk usage, in the second 74%.
>> >>> >> >>> >>
>> >>> >> >>> >> --
>> >>> >> >>> >> Regards
>> >>> >> >>> >> Dominik
>> >>> >> >>> >>
>> >>> >> >>> >>
>> >>> >> >>> >> 2014-02-02 Sage Weil <sage@xxxxxxxxxxx>:
>> >>> >> >>> >>> It occurs to me that this (and other unexplained variance reports) could
>> >>> >> >>> >>> easily be the 'hashpspool' flag not being set. The old behavior had the
>> >>> >> >>> >>> misfeature where consecutive pools' PGs would 'line up' on the same osds,
>> >>> >> >>> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes. This
>> >>> >> >>> >>> tends to 'amplify' any variance in the placement. The default is still to
>> >>> >> >>> >>> use the old behavior for compatibility (this will finally change in
>> >>> >> >>> >>> firefly).
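To illustrate the 'line up' effect described above, here is a toy sketch. It assumes the legacy placement seed is simply the PG number plus the pool id, which is consistent with the 1.7 == 2.6 == 3.5 == 4.4 example, and it uses CRC32 purely as a stand-in mixing function; Ceph itself uses its own rjenkins-based hash, so the hashed values below are not what a real cluster would compute. Both function names are hypothetical.

```python
import zlib


def legacy_seed(pool_id, ps):
    # Assumed legacy behaviour (hashpspool unset): the placement seed is
    # just the PG number plus the pool id, so neighbouring pools overlap.
    return ps + pool_id


def hashed_seed(pool_id, ps):
    # Stand-in for the hashpspool behaviour: mix pool id and PG number
    # through a hash. CRC32 is used for illustration only, NOT Ceph's hash.
    return zlib.crc32(f"{pool_id}.{ps}".encode())


# The PGs from the example: 1.7, 2.6, 3.5 and 4.4.
pgs = [(1, 7), (2, 6), (3, 5), (4, 4)]

legacy = {legacy_seed(pool, ps) for pool, ps in pgs}
hashed = {hashed_seed(pool, ps) for pool, ps in pgs}

print("distinct legacy seeds:", len(legacy))
print("distinct hashed seeds:", len(hashed))
```

In this toy model all four PGs share one legacy seed and therefore one set of OSDs, while the hashed seeds are all different, so any per-OSD imbalance in one pool is no longer repeated by the next pool.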
>> >>> >> >>> >>>
>> >>> >> >>> >>> You can do
>> >>> >> >>> >>>
>> >>> >> >>> >>>   ceph osd pool set <poolname> hashpspool true
>> >>> >> >>> >>>
>> >>> >> >>> >>> to enable the new placement logic on an existing pool, but be warned that
>> >>> >> >>> >>> this will rebalance *all* of the data in the pool, which can be a very
>> >>> >> >>> >>> heavyweight operation...
>> >>> >> >>> >>>
>> >>> >> >>> >>> sage
>> >>> >> >>> >>>
>> >>> >> >>> >>>
>> >>> >> >>> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
>> >>> >> >>> >>>
>> >>> >> >>> >>>> Hi,
>> >>> >> >>> >>>> After scrubbing, almost all PGs have a roughly equal number of objects.
>> >>> >> >>> >>>> I found something else.
>> >>> >> >>> >>>> PG count on OSDs on one host:
>> >>> >> >>> >>>> OSD with small (52%) disk usage:
>> >>> >> >>> >>>> count, pool
>> >>> >> >>> >>>>   105  3
>> >>> >> >>> >>>>    18  4
>> >>> >> >>> >>>>     3  5
>> >>> >> >>> >>>>
>> >>> >> >>> >>>> OSD with larger (74%) disk usage:
>> >>> >> >>> >>>>   144  3
>> >>> >> >>> >>>>    31  4
>> >>> >> >>> >>>>     2  5
>> >>> >> >>> >>>>
>> >>> >> >>> >>>> Pool 3 is .rgw.buckets (where almost all of the data is).
>> >>> >> >>> >>>> Pool 4 is .log, where there is no data.
>> >>> >> >>> >>>>
>> >>> >> >>> >>>> Shouldn't the count of PGs be the same per OSD?
>> >>> >> >>> >>>> Or maybe the PG hash algorithm is disrupted by the wrong PG count for
>> >>> >> >>> >>>> pool '4'. It has 1440 PGs (this is not a power of 2).
>> >>> >> >>> >>>>
>> >>> >> >>> >>>> ceph osd dump:
>> >>> >> >>> >>>> pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
>> >>> >> >>> >>>> crash_replay_interval 45
>> >>> >> >>> >>>> pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
>> >>> >> >>> >>>> pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
>> >>> >> >>> >>>> pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
>> >>> >> >>> >>>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
>> >>> >> >>> >>>> 0
>> >>> >> >>> >>>> pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
>> >>> >> >>> >>>> pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
>> >>> >> >>> >>>> pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
>> >>> >> >>> >>>> pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
>> >>> >> >>> >>>> pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28467 owner
>> >>> >> >>> >>>> 18446744073709551615
>> >>> >> >>> >>>> pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 28468 owner
>> >>> >> >>> >>>> 18446744073709551615
>> >>> >> >>> >>>> pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0
>> >>> >> >>> >>>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner
>> >>> >> >>> >>>> 18446744073709551615
>> >>> >> >>> >>>> pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 33487 owner
>> >>> >> >>> >>>> 18446744073709551615
>> >>> >> >>> >>>> pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash
>> >>> >> >>> >>>> rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
>> >>> >> >>> >>>> pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
>> >>> >> >>> >>>> pg_num 8 pgp_num 8 last_change 46912 owner 0
>> >>> >> >>> >>>>
>> >>> >> >>> >>>> --
>> >>> >> >>> >>>> Regards
>> >>> >> >>> >>>> Dominik
>> >>> >> >>> >>>>
>> >>> >> >>> >>>> 2014-02-01 Dominik Mostowiec <dominikmostowiec@xxxxxxxxx>:
>> >>> >> >>> >>>> > Hi,
>> >>> >> >>> >>>> >> Did you bump pgp_num as well?
>> >>> >> >>> >>>> > Yes.
>> >>> >> >>> >>>> >
>> >>> >> >>> >>>> > See: http://dysk.onet.pl/link/BZ968
>> >>> >> >>> >>>> >
>> >>> >> >>> >>>> >> 25% of pools are two times smaller than the others.
>> >>> >> >>> >>>> > This is changing after scrubbing.
>> >>> >> >>> >>>> >
>> >>> >> >>> >>>> > --
>> >>> >> >>> >>>> > Regards
>> >>> >> >>> >>>> > Dominik
>> >>> >> >>> >>>> >
>> >>> >> >>> >>>> > 2014-02-01 Kyle Bader <kyle.bader@xxxxxxxxx>:
>> >>> >> >>> >>>> >>
>> >>> >> >>> >>>> >>> Changing pg_num for .rgw.buckets to a power of 2, and 'crush tunables
>> >>> >> >>> >>>> >>> optimal', didn't help :(
>> >>> >> >>> >>>> >>
>> >>> >> >>> >>>> >> Did you bump pgp_num as well? The split pgs will stay in place until pgp_num
>> >>> >> >>> >>>> >> is bumped as well; if you do this, be prepared for (potentially lots of)
>> >>> >> >>> >>>> >> data movement.
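The two checks discussed above (is pg_num a power of 2, and was pgp_num bumped along with it) can be sketched as a small script over the pool lines of 'ceph osd dump'. This is only an illustration: the function name and regular expression are made up here, and it assumes the old-style one-line-per-pool format shown in this thread, unwrapped, rather than the mail-wrapped copy quoted above.

```python
import re

# Matches the pool lines of the 'ceph osd dump' format shown in this thread.
POOL_RE = re.compile(r"pool (\d+) '([^']*)' .*?pg_num (\d+) pgp_num (\d+)")


def check_pools(dump_text):
    """Yield (pool_id, name, problem) for pools that deserve a second look."""
    for m in POOL_RE.finditer(dump_text):
        pool_id, name = int(m.group(1)), m.group(2)
        pg_num, pgp_num = int(m.group(3)), int(m.group(4))
        # A power of 2 has exactly one bit set, so n & (n - 1) is 0.
        if pg_num & (pg_num - 1):
            yield pool_id, name, f"pg_num {pg_num} is not a power of 2"
        # Split PGs do not move until pgp_num catches up with pg_num.
        if pgp_num != pg_num:
            yield pool_id, name, f"pgp_num {pgp_num} != pg_num {pg_num}"


sample = """\
pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner 0
pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
"""

for issue in check_pools(sample):
    print(issue)
```

On the two sample lines taken from the dump in this thread, only pool 4 ('.log', pg_num 1440) is flagged.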
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html