On Tue, 21 Oct 2014, Lei Dong wrote:
> Thanks Sage!
> So you mean:
>
> 1. Choose step will not be affected by OSD weight (but only CRUSH weight).

Yes, if the choose type is not 'osd'.

> 2. Chooseleaf step will be affected by both of the two weights. But with
> a big variation in CRUSH weight and a small OSD number, CRUSH works
> inefficiently to make the distribution even, although we can adjust OSD
> weight.
>
> Right?

Right.  As a simple example, let's say we're picking 2 replicas and the
weights are [1, 2, 1].  It's pretty obvious that the only two choices are
a,b and b,c, but CRUSH will have a very hard time with this because it is
doing an independent selection for each position.  Things get harder as
the number of replicas increases.

sage

>
> LeiDong
>
> On 10/20/14, 11:03 PM, "Sage Weil" <sage@xxxxxxxxxxxx> wrote:
>
> >On Mon, 20 Oct 2014, Lei Dong wrote:
> >> Hi Sage:
> >>
> >> As you said at https://github.com/ceph/ceph/pull/2199, adjusting
> >> weight or crush weight should both be effective in any case.  We've
> >> encountered a situation in which adjusting weight seems far less
> >> effective than adjusting crush weight.
> >>
> >> We use 6 racks with host counts {9, 5, 9, 4, 9, 4} and 11 OSDs on
> >> each host.  We created a crush rule for an EC pool to survive a rack
> >> failure:
> >>
> >> ruleset ecrule {
> >>   ...
> >>   min_size 11
> >>   max_size 11
> >>   step set_chooseleaf_tries 50
> >>   step take default
> >>   step choose firstn 4 type rack    // we want the distribution to be
> >>                                     // {3, 3, 3, 2} for k=8 m=3
> >>   step chooseleaf indep 3 type host
> >>   step emit
> >> }
> >>
> >> After creating the pool, we ran osd reweight-by-pg many times; the
> >> best result it can reach is:
> >>   Average PGs/OSD (expected): 225.28
> >>   Max PGs/OSD: 307
> >>   Min PGs/OSD: 164
> >>
> >> Then we ran our own tool to reweight (same strategy as
> >> reweight-by-pg, just adjusting crush weight instead of weight); the
> >> best result is:
> >>   Average PGs/OSD (expected): 225.28
> >>   Max PGs/OSD: 241
> >>   Min PGs/OSD: 207
> >> This is much better than the previous one.
> >>
> >> According to my understanding, due to the uneven host numbers across
> >> racks, for "step choose firstn 4 type rack":
> >> 1. If we adjust osd weight, this step is almost unaffected and will
> >>    dispatch an almost even number of PGs to each rack.  Thus the
> >>    hosts in the racks which have fewer hosts will take more PGs, no
> >>    matter how we adjust the weight.
> >> 2. If we adjust osd crush weight, this step is affected and will try
> >>    to dispatch more PGs to the racks which have a higher crush
> >>    weight, so the result can be even.
> >> Am I right about this?
> >
> >I think so, yes.  I am a bit surprised that this is a problem, though.
> >We will still be distributing PGs based on the relative CRUSH weights,
> >and I would not expect that the expected variation will lead to very
> >much skew between racks.
> >
> >It may be that CRUSH is, at baseline, having trouble respecting your
> >weights.  You might try creating a single straw bucket with 6 OSDs and
> >those weights (9, 5, 9, 4, 9, 4) and see if it is able to achieve a
> >correct distribution.  When there is a lot of variation in weights and
> >the total number of items is small it can be hard for it to get to the
> >right result.  (We were just looking into a similar problem on another
> >cluster on Friday.)
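The experiment suggested above can be approximated without a cluster by a
short standalone simulation.  The sketch below is not Ceph's CRUSH code:
it treats each position as a straw2-style weighted draw (an exponential
variate scaled by weight) and simply re-draws duplicate winners, then
compares each rack's observed share of PGs with the share its weight asks
for when choosing firstn 4 of 6 racks weighted (9, 5, 9, 4, 9, 4).

#!/usr/bin/env python3
# Rough approximation of the straw-bucket experiment described above.
# Not Ceph's CRUSH implementation: each position is an independent
# straw2-style weighted draw, and duplicate winners are re-drawn,
# roughly mirroring how firstn rejects collisions.
import random

rack_weights = [9, 5, 9, 4, 9, 4]      # CRUSH weight per rack
total = sum(rack_weights)
racks_per_pg = 4                       # "step choose firstn 4 type rack"
num_pgs = 200_000

counts = [0] * len(rack_weights)
for _ in range(num_pgs):
    chosen = set()
    while len(chosen) < racks_per_pg:
        # straw2-like draw: smallest exp(1)/weight wins, so rack i wins
        # a single draw with probability rack_weights[i] / total.
        straws = [random.expovariate(1.0) / w for w in rack_weights]
        chosen.add(min(range(len(rack_weights)), key=straws.__getitem__))
    for i in chosen:
        counts[i] += 1

print("rack  weight  target_share  observed_share")
for i, w in enumerate(rack_weights):
    target = min(1.0, racks_per_pg * w / total)  # share if weights were honoured
    print(f"{i:4d}  {w:6d}  {target:12.3f}  {counts[i] / num_pgs:14.3f}")

With only 6 candidates for 4 slots and this much weight variation, the
low-weight racks tend to come out well above their weight share and the
high-weight racks below it, which is consistent with the baseline skew
described above.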
> >
> >For a more typical chooseleaf the osd weight will have the intended
> >behavior, but when the initial step is a regular choose, only the CRUSH
> >weights affect the decision.  My guess is that your process is skewing
> >the CRUSH weights pretty dramatically, which is able to compensate for
> >the difficulty/improbability of randomly choosing racks with the right
> >frequency...
> >
> >sage
> >
> >> We then did a further test with 6 racks and 9 hosts in each rack.  In
> >> this situation, adjusting weight or adjusting crush weight has almost
> >> the same effect.
> >>
> >> So, weight and crush weight do impact the result of CRUSH in
> >> different ways?
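To put numbers on the [1, 2, 1] example near the top of this mail, the
snippet below (again an illustration, not CRUSH itself) enumerates the
exact inclusion probabilities when 2 distinct items are chosen by
independent weighted draws with duplicates re-drawn, which amounts to
weighted sampling without replacement, and compares them with the shares
the weights call for.

#!/usr/bin/env python3
# Exact enumeration of the [1, 2, 1] example: choose 2 distinct items by
# independent weighted draws with duplicates re-drawn (equivalent to
# weighted sampling without replacement), and compare each item's
# inclusion probability with the share its weight asks for.
from itertools import permutations

weights = {'a': 1, 'b': 2, 'c': 1}
total = sum(weights.values())
copies = 2

inclusion = {name: 0.0 for name in weights}
for first, second in permutations(weights, copies):
    # P(first drawn, then second among the remaining items)
    p = (weights[first] / total) * (weights[second] / (total - weights[first]))
    inclusion[first] += p
    inclusion[second] += p

print("item  weight  ideal_share  actual_share")
for name, w in weights.items():
    ideal = min(1.0, copies * w / total)
    print(f"{name:4}  {w:6d}  {ideal:11.3f}  {inclusion[name]:12.3f}")

The weights say b should hold a copy of every PG and a and c half each,
but under this model b only reaches 5/6 while a and c rise to 7/12
apiece.  The same effect, scaled up, is what makes a layout like
{9, 5, 9, 4, 9, 4} hard to balance by adjusting osd weight alone when the
first step is a plain choose over racks.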