On Tue, 22 Nov 2016, Adam C. Emerson wrote:
> On 22/11/2016, Sage Weil wrote:
> > Hi Adam,
> >
> > Sam had a suggestion about the CRUSH weight anomaly[1]. Instead of
> > adjusting the weight for a given bucket based on an expected num_rep
> > value, we could store a vector of weight values for every bucket in
> > the tree for a range of num_reps (2..15, or whatever range is
> > appropriate given the min_size/max_size values for the rules). In
> > general the tools will show the normal weight (which is the sum of
> > the children), but we'd also keep the adjusted values for any given
> > num_rep and use those for the actual choose.
> >
> > What do you think?
>
> I thought of something along those lines, though it makes me a bit
> uneasy. Right now, if I have a bunch of objects stored on a bunch of
> hosts and we increase the replication count, objects are copied to
> the new hosts but don't migrate between existing hosts. (This is part
> of the RUSH family monotonicity guarantee.)
>
> This seems like it might catch users by surprise and result in
> undesired behavior. Having Ceph operate this way would violate the
> /expectations/ people have from looking at a description of our
> algorithm.
>
> Rather than having CRUSH automagically pick the distribution based on
> the replication count, could we make it more explicit? I'm not sure
> what the best form would be. We might have 'auxiliary weightings' in
> straw2 and list buckets and a way for a CRUSH rule to select one of
> the alternates. That way we wouldn't have to replicate the entire
> hierarchy of devices, and people could 'opt in' explicitly.
>
> That might be a bit too fiddly, but I think you get the idea. I'm very
> uneasy about having 'magic replication count' behavior sneak up on
> people.

Hmm. What if the weight alternatives were based not on num_rep, but
rather on r (that is, on which replica we are choosing)? So the first
choice is based on the actual weights, the second choice on the r=2
weights, the third choice on the r=3 weights, and so on.

In a perfect world those weights would actually be a function of our
previous selection(s); since that is not computationally feasible
(right?), we can precalculate values that are "average" over the
possible prior choices.

It'd be a bit more complicated, but assume for a moment that we could
do this... Then, if you move from 2x to 3x for a given mapping, the
selection of the first 2 replicas won't be any different than before,
but the third replica will use different effective weights.

sage
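
For concreteness, here is a rough sketch of how per-position weight
alternatives could plug into the choose step. This is not actual
Ceph/CRUSH code: the names (toy_bucket, weights_for_r, choose_r), the
hash, and the draw math are all made up for illustration, the draw is
only a crude stand-in for straw2, and there is no collision or retry
handling. The one idea it shows is a bucket keeping its normal weights
plus an optional table of alternate weights indexed by the replica
position r, falling back to the normal weights for positions that have
no alternate.

/* Sketch of per-position (per-r) weight alternatives.
 * Hypothetical illustration only; not Ceph code. */
#include <math.h>
#include <stdio.h>
#include <string.h>

#define MAX_POSITIONS 16   /* replica positions 0..15 in this sketch */

struct toy_bucket {
	int size;                         /* number of child items       */
	const int *item;                  /* child item ids              */
	const double *weight;             /* normal weights              */
	const double *alt[MAX_POSITIONS]; /* optional alternates, by r   */
};

/* Pick the weight table to use when selecting replica position r. */
static const double *weights_for_r(const struct toy_bucket *b, int r)
{
	if (r >= 0 && r < MAX_POSITIONS && b->alt[r])
		return b->alt[r];
	return b->weight;                 /* fall back to normal weights */
}

/* Toy stand-in for a straw2-style draw: hash (x, item, r) into a
 * pseudo-uniform u, compute ln(u)/w, keep the max.  No collision or
 * retry handling; it only shows where the per-r table is consulted. */
static int choose_r(const struct toy_bucket *b, int x, int r)
{
	const double *w = weights_for_r(b, r);
	int best = -1;
	double best_draw = -HUGE_VAL;

	for (int i = 0; i < b->size; i++) {
		unsigned h = (unsigned)x * 2654435761u ^
			     (unsigned)b->item[i] * 40503u ^
			     (unsigned)r * 2246822519u;
		double u = (double)(h % 100000 + 1) / 100001.0; /* in (0,1) */
		double draw = log(u) / w[i]; /* bigger weight -> bigger draw */
		if (draw > best_draw) {
			best_draw = draw;
			best = b->item[i];
		}
	}
	return best;
}

int main(void)
{
	const int items[] = { 0, 1, 2 };
	const double base[]   = { 1.0, 1.0, 1.0 };
	const double alt_r2[] = { 1.3, 0.85, 0.85 }; /* adjusted for r=2 */
	struct toy_bucket b;

	memset(&b, 0, sizeof(b));
	b.size = 3;
	b.item = items;
	b.weight = base;
	b.alt[2] = alt_r2;  /* only the third replica position gets alternates */

	for (int x = 0; x < 4; x++)
		printf("input %d -> [%d, %d, %d]\n", x,
		       choose_r(&b, x, 0), choose_r(&b, x, 1),
		       choose_r(&b, x, 2));
	return 0;
}

Because the table is keyed by the replica position rather than by
num_rep, growing a mapping from 2x to 3x does not change which table
positions 0 and 1 consult, so their selections stay put; only the newly
added third position draws against different effective weights, which
is the behavior described in the reply above.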