Forwarding per Sage's suggestion. -----Original Message----- From: Sage Weil [mailto:sage@xxxxxxxxxxx] Sent: Wednesday, March 19, 2014 11:29 PM To: Mark Nelson Cc: Zhang, Jian; Duan, Jiangang; He, Yujie Subject: Re: CURSH optimization for unbalanced pg distribution On Wed, 19 Mar 2014, Mark Nelson wrote: > On 03/19/2014 03:24 AM, Zhang, Jian wrote: > > For more detail data, please refer to the *Testing results* part. > > > > *Optimization proposals: * > > > > After we dived into the source code of CRUSH and related papers, we > > proposed two possible optimizations: > > > > 1.Add different hash algorithms, as an alternative for the Jenkin's > > hash, e.g. algorithm that will produce even values when range of > > input value (pg#) is relatively small. Or add new bucket type at the > > same time if necessary. This *might* work, but I don't have a strong intuition about it. The modeling we've done now has essentially assumed a statistically uniform distribution, which has some inherent inbalance for low values of n (num pgs in our case). I have generally assumed we can't do better than "random", and still have the other properties we want (independent, deterministic placement), but it may be possible. > > > > 2.Find a better replica placement strategy instead of current retry > > logic of crush_choose_firstn, which may cause CRUSH to behave badly. > > > > We find there are several threshold of retry times by referring to code, > > choose_total_tries, choose_local_tries and choose_local_fallback_tries. > > They are used to decide whether to do a retry_bucket, retry_descent or > > use permutation to do an exhaustive bucket search. We are wondering if > > there is another retry strategy: > > > > a)Backtracking retry. Now the logic of crush_choose_firstn can only > > issue an retry either from the initial bucket(retry_descent) or from the > > current bucket (retry_bucket), how about retrying the intervening buckets? > > > > b)Adjust threshold of retry times by other values. We do noticed that > > the 'optimal' crush tunable could be used to make it, but we still > > encounter unbalanced [g distribution by using the optimal strategy. > > Please refer to 4 of the Testing results part. > > > > c)Add an mechanism that can adjust above mentioned thresholds > > adaptively. Maybe we can record the retry times of the previous call for > > CRUSH, and adjust retry thresholds automatically according to the record. I suggest ignoring all of this retry logic. The original version of CRUSH has the local retries to try to make data move "less far", but when we went back a year ago and did a statistical analysis of the distribution we found that *all* of these hacks degraded the quality of the placement,a nd by turning them all off (setting the 'optimal' values which zeroes them all out excent for total_retries) we got something that was indistinguishable from a uniform distribution. > > 3.Add soft link for pg directories. During pg creation, we can create > > soft links for the pgs if pg# on the selected osd is more than some > > threshold, say 10% more than desired average number, to move objects > > that will be stored in this pg to another osd. Balanced disk utilization > > may be gained in this way. I think you need to be careful, but yes, this is an option. There is a similar exception mechanism in place that is used for other purposes and something similar could be done here. The main challenge will be in ensuring that the soft links/exceptions follow the same overall policy that CRUSH does after the raw mapping is performed. This is an option, but I would put it toward the bottom of the list... > > 4.Change placement strategy only for step of selecting devices from > > hosts. We found in our testing results that pg distribution was balanced > > among hosts, which is reasonable since pg# of each host is above 1K > > (according to the current BKM that pg# per osd should be about 100). So > > how about we apply CRUSH only on the interval buckets and find another > > simple but more balanced method to choose osd from host? This idea has a lot of potential. For example: If you know the chassis can hold 12 disks, you can force the bucket size to twelve and somehow prevent users from adjusting the structure of the tree. Then you can use a simple mapping that is truly flat (like a linear mapping, disk = x % num_disks) for that bucket/subtree. The downside of course is that if you remove a disk *everything* reshuffles, hence some sort of guardrails to prevent a user from inadvertantly doing that. If a disk *does* fail, you just need to make sure the disk is marked "out" but not removed from the CRUSH hierarchy and the normal retry will kick in. Note that all this is reall doing is increasing the size of the "buckets" that we are (pseudo)randomly distribution over. It is still a random/uniform distribution, but the N value is 12 times bigger (for a 12 disk chassis) and as a result the variance is substantially lower. I would suggest making a new bucket type that is called 'linear' and does a simple modulo and trying this out. We will need a bunch of additional safety checks to help users avoid doing silly things (like adjusting the number of items in the linear buckets, which reshuffle everything) but that wouldn't be needed for an initial analysis of the performance impact. Do you mind if we shift this thread over to ceph-devel? I think there are lots of people who would be interested in this discussion. We can of course leave off your attachment if you prefer. Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html