On 20/03/2014 04:54, Zhang, Jian wrote:
> Forwarding per Sage's suggestion.

Very interesting discussion :-) For the record the corresponding pull
request is https://github.com/ceph/ceph/pull/2402

> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxx]
> Sent: Wednesday, March 19, 2014 11:29 PM
> To: Mark Nelson
> Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
> Subject: Re: CRUSH optimization for unbalanced pg distribution
>
> On Wed, 19 Mar 2014, Mark Nelson wrote:
>> On 03/19/2014 03:24 AM, Zhang, Jian wrote:
>>> For more detailed data, please refer to the *Testing results* part.
>>>
>>> *Optimization proposals:*
>>>
>>> After we dived into the source code of CRUSH and the related papers,
>>> we came up with the following optimization proposals:
>>>
>>> 1. Add different hash algorithms as an alternative to the Jenkins
>>> hash, e.g. an algorithm that produces a more even distribution when
>>> the range of input values (pg#) is relatively small. Or add a new
>>> bucket type at the same time if necessary.
>
> This *might* work, but I don't have a strong intuition about it. The
> modeling we've done so far has essentially assumed a statistically
> uniform distribution, which has some inherent imbalance for low values
> of n (the number of pgs, in our case). I have generally assumed we
> can't do better than "random" and still keep the other properties we
> want (independent, deterministic placement), but it may be possible.
>
>>> 2. Find a better replica placement strategy to replace the current
>>> retry logic of crush_choose_firstn, which may cause CRUSH to behave
>>> badly.
>>>
>>> Looking at the code, we found several retry-count thresholds:
>>> choose_total_tries, choose_local_tries and choose_local_fallback_tries.
>>> They are used to decide whether to do a retry_bucket, a retry_descent,
>>> or an exhaustive bucket search using a permutation. We are wondering
>>> whether a different retry strategy is possible:
>>>
>>> a) Backtracking retry. The current logic of crush_choose_firstn can
>>> only retry either from the initial bucket (retry_descent) or from the
>>> current bucket (retry_bucket); how about retrying the intervening
>>> buckets?
>>>
>>> b) Adjust the retry thresholds to other values. We did notice that the
>>> 'optimal' crush tunables can be used for this, but we still encountered
>>> unbalanced pg distribution with the optimal profile. Please refer to
>>> item 4 of the Testing results part.
>>>
>>> c) Add a mechanism that adjusts the above thresholds adaptively. Maybe
>>> we can record the number of retries of previous CRUSH calls and adjust
>>> the retry thresholds automatically based on that record.
>
> I suggest ignoring all of this retry logic. The original version of CRUSH
> has the local retries to try to make data move "less far", but when we
> went back a year ago and did a statistical analysis of the distribution
> we found that *all* of these hacks degraded the quality of the placement,
> and by turning them all off (setting the 'optimal' values, which zero
> them all out except for choose_total_tries) we got something that was
> indistinguishable from a uniform distribution.
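
For reference, the three thresholds discussed above are plain fields of
struct crush_map, and the 'optimal' profile Sage mentions essentially
zeroes the local-retry ones. A minimal sketch, assuming the crush.h field
names; the exact values are recalled from the bobtail-era 'optimal'
profile and may differ between releases:

#include "crush/crush.h"   /* struct crush_map and its tunable fields */

/* Apply "optimal"-style tunables: drop the local retry hacks and keep
 * only the global descent-retry bound.  The field names are the ones
 * discussed above; the values are an assumption, not copied from the
 * tree. */
static void set_optimal_like_tunables(struct crush_map *map)
{
	map->choose_local_tries = 0;          /* disables retry_bucket */
	map->choose_local_fallback_tries = 0; /* disables the permutation fallback */
	map->choose_total_tries = 50;         /* global bound on retry_descent */
	map->chooseleaf_descend_once = 1;     /* retry the full descent, not just the leaf step */
}

On a running cluster the same thing is normally done with
'ceph osd crush tunables optimal', provided all clients are recent enough
to decode the resulting map.
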
>>> 3. Add soft links for pg directories. During pg creation, if the pg#
>>> on the selected osd is more than some threshold, say 10% above the
>>> desired average, we can create soft links for those pgs to move the
>>> objects that would be stored in them to another osd. Balanced disk
>>> utilization may be gained this way.
>
> I think you need to be careful, but yes, this is an option. There is a
> similar exception mechanism in place that is used for other purposes,
> and something similar could be done here. The main challenge will be in
> ensuring that the soft links/exceptions follow the same overall policy
> that CRUSH does after the raw mapping is performed. This is an option,
> but I would put it toward the bottom of the list...
>
>>> 4. Change the placement strategy only for the step that selects
>>> devices from hosts. Our testing results show that pg distribution is
>>> balanced among hosts, which is reasonable since the pg# of each host
>>> is above 1K (according to the current BKM that pg# per osd should be
>>> about 100). So how about applying CRUSH only to the buckets above the
>>> host level and using another simple but more balanced method to choose
>>> the osd within a host?
>
> This idea has a lot of potential. For example:
>
> If you know the chassis can hold 12 disks, you can force the bucket size
> to twelve and somehow prevent users from adjusting the structure of the
> tree. Then you can use a simple mapping that is truly flat (like a linear
> mapping, disk = x % num_disks) for that bucket/subtree. The downside of
> course is that if you remove a disk *everything* reshuffles, hence the
> need for some sort of guardrails to prevent a user from inadvertently
> doing that. If a disk *does* fail, you just need to make sure the disk is
> marked "out" but not removed from the CRUSH hierarchy, and the normal
> retry will kick in.
>
> Note that all this is really doing is increasing the size of the
> "buckets" that we are (pseudo)randomly distributing over. It is still a
> random/uniform distribution, but the N value is 12 times bigger (for a
> 12-disk chassis) and as a result the variance is substantially lower.
>
> I would suggest making a new bucket type called 'linear' that does a
> simple modulo, and trying this out. We will need a bunch of additional
> safety checks to help users avoid doing silly things (like adjusting the
> number of items in the linear buckets, which reshuffles everything), but
> that wouldn't be needed for an initial analysis of the performance impact.
>
> Do you mind if we shift this thread over to ceph-devel? I think there are
> lots of people who would be interested in this discussion. We can of
> course leave off your attachment if you prefer.
>
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
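
To make the 'linear' bucket Sage suggests above concrete, here is a rough
sketch of what such a bucket's choose step could look like, in the style
of the bucket choose helpers in crush/mapper.c. The struct, the function
name and the way the replica rank r is folded in are assumptions for
illustration, not code from the Ceph tree:

#include "crush/crush.h"   /* struct crush_bucket */

/* Hypothetical 'linear' bucket: pick an item with plain modulo arithmetic
 * instead of a hash, so pgs spread exactly evenly over a bucket whose
 * size is fixed by the hardware (e.g. a 12-slot chassis).  Sketch only. */
struct crush_bucket_linear {
	struct crush_bucket h;   /* h.size and h.items[] as in the other bucket types */
};

static int bucket_linear_choose(const struct crush_bucket_linear *bucket,
				int x, int r)
{
	/* x is the pg's input value and r the replica/retry rank; folding r
	 * in lets the normal retry step to the next slot when an item is
	 * rejected (e.g. marked out). */
	unsigned idx = ((unsigned)x + (unsigned)r) % bucket->h.size;

	return bucket->h.items[idx];
}

The caveats are exactly the ones Sage lists: x % size moves nearly every
pg when the size changes, so the item count has to be pinned (with safety
checks refusing to resize a linear bucket), and a failed disk has to stay
in the bucket, marked out, so the arithmetic is undisturbed while the
retry moves its pgs elsewhere.
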
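
Sage's point that a 12x larger bucket gives a substantially lower variance
can be sanity-checked with a quick back-of-the-envelope model (a sketch,
using the same independent-uniform placement assumption as the statistical
analysis he mentions): if M pgs are spread uniformly over N devices, the
pg count on one device is approximately Binomial(M, 1/N), so

\mu = \frac{M}{N}, \qquad
\sigma^2 = \frac{M}{N}\Bigl(1 - \frac{1}{N}\Bigr) \approx \mu, \qquad
\frac{\sigma}{\mu} \approx \frac{1}{\sqrt{\mu}}

At roughly 100 pgs per osd that is about a 10% relative spread, while a
12-disk chassis bucket holds roughly 1200 pgs and sees only about 3%
(a factor of 1/sqrt(12) ~= 0.29 smaller); the modulo mapping then splits
each chassis's pgs almost exactly 1/12 apiece.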