Re: revisiting uneven CRUSH distributions

On Wed, May 3, 2017 at 6:50 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>
>
> On 05/03/2017 11:35 AM, Dan van der Ster wrote:
>> On Tue, May 2, 2017 at 6:16 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>> Greg raised the following problem today: what if, as a consequence of changing the weights, the failure of a host/rack (whatever the failure domain is) makes the cluster full? For instance, if you have racks 1, 2, 3 with "effective" weights .8, 1.1, 1 and you lose half of rack 3, then rack 2 is going to get a lot more of the data than rack 1.
>>>
>>
>> Is this really a problem? In your example, the rack weights are
>> tweaked to correct the "rate" at which CRUSH is assigning PGs to each
>> rack. If you fail half of rack 3, then your effective weights will
>> continue to ensure that the moved PGs get equally assigned to racks 1
>> and 2.
>
> It should be possible to verify whether the optimization makes things worse in case of a failure, just by running a simulation of every failure scenario. If the worst scenario (i.e. the one with the most overfull OSD) before optimization is better than the worst scenario after optimization, the optimization can be discarded.
>

OK, worth verifying like you said.
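
Something like this toy check could do it (untested; weighted random
picks stand in for CRUSH, and the rack names/weights are made up to
match your example):

import random

PGS = 10000

def place(weights, seed=42):
    # Count PGs per rack under a simple weighted pick (stand-in for CRUSH).
    rng = random.Random(seed)
    racks = list(weights)
    values = [weights[r] for r in racks]
    counts = dict.fromkeys(racks, 0)
    for _ in range(PGS):
        counts[rng.choices(racks, weights=values)[0]] += 1
    return counts

def worst_fullness(weights):
    # Worst PGs-per-unit-weight over every "half of one rack fails" scenario.
    worst = 0.0
    for failed in weights:
        degraded = dict(weights)
        degraded[failed] /= 2
        counts = place(degraded)
        worst = max(worst, max(counts[r] / degraded[r] for r in degraded))
    return worst

target    = {"rack1": 1.0, "rack2": 1.0, "rack3": 1.0}  # weights from disk sizes
effective = {"rack1": 0.8, "rack2": 1.1, "rack3": 1.0}  # weights after optimization

before, after = worst_fullness(target), worst_fullness(effective)
print("worst PGs/weight before: %.0f, after: %.0f" % (before, after))
if after > before:
    print("optimization makes the worst failure case worse -> discard it")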

>> On the other hand, one problem I see with your new approach is that it
>> does not address the secondary multi-pick problem: the ratio of 1st,
>> 2nd, 3rd, etc. replicas/stripes is not equal for the lower-weighted
>> OSDs.
>
> Note that in pools with fewer than 10,000 PGs the multi-pick problem does not show up: there are too few samples, and the uneven distribution is dominated by sampling noise instead.

OK, perhaps you're right. The scenario I'm trying to consider is very
wide erasure coding -- say 8+4 or wider -- which has the same effect
as a size=12 replication pool.
I should use your simulator to provide real examples, I know.
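
In lieu of real examples, here is a toy illustration of the
sampling-noise point (weighted random picks standing in for CRUSH,
made-up weights):

import random

def worst_deviation(pgs, weights=(1, 1, 1, 1, 10), seed=1):
    # Return the worst relative deviation from the expected PG count
    # across the OSDs, for a given number of PGs (i.e. samples).
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(pgs):
        counts[rng.choices(range(len(weights)), weights=weights)[0]] += 1
    expected = [pgs * w / sum(weights) for w in weights]
    return max(abs(c - e) / e for c, e in zip(counts, expected))

for pgs in (1000, 10000, 100000):
    print(pgs, "PGs -> %.1f%% worst relative deviation"
          % (100 * worst_deviation(pgs)))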

>
> However, I think the proposed algorithm could also work by tweaking the weights of each replica (but I only just thought of it, so...):
>
>   the first pick always uses the target weights, say 1 1 1 1 10
>   the second pick starts from the target weights
>   run a simulation, then lower the weight of the most overfull item and increase the weight of the most underfull item
>   repeat until the distribution is even
>   do the same for the third pick, and so on
>
> If we do that, we get the desired property of a distribution that is stable when we change the size of the pool. The key difference with the previous approaches is that the weights are adjusted based on repeated simulations instead of maths. For every pool we know the exact placement of each PG as computed by Ceph using CRUSH.
>
> Does that make sense or am I missing something?

Sounds worth a try.
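
For my own understanding, here is a toy sketch of how I read the
proposal, for the second pick only (untested; weighted random picks in
place of CRUSH; step size and round count are arbitrary). The third
pick and beyond would be tuned the same way, conditioned on the
earlier picks:

import random

TARGET = [1.0, 1.0, 1.0, 1.0, 10.0]   # hypothetical target weights
PGS = 10000

def simulate_second_pick(second_weights, seed):
    # Draw the first pick with the target weights, then the second pick
    # (from the remaining OSDs) with the tuned weights; count second-pick
    # hits per OSD.
    rng = random.Random(seed)
    osds = list(range(len(TARGET)))
    counts = [0] * len(TARGET)
    for _ in range(PGS):
        first = rng.choices(osds, weights=TARGET)[0]
        rest = [o for o in osds if o != first]
        pick = rng.choices(rest, weights=[second_weights[o] for o in rest])[0]
        counts[pick] += 1
    return counts

def tune_second_pick(rounds=100, step=0.01):
    weights = list(TARGET)               # second pick starts from the targets
    expected = [PGS * w / sum(TARGET) for w in TARGET]
    for i in range(rounds):
        counts = simulate_second_pick(weights, seed=i)
        dev = [c - e for c, e in zip(counts, expected)]
        over, under = dev.index(max(dev)), dev.index(min(dev))
        weights[over] = max(weights[over] - step, 0.001)  # most overfull: lower
        weights[under] += step                            # most underfull: raise
    return weights

print([round(w, 3) for w in tune_second_pick()])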

Thinking (on my feet) about this a bit more, and assuming that this
iterative algorithm bears fruit, I could imagine an interface
like:

ceph osd crush reweight-by-pg <bucket> <num iterations>

Each iteration does what you described: subtract a small amount of
(crush) weight from the fullest OSD, add that (crush) weight back to
the emptiest. [1]
On a production cluster with lots of data, the operator could minimize
data movement by invoking just a small number of iterations at once.
New clusters could run, say, a million iterations to quickly find the
optimal weights.
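
In toy form, one iteration might look like this (hypothetical names,
not an actual Ceph interface):

def reweight_iteration(pg_counts, weights, delta=0.01):
    # One step: compute fullness as PGs per unit of crush weight, then
    # move delta from the fullest OSD to the emptiest. Returns a new map.
    fullness = {o: pg_counts[o] / weights[o] for o in weights}
    fullest = max(fullness, key=fullness.get)
    emptiest = min(fullness, key=fullness.get)
    new = dict(weights)
    new[fullest] = max(new[fullest] - delta, delta)  # never reach zero
    new[emptiest] += delta
    return new

# e.g. three OSDs, with OSD 0 carrying too many PGs:
print(reweight_iteration({0: 120, 1: 95, 2: 105}, {0: 1.0, 1: 1.0, 2: 1.0}))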

ceph-mgr could play a role by periodically invoking "ceph osd crush
reweight-by-pg" -- a cron of sorts -- or it could invoke that based on
some conditions related to cluster IO activity.

One other random question below [2].

Cheers, Dan

[1] implementation question: do you plan to continue storing and
displaying the original crush weight (based on disk size), or would
this optimization algorithm overwrite that with the tuned value? IOW,
will we store/display "crush weight" and "effective crush weight"
separately?

[2] implementation question: could we store and use a unique
"effective crush weight" set for each pool, rather than just once for
the whole cluster? This way, newly created pools could be balanced
perfectly using this algorithm (involving zero data movement), and
legacy pools could be left imbalanced (to be slowly optimized over
several days/weeks).
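
In toy form, the shape of that idea (none of these names exist in
Ceph; this is just an illustration, and it also shows how [1] could
keep the original crush weights visible):

CRUSH_WEIGHTS = {0: 1.0, 1: 1.0, 2: 1.0}      # from disk sizes, shown as-is
EFFECTIVE = {                                  # per-pool tuned weight sets
    42: {0: 0.93, 1: 1.05, 2: 1.02},           # new pool, balanced at creation
    # legacy pool 7 has no entry yet and stays on CRUSH_WEIGHTS for now
}

def weights_for_pool(pool_id):
    return EFFECTIVE.get(pool_id, CRUSH_WEIGHTS)

print(weights_for_pool(42))  # the tuned set
print(weights_for_pool(7))   # falls back to the original crush weights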