Re: revisiting uneven CRUSH distributions

Loic Dachary <loic@xxxxxxxxxxx> · Mon, 22 May 2017 22:00:45 +0300

Hi Stefan,

On 05/22/2017 09:44 PM, Stefan Priebe - Profihost AG wrote:
> Hello Loic,
> 
> I want to optimize a crush map. What are the exact steps to achieve this?
> 
> http://crush.readthedocs.io/en/latest/
> doesn't tell me about an optimization command.

It's not published yet. I was hoping to finish it last week but ... I did something really stupid (early optimization :-). Fortunately I realized my mistake tonight while discussing the problem with a friend over a beer. Long story short: I'm optimistic about publishing something sensible in the next few days.

If you send me the ceph report of the cluster you'd like to optimize, I'll make sure it works as expected. I've also been using the ceph report you sent me last week; it has been very helpful.

Cheers

> 
> Stefan
> 
> On 15.05.2017 at 21:08, Stefan Priebe - Profihost AG wrote:
>> Hello Loic,
>>
>> sounds good but my initial question was if this shouldn't be integrated
>> in ceph-deploy - so when you add OSDs it also does the correct reweight?
>>
>> Greets,
>> Stefan
>>
>> On 14.05.2017 at 19:46, Loic Dachary wrote:
>>> Hi Stefan,
>>>
>>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>>
>>> Would that be useful for the clusters you have ?
>>>
>>> Cheers
>>>
>>> [1] http://crush.readthedocs.io/
>>>
>>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Loic,
>>>>>
>>>>> Yes, I didn't change them to straw2 as I didn't see any difference. I
>>>>> switched to straw2 now but it didn't change anything at all.
>>>>
>>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets; it is not implemented for straw buckets.
>>>>
>>>>> If I set those weights manually, do I have to adjust them on every crush change
>>>>> on the cluster? That's something I don't really like to do.
>>>>
>>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>>
>>>> Cheers
>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> On 02.05.2017 at 01:12, Loic Dachary wrote:
>>>>>> It is working, with straw2 (your cluster still is using straw).
>>>>>>
>>>>>> For instance for one host it goes from:
>>>>>>
>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>> ~name~
>>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>>
>>>>>> to
>>>>>>
>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>> ~name~
>>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>>
>>>>>> By changing the weights to
>>>>>>
>>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>>
>>>>>> You could set these weights in the crushmap directly, so there would be no need for a backport.
>>>>>>
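To put a single number on the two tables quoted above, here is a minimal sketch that computes the Kullback–Leibler divergence between the observed and expected object counts. The counts are copied from the tables; the helper name `kl_divergence` is only illustrative and is not part of python-crush.

```python
import math

def kl_divergence(observed, expected):
    """D(observed || expected) after normalizing both lists to probabilities."""
    total_obs, total_exp = sum(observed), sum(expected)
    return sum(
        (o / total_obs) * math.log((o / total_obs) / (e / total_exp))
        for o, e in zip(observed, expected)
        if o > 0
    )

# Rows in the order osd.0, osd.1, osd.2, osd.23, osd.24, osd.29, osd.36, osd.42
expected = [69, 69, 69, 69, 149, 149, 149, 149]
before = [77, 62, 69, 62, 159, 159, 132, 148]
after = [69, 68, 68, 69, 149, 149, 149, 147]

kl_before = kl_divergence(before, expected)
kl_after = kl_divergence(after, expected)
print(f"before: {kl_before:.6f}  after: {kl_after:.6f}")
```

A smaller value means the observed distribution is closer to the expected one, so the second table should score well below the first.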
>>>>>>
>>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> On 01.05.2017 at 19:47, Loic Dachary wrote:
>>>>>>>> Hi Stefan,
>>>>>>>>
>>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>>
>>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>>
>>>>>>> I've lots of them ;-)
>>>>>>>
>>>>>>> Will send you one via private e-mail in a few minutes.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 30.04.2017 at 16:15, Loic Dachary wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>>
>>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>>
>>>>>>>>>> In small clusters there are only a few thousand PGs, which is not enough
>>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>>> python-crush[1] shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>>
>>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>>                   --type device \
>>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>>                   --rule data \
>>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>>   done
>>>>>>>>>>
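The sample-size effect described above does not depend on CRUSH itself. As a rough illustration, the sketch below places PGs on 6 equally weighted devices with Python's `random` module (a stand-in for CRUSH, not the python-crush API) and reports the gap between the fullest and emptiest device relative to the expected count. The function name `spread` and the seed are arbitrary choices.

```python
import random
from collections import Counter

def spread(pgs, devices=6, seed=0):
    """Return (max - min) / expected for `pgs` uniform random placements."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(devices) for _ in range(pgs))
    per_device = [counts.get(d, 0) for d in range(devices)]
    expected = pgs / devices
    return (max(per_device) - min(per_device)) / expected

# Same PG counts as the shell loop quoted above.
results = {pgs: spread(pgs) for pgs in (1000, 10000, 100000, 1000000)}
for pgs, value in results.items():
    print(f"{pgs:>8} PGs: spread {value:.2%}")
```

The spread shrinks roughly with the square root of the number of PGs, which is why a few thousand PGs per host are not enough.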
>>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>>> distributed, there are at most a few dozen devices per host and the
>>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>>> few hundred PGs, a total of a few thousand PGs are distributed, which
>>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>>
>>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>>> low number of PGs.
>>>>>>>>>>
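The conditional probability bias mentioned above can be worked out exactly for a toy case. The sketch below assumes a simplified model: the first replica is drawn in proportion to the weights, and the second is drawn the same way among the remaining devices. Whether this matches CRUSH's retry behaviour in every detail is an assumption; see [2] for the real discussion. The weights `[3, 1, 1]` are made up for illustration.

```python
def replica_share(weights):
    """Exact share of replicas per device when two distinct devices are drawn.

    The first device is drawn in proportion to its weight; the second is drawn
    the same way among the remaining devices (i.e. retry until different).
    """
    total = sum(weights)
    p = [w / total for w in weights]
    shares = []
    for i, p_i in enumerate(p):
        # Probability that device i is picked second, after some other device k.
        second = sum(p_k * p_i / (1 - p_k) for k, p_k in enumerate(p) if k != i)
        shares.append((p_i + second) / 2)
    return p, shares

target, actual = replica_share([3, 1, 1])
for t, a in zip(target, actual):
    print(f"target {t:.3f}  actual {a:.3f}")
```

In this model the heavy device should hold 60% of the replicas but ends up with 45%, because it cannot be chosen twice for the same PG.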
>>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>>> random, it cannot be fixed by optimization methods.  The
>>>>>>>>>> Nelder-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>>> with Nelder-Mead, only it would go faster. A least mean squares
>>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>>
>>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>>> like this:
>>>>>>>>>>
>>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>>
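The four steps quoted above can be sketched as follows. This is not the code referenced in [6]: placement uses a straw2-style draw (largest ln(u) / weight wins) with SHA-256 standing in for CRUSH's hash, the target weights and PG count are hypothetical, and remembering the best weights seen so far is an addition of this sketch.

```python
import hashlib
import math

def log_draws(pgs, items):
    """Precompute ln(u) for every (pg, item) pair, u being a hash mapped to (0, 1]."""
    table = []
    for pg in range(pgs):
        row = []
        for item in range(items):
            digest = hashlib.sha256(f"{pg}:{item}".encode()).digest()
            u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
            row.append(math.log(u))
        table.append(row)
    return table

def distribute(draws, weights):
    """Place each PG on the item with the largest ln(u) / weight (straw2-style)."""
    counts = [0] * len(weights)
    for row in draws:
        winner = max(range(len(weights)), key=lambda i: row[i] / weights[i])
        counts[winner] += 1
    return counts

def kl_divergence(counts, target):
    """D(actual || desired), both normalized to probabilities."""
    total, wsum = sum(counts), sum(target)
    return sum((c / total) * math.log((c / total) / (w / wsum))
               for c, w in zip(counts, target) if c > 0)

def optimize(target, pgs, step=0.01, threshold=1e-4, max_rounds=200):
    draws = log_draws(pgs, len(target))
    weights = list(target)
    best_kl, best_weights = math.inf, list(weights)
    for _ in range(max_rounds):
        counts = distribute(draws, weights)
        kl = kl_divergence(counts, target)
        if kl < best_kl:
            best_kl, best_weights = kl, list(weights)
        if kl <= threshold:
            break
        # Usage relative to the share each item should get from its target weight.
        usage = [c / w for c, w in zip(counts, target)]
        over, under = usage.index(max(usage)), usage.index(min(usage))
        moved = weights[over] * step  # 1% of the most over used item's weight
        weights[over] -= moved
        weights[under] += moved
    return best_weights, best_kl

target = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]  # hypothetical weights
initial_kl = kl_divergence(distribute(log_draws(872, len(target)), target), target)
weights, final_kl = optimize(target, pgs=872)
print(f"KL before: {initial_kl:.6f}  after: {final_kl:.6f}")
print([round(w, 4) for w in weights])
```

Because weight is only moved between two items of the same bucket, the sum of the weights stays the same from one round to the next.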
>>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>>
>>>>>>>>>>   "...CRUSH is not random process at all, it behaves in numerically
>>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>>
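The behaviour described in the quote can be checked on a toy model. The sketch below uses a straw2-style draw with SHA-256 as a stand-in hash (an assumption, not CRUSH's own hash or code) and the same weights as in the quote. With this kind of draw, raising one item's weight can only pull PGs towards that item, so no PG moves between the other items.

```python
import hashlib
import math

def place(pg, weights):
    """Straw2-style pick: the item with the largest ln(u) / weight wins."""
    def draw(item):
        digest = hashlib.sha256(f"{pg}:{item}".encode()).digest()
        u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
        return math.log(u) / weights[item]
    return max(range(len(weights)), key=draw)

base = [10, 10, 10, 5, 5]
bumped = [10.1, 10, 10, 5, 5]  # same weights as in the quote

moves = [(place(pg, base), place(pg, bumped)) for pg in range(10000)]
changed = [(old, new) for old, new in moves if old != new]
print(f"{len(changed)} of {len(moves)} PGs moved")
print("all moves go to item 0:", all(new == 0 for _, new in changed))
```

This holds for straw2-style buckets; the older straw buckets do not give the same guarantee.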
>>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>>
>>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>>
>>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>>
>>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>>
>>>>>>>>>> If anyone thinks of a reason why this algorithm won't work in some
>>>>>>>>>> cases, please speak up :-)
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>>> [3] Nelder-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre