Re: revisiting uneven CRUSH distributions

Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> · Mon, 1 May 2017 20:06:07 +0200

Am 01.05.2017 um 19:47 schrieb Loic Dachary:
> Hi Stefan,
> 
> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>> That sounds amazing! Is there any chance this will be backported to jewel?
> 
> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.

I've lots of them ;-)

Will sent you one via private e-mail in some minutes.

Greets,
Stefan

> Cheers
> 
>>
>> Greets,
>> Stefan
>>
>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>> Hi,
>>>
>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>> the same proportion. If an OSD is 75% full, it is expected that all
>>> other OSDs are also 75% full.
>>>
>>> In reality the distribution is even only when more than 100,000 PGs
>>> are distributed in a pool of size 1 (i.e. no replication).
>>>
>>> In small clusters there are a few thousands PGs and it is not enough
>>> to get an even distribution. Running the following with
>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>
>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>     crush analyze --replication-count 1 \
>>>                   --type device \
>>>                   --values-count $PGs \
>>>                   --rule data \
>>>                   --crushmap tests/sample-crushmap.json
>>>   done
>>>
>>> In larger clusters, even though a greater number of PGs are
>>> distributed, there are at most a few dozens devices per host and the
>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>> is not enough to get an even distribution.
>>>
>>> There is a secondary reason for the distribution to be uneven, when
>>> there is more than one replica. The second replica must be on a
>>> different device than the first replica. This conditional probability
>>> is not taken into account by CRUSH and would create an uneven
>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>> a given OSD can only handle a few hundred PGs and this conditional
>>> probability bias is dominated by the uneven distribution caused by the
>>> low number of PGs.
>>>
>>> The uneven CRUSH distributions are always caused by a low number of
>>> samples, even in large clusters. Since this noise (i.e. the difference
>>> between the desired distribution and the actual distribution) is
>>> random, it cannot be fixed by optimizations methods.  The
>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>> fails to find a gradient that would allow it to converge faster. And
>>> even if it did, the local minimum found would be as often wrong as
>>> with Nedler-Mead, only it would go faster. A least mean squares
>>> filter[5] is equally unable to suppress the noise created by the
>>> uneven distribution because no coefficients can model a random noise.
>>>
>>> With that in mind, I implemented a simple optimization algorithm[6]
>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>> like this:
>>>
>>>     - Distribute the desired number of PGs[7]
>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>     - Add the subtracted weight to the OSD that is the most under used
>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>
>>> Quoting Adam Kupczyk, this works because:
>>>
>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>    stable way.  Specifically, if we increase weight on one node, we
>>>    will get more PGs on this node and less on every other node:
>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>
>>> A nice side effect of this optimization algorithm is that it does not
>>> change the weight of the bucket containing the items being
>>> optimized. It is local to a bucket with no influence on the other
>>> parts of the crushmap (modulo the conditional probability bias).
>>>
>>> In all tests the situation improves at least by an order of
>>> magnitude. For instance when there is a 30% difference between two
>>> OSDs, it is down to less than 3% after optimization.
>>>
>>> The tests for the optimization method can be run with
>>>
>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>
>>> If anyone think of a reason why this algorithm won't work in some
>>> cases, please speak up :-)
>>>
>>> Cheers
>>>
>>> [1] python-crush http://crush.readthedocs.io/
>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html