announcing bc-ceph-reweight-by-utilization.py and Re: Calculating the expected PGs distribution

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 05/10/17 08:45, Loic Dachary wrote:
>
> On 05/10/2017 05:31 AM, Xavier Villaneau wrote:
>> On 05/09/2017 at 12:23 PM, Loic Dachary wrote:
>>
>>> Here is what I have so far:
>>>
>>> For N replicas, the weight of the item i (Wi) must be capped to 1/N if
>>> Wi > 1/N.
>>>
>>> The difference (Wi - 1 / N) is the remainder Ri and it must
>>> be distributed evenly among the remaining items.
>>>
>>> There can be at most M = N - 1 overweight items.
>>>
>>> R = sum of all remainder Ri
>>>
>>> for each overweight item, distribute the remainder to all items that are not overweight and that are not already at the maximum weight in proportion of their weight Wj += R * Wj * M
>>>
>>> as a result of the redistribution an item that was not overweight could become overweight and the same process must be repeated until there are no more overweight items
>>>
>>> I realize this is a bit naïve...
>> Hello Loïc,
>>
>> Sorry if I missed some previous discussions where the context of this was explained, but I'm not sure I understand the goal of that calculation. Are you trying to predict a distribution from a set of weights, or trying to find how to rectify a distribution to make it closer to some target weights?
> I was trying to rectify a distribution that has overweight items.
>> If you're trying to predict a distribution, then there is actually no such thing as an "overweighted", the distribution are what they are and the same calculations apply regardless of if the weights are "unreachable" or not. If you want, I can give a detail of what I did for the 5-1-1-1-1 case (and maybe another "regular" case); it's just a question of adding up the right probabilities in the right order.
> Understood.
>
>> If the goal is to achieve a target distribution, then it's a bit more tricky. This would require the algorithm to:
>> 1. Identify cases where a given target distribution has "overweighted" items
>> 2. If that's the case, identify the limit (i.e. "best" distribution possible) and use it to decide on a new target distribution that's within the space of reachable distributions
>> 3. From there, a "regular" approach could apply.
>> Maybe there are other approaches to that issue too.
>>
>> Just one last question: Is there a requirement to solve this problem with inexpensive calculations? Otherwise, couldn't calls to libcrush be used to get an estimate of the distribution or to do trial-and-error re-balancing?
>>
>> Sorry if I'm asking about stuff that was answered before; I'm lacking the context of this talk.
> You're not actually: this was not discussed in any context so far (unless I missed something in the distant past which is possible ;-).
>
> The problem with 5 1 1 1 1 is that the admin who set this weights expects a distribution that cannot happen. The 5TB disk will always be underfull and the 1TB disks will always be overfull, even if the probability bias is fixed. I was hoping that we could find a way to set the weights so that an even distribution is achieved and the sysadmin does not have to know about that corner case. It does not need to be inexpensive.
>
> Thinking a bit more about it, there does not seem to be any way to get an even distribution. It's not a matter of tweaking the weights, as you explained in your previous mail even with an infinite weight the overweight item will still miss the desired number of PGs. The best we can do is to lower the weight of the overweight item to get a set of weights that can lead to an even distribution. We cannot fix the fact that the overweight item will be underfilled. But by fixing the weights we can fix the fact that the other items are overfilled which is what causes problems to the sysadmin.
>
> In our example (with 2 replicas)
>
> ( 5 + 1 + 1 + 1 + 1 ) / 2 = 4.5 therefore all items with a weight > 4.5 are overweight
>
> we remove the overweight items and sum the weight of the remaining items:
>
> ( 1 + 1 + 1 + 1 ) = 4
>
> and we divide by the number of replicas minus the number of overweight items
>
> 4 / ( 2 - 1 ) = 4
>
> and we set the weight of the overweight item to this number
>
> ( 4 + 1 + 1 + 1 + 1 ) / 2 = 4 therefore all items are <= maximum weight
>
> With three replicas:
>
> ( 7 + 7 + 1 + 1 + 1 + 1 + 1 + 1 ) / 3 = 6.66
>
> we set the weight to ( 1 + 1 + 1 + 1 + 1 + 1 ) /  ( 3 - 2 ) = 6 
>
> ( 6 + 6 + 1 + 1 + 1 + 1 + 1 + 1 ) / 3 = 6
>
> With four replicas:
>
> (7 + 7 + 7 + 3 + 3) / 4 = 6.75
>
> we set the weight to ( 3 + 3 ) / ( 4 - 3 ) = 6
>
> (6 + 6 + 6 + 3 + 3) / 4 = 6
>
> With four replicas again:
>
> (5 + 5 + 3 + 3 + 3) / 4 = 4.75
>
> we set the weight to ( 3 + 3 + 3 ) / ( 4 - 2 ) = 4.5
>
> (4.5 + 4.5 + 3 + 3 + 3) / 4 = 4.5
>
> Does that make sense ?
>

I have been reading your thread and not really following too well... I
thought maybe you were reimplementing the CRUSH algorithm so you can
calculate the actual result of your weight changes. But I am not sure if
you are saying it uses CRUSH (which would be great and could be combined
with my ideas below) or just simulates it (which I think will fail just
like all my attempts at calculating).

But seems at this point you haven't firmly chosen some particular
strategy, so I wanted to add my story and strategy...

First I tried `ceph osd reweight-by-utilization` and had bad results.

Here's before I tried it (let's call this df1):

> root@ceph3:~ # ceph osd df | sort -k7n
> ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
> 21  2.00029  1.00000  1862G   530G  1331G 28.51 0.64  48
> 17  2.00029  1.00000  1862G   553G  1308G 29.72 0.67  48
> 14  2.00029  1.00000  1862G   597G  1264G 32.09 0.72  48
> 15  2.00029  1.00000  1862G   597G  1264G 32.09 0.72  45
>  9  2.00029  1.00000  1862G   594G  1267G 31.94 0.72  49
>  8  4.00069  1.00000  3724G  1238G  2485G 33.26 0.75 105
>  2  4.00069  1.00000  3724G  1340G  2383G 36.00 0.81  97
> 10  2.00029  1.00000  1862G   672G  1189G 36.10 0.81  55
>  6  4.00069  1.00000  3724G  1442G  2281G 38.73 0.87 113
> 19  2.00029  1.00000  1862G   749G  1112G 40.25 0.90  59
> 24  2.00029  1.00000  1862G   782G  1080G 42.00 0.94  45
>  5  4.00069  1.00000  3724G  1601G  2123G 42.99 0.96 109
>  4  4.00069  1.00000  3724G  1662G  2061G 44.65 1.00 120
> 25  2.00029  1.00000  1862G   843G  1018G 45.29 1.02  49
>  0  4.00069  1.00000  3724G  1727G  1997G 46.37 1.04 115
> 13  2.00029  1.00000  1862G   893G   968G 48.00 1.08  55
> 12  2.00029  1.00000  1862G   903G   958G 48.51 1.09  59
> 11  2.00029  1.00000  1862G   908G   953G 48.81 1.10  63
> 16  2.00029  1.00000  1862G   918G   944G 49.30 1.11  60
>  7  4.00069  1.00000  3724G  1874G  1849G 50.33 1.13 128
> 26  2.00029  1.00000  1862G   943G   918G 50.65 1.14  50
> 18  2.00029  1.00000  1862G   975G   886G 52.40 1.18  60
> 23  2.00029  1.00000  1862G   978G   883G 52.53 1.18  58
> 22  2.00029  1.00000  1862G  1044G   817G 56.12 1.26  75
> 20  2.00029  1.00000  1862G  1053G   808G 56.59 1.27  69
>  3  4.00069  1.00000  3724G  2190G  1533G 58.82 1.32 125
>  1  4.00069  1.00000  3724G  2257G  1466G 60.62 1.36 141
>                TOTAL 67035G 29876G 37159G 44.57                        
> MIN/MAX VAR: 0.64/1.36  STDDEV: 9.21
And after rerunning it for something like 2 days (with a script that ran
it, waited until HEALTH_OK, ran again ,etc.), it was barely any better
(df2):

> root@ceph3:~ # ceph osd df | sort -k7n
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
> 21 2.00029  1.00000  1862G   534G  1327G 28.73 0.64  48
> 17 2.00029  1.00000  1862G   557G  1304G 29.94 0.66  48
>  9 2.00029  1.00000  1862G   598G  1263G 32.16 0.71  50
> 15 2.00029  1.00000  1862G   602G  1259G 32.37 0.72  45
> 14 2.00029  1.00000  1862G   603G  1258G 32.40 0.72  48
>  8 4.00069  1.00000  3724G  1250G  2473G 33.59 0.75 105
> 10 2.00029  1.00000  1862G   690G  1172G 37.06 0.82  58
>  2 4.00069  1.00000  3724G  1422G  2301G 38.19 0.85  98
>  6 4.00069  1.00000  3724G  1457G  2266G 39.13 0.87 113
> 19 2.00029  1.00000  1862G   788G  1073G 42.36 0.94  62
> 24 2.00029  1.00000  1862G   803G  1058G 43.16 0.96  47
>  5 4.00069  1.00000  3724G  1631G  2092G 43.81 0.97 110
> 13 2.00029  0.99001  1862G   837G  1024G 45.00 1.00  54
> 25 2.00029  1.00000  1862G   853G  1008G 45.83 1.02  49
>  4 4.00069  0.99001  3724G  1725G  1998G 46.34 1.03 119
>  0 4.00069  0.99001  3724G  1759G  1964G 47.24 1.05 116
> 11 2.00029  0.99001  1862G   918G   943G 49.30 1.09  63
> 18 2.00029  0.99001  1862G   930G   931G 49.96 1.11  60
> 26 2.00029  0.99001  1862G   946G   916G 50.80 1.13  50
> 16 2.00029  0.99001  1862G   950G   911G 51.05 1.13  62
>  7 4.00069  0.99001  3724G  1905G  1818G 51.18 1.14 129
> 12 2.00029  0.99001  1862G   991G   870G 53.27 1.18  61
> 23 2.00029  0.99001  1862G  1001G   860G 53.79 1.19  59
> 22 2.00029  0.95001  1862G  1033G   828G 55.50 1.23  72
>  3 4.00069  0.93002  3724G  2144G  1579G 57.59 1.28 122
>  1 4.00069  0.91003  3724G  2163G  1561G 58.08 1.29 133
> 20 2.00029  0.93002  1862G  1114G   747G 59.84 1.33  67
>               TOTAL 67035G 30218G 36816G 45.08        
> MIN/MAX VAR: 0.64/1.33  STDDEV: 9.04

And worst was that I could see some osds get more data at one time (see
how comparing the above, osd.20 which was already too full at 56.59% is
now fuller at 59.84%, also osd.23), and get less data another run,
re-moving possibly the same data over and over. And each move was a few
% of the data, so it was quite a lot overall. I didn't record the %
misplaced numbers, but I remember them like 3-7% and some 1.5% maybe. So
this is why I decided this strategy could never be optimal.

So then I tried calculating and estimating something like you describe
here, but far simpler and likely more wrong, but it never acts like I
predict. Like if I have an osd with 80 pgs and I want it to have 79 pgs,
if I multiply the weight by 1/80, it is random whether it moves 0 pgs,
or even up to 4 or 5 sometimes (based on ceph osd df it looks random,
but CRUSH is not random). It didn't work well, taking more of my
attention, but was far faster than ceph osd reweight-by-utilization. And
here's what I ended up with before moving on to another idea (df3):

> root@ceph3:~ # ceph osd df | sort -k7n
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
>  3 4.00069  0.50418  3724G  1835G  1888G 49.30 0.79  93
> 23 2.00029  0.71783  1862G   935G   926G 50.22 0.81  55
>  1 4.00069  0.57999  3724G  1882G  1841G 50.55 0.81  96
> 11 2.00029  0.59999  1862G   995G   866G 53.45 0.86  54
> 17 2.00029  1.00000  1862G  1006G   855G 54.06 0.87  61
>  7 4.00069  0.79726  3724G  2059G  1664G 55.31 0.89 120
>  9 2.00029  0.99001  1862G  1113G   748G 59.82 0.96  78
>  4 4.00069  0.71181  3724G  2254G  1469G 60.55 0.97 122
>  5 4.00069  0.81999  3724G  2295G  1429G 61.63 0.99 123
>  8 4.00069  1.00000  3724G  2310G  1413G 62.04 1.00 147
> 14 2.00029  1.00000  1862G  1169G   692G 62.83 1.01  68
> 16 2.00029  0.90533  1862G  1194G   667G 64.13 1.03  72
>  6 4.00069  0.97174  3724G  2394G  1329G 64.30 1.03 147
> 22 2.00029  0.74290  1862G  1198G   663G 64.35 1.03  68
> 10 2.00029  0.79999  1862G  1204G   657G 64.68 1.04  66
>  2 4.00069  1.00000  3724G  2410G  1313G 64.73 1.04 140
> 25 2.00029  0.85004  1862G  1213G   648G 65.17 1.05  60
> 20 2.00029  0.89999  1862G  1222G   639G 65.64 1.05  81
> 21 2.00029  1.00000  1862G  1228G   633G 65.99 1.06  73
>  0 4.00069  0.81006  3724G  2464G  1259G 66.18 1.06 135
> 15 2.00029  1.00000  1862G  1239G   622G 66.56 1.07  67
> 24 2.00029  0.94699  1862G  1263G   598G 67.85 1.09  64
> 18 2.00029  0.76999  1862G  1304G   557G 70.06 1.12  77
> 26 2.00029  0.87448  1862G  1307G   554G 70.22 1.13  67
> 13 2.00029  0.90002  1862G  1318G   543G 70.82 1.14  72
> 12 2.00029  0.96999  1862G  1436G   425G 77.15 1.24  78
> 19 2.00029  0.92999  1862G  1488G   373G 79.94 1.28  84
>               TOTAL 67035G 41749G 25286G 62.28
> MIN/MAX VAR: 0.79/1.28  STDDEV: 7.25


So I decided maybe I'll just use CRUSH and not bother implementing it
myself, but just let ceph do the normal peering. So I thought I could
just look at the pg dump and the up vs acting OSDs, and total up the
bytes in the pgs to estimate the space used now (acting) and in the
future (up) by each osd (doesn't include filesystem overhead, leveldb,
xattrs...don't know what else). And then I could change the weights all
I want until I'm happy with the future result, then let it recover once,
not over and over repeating the same data moving often.

So I wrote a script and use it this way:
- optionally set recovery to low priority
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
    osd recover max single start = 1
    osd op threads = ...
    # I don't recommend this one...it slows it lots but doesn't seem to
prevent blocked requests, unlike deep scrub sleep which is very significant.
    # osd recovery sleep = ...
- optionally set   ceph osd set norecover   so it will do peering, but
not recovery... highly recommended for the first balance run, or
testing. (for testing, also save output of ceph osd df if you want to
undo the reweights).
- optionally but recommended to set   ceph osd set noscrub ; ceph osd
set nodeep-scrub
- ./bc-ceph-reweight-by-utilization.py -al  (get the script at 
https://github.com/petermaloney/misc/blob/master/ceph/bc-ceph-reweight-by-utilization.py
)
- wait until it either gets stuck (lowest becomes highest, then lowest
again, etc. because the pg it moves is too large to be possible to hit
the target balance), or is at your target (set by -o which is default
1.03), which I guess takes about 15min for a badly balanced cluster
- check the misplaced % in ceph -s output (for me I think it was around
15% to go from df3 to df4)
- if you set norecover, unset it now
- when done, unset whatever flags you set

or just run the script always and forget the rest if you know how your
cluster behaves with it already, including if an osd fails which I
didn't test.

pros:
- move way less data overall
- and be able to see the future result right away (but slightly
imprecise...does not include filesystem overhead)
- has an option -l to do a loop combined with -a so you don't have to
script something to rerun it until it's done... it looks at cluster
health and things like that for you
cons:
- move all data at the same time, not "gently" ... but I don't consider
this very important since the gentlest you can do (in theory 1 whole pg
at a time) still takes a long time all at once, and you can set conf to
slow it down.
- due to doing peering so often, it would in theory make the cluster
store many more osdmap epochs that might add some overhead to the
cluster and clients... but I don't notice any effect.
- probably doesn't handle some cases like osds that went down during the
loop
neutral:
- probably have to rerun it after adding or removing new disks to get
balance to be right after recovery, and each time it would add more data
movement than without balancing
- because it uses reweight, remember that when an osd is set out, it
resets the weight to 1. kraken 11.2.0 should no longer do this "mon:
preserve osd weight when marking osd out, then in (pr#11293, Sage Weil)"
bugs:
- when ceph osd df has "-nan" in the output (such as with an osd that is
in the crush map but doesn't really exist), the json output is invalid
so the python json.loads(...) function throws an exception

after recovery, here is what it looked like (df4):

> root@ceph3:~ # ceph osd df | sort -k7n
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
>  9 2.00029  1.00000  1862G  1116G   745G 59.98 0.96  77
> 10 2.00029  0.67000  1862G  1118G   743G 60.06 0.96  63
> 17 2.00029  1.00000  1862G  1132G   729G 60.80 0.97  67
>  8 4.00069  1.00000  3724G  2267G  1456G 60.90 0.98 140
> 14 2.00029  0.84999  1862G  1139G   722G 61.18 0.98  65
> 18 2.00029  0.69249  1862G  1141G   720G 61.30 0.98  67
> 15 2.00029  0.83749  1862G  1142G   719G 61.35 0.98  62
>  4 4.00069  0.71181  3724G  2286G  1437G 61.41 0.98 125
>  5 4.00069  0.79999  3724G  2288G  1436G 61.44 0.98 124
> 20 2.00029  0.89999  1862G  1152G   709G 61.88 0.99  78
>  7 4.00069  0.90250  3724G  2325G  1398G 62.44 1.00 136
>  0 4.00069  0.73000  3724G  2330G  1393G 62.59 1.00 125
> 16 2.00029  0.87000  1862G  1166G   695G 62.64 1.00  77
> 23 2.00029  0.85950  1862G  1170G   691G 62.87 1.01  64
>  3 4.00069  0.64249  3724G  2343G  1380G 62.93 1.01 116
>  1 4.00069  0.68449  3724G  2344G  1379G 62.95 1.01 122
> 21 2.00029  0.87849  1862G  1173G   688G 63.04 1.01  70
> 11 2.00029  0.75999  1862G  1174G   687G 63.07 1.01  67
> 13 2.00029  0.81000  1862G  1174G   687G 63.09 1.01  67
>  6 4.00069  0.97174  3724G  2349G  1374G 63.10 1.01 143
> 12 2.00029  0.79500  1862G  1180G   681G 63.40 1.02  71
>  2 4.00069  0.87949  3724G  2361G  1362G 63.41 1.02 129
> 24 2.00029  0.92000  1862G  1187G   674G 63.75 1.02  60
> 19 2.00029  0.80849  1862G  1188G   673G 63.83 1.02  70
> 26 2.00029  0.74550  1862G  1188G   673G 63.83 1.02  56
> 25 2.00029  0.84250  1862G  1193G   668G 64.10 1.03  59
> 22 2.00029  0.70999  1862G  1201G   661G 64.50 1.03  68
>               TOTAL 67035G 41841G 25194G 62.42        
> MIN/MAX VAR: 0.96/1.03  STDDEV: 1.22
>
My largest PGs were about 120GB, so if you look at the lowest used 2TB
disk above (1116G used) and add 120GB (1230 GB), you'll see it is
probably worse (compared to osd.22 with only 1201 GB, and the fullest
osd drives the total free space, so we would rather the highest be lower
than the lowest be more average), so this is very close to the best I
can do without adding more pgs.


And today (at least 3 weeks later), after adding disks and data, it
stays balanced easily, with short reruns of the script. And my largest
pg is now 102GB, so it is slightly better balanced (df5).

> root@ceph3:~ # ceph osd df | sort -k7n
> ID WEIGHT  REWEIGHT SIZE  USE    AVAIL  %USE  VAR  PGS
> 15 2.00029  0.45050 1862G  1102G   759G 59.20 0.96  38
>  9 2.00029  0.94865 1862G  1123G   738G 60.32 0.98  49
>  3 4.00069  0.44760 3724G  2247G  1476G 60.36 0.98  72
> 12 2.00029  0.57643 1862G  1128G   733G 60.60 0.98  41
> 31 6.00110  0.68811 5587G  3390G  2197G 60.68 0.98 111
>  6 4.00069  0.87955 3724G  2261G  1463G 60.71 0.98  95
> 20 2.00029  0.34117 1862G  1135G   726G 61.00 0.99  27
> 35 6.00110  0.76208 5587G  3419G  2167G 61.21 0.99 131
> 27 4.00069  0.62904 3724G  2280G  1443G 61.23 0.99  88
>  7 4.00069  0.75883 3724G  2283G  1440G 61.33 0.99  83
> 18 2.00029  0.67392 1862G  1142G   719G 61.35 0.99  43
> 16 2.00029  0.67917 1862G  1142G   719G 61.36 0.99  46
>  2 4.00069  0.62636 3724G  2291G  1432G 61.53 1.00  70
> 32 6.00110  0.62749 5587G  3441G  2145G 61.60 1.00 110
> 25 2.00029  0.73212 1862G  1148G   713G 61.66 1.00  36
>  1 4.00069  0.49037 3724G  2302G  1421G 61.82 1.00  74
>  5 4.00069  0.62141 3724G  2304G  1419G 61.89 1.00  75
> 30 6.00110  0.53374 5587G  3459G  2127G 61.92 1.00 104
> 21 2.00029  0.63028 1862G  1153G   708G 61.94 1.00  37
> 17 2.00029  0.95152 1862G  1157G   705G 62.14 1.00  45
> 14 2.00029  0.78682 1862G  1157G   704G 62.16 1.01  39
> 23 2.00029  0.94708 1862G  1158G   703G 62.21 1.01  43
> 19 2.00029  0.38724 1862G  1158G   703G 62.22 1.01  35
>  4 4.00069  0.61388 3724G  2317G  1407G 62.22 1.01  88
> 26 2.00029  0.62029 1862G  1159G   702G 62.29 1.01  33
> 10 2.00029  0.68456 1862G  1162G   699G 62.43 1.01  50
> 13 2.00029  0.67427 1862G  1164G   697G 62.52 1.01  48
> 29 4.00069  0.66026 3724G  2328G  1395G 62.52 1.01  72
> 33 6.00110  0.75240 5587G  3495G  2091G 62.57 1.01 125
>  8 4.00069  0.88206 3724G  2332G  1391G 62.64 1.01  91
> 28 4.00069  0.72832 3724G  2333G  1390G 62.67 1.01  85
>  0 4.00069  0.52428 3724G  2340G  1383G 62.85 1.02  72
> 34 6.00110  0.69791 5587G  3513G  2073G 62.89 1.02 109
> 22 2.00029  0.74596 1862G  1171G   690G 62.91 1.02  42
> 11 2.00029  0.60620 1862G  1185G   677G 63.64 1.03  36
> 24 2.00029  0.56110 1862G  1192G   669G 64.04 1.04  25
>               TOTAL  109T 69088G 42641G 61.84         
> MIN/MAX VAR: 0.96/1.04  STDDEV: 0.92

-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
Internet: http://www.brockmann-consult.de
--------------------------------------------

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux