Re: how to improve ceph cluster capacity usage

After searching the source code, I found the ceph_psim tool, which can
simulate object distribution, but it seems a little too simple.
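
For reference, below is a minimal sketch of the kind of simulation it does:
hash objects into PGs, map each PG onto a few OSDs, and look at the spread.
It is my own toy code, with a rendezvous hash standing in for the real CRUSH
placement, so the numbers are only illustrative.

#!/usr/bin/env python
# Toy placement simulator in the spirit of ceph_psim (not the real CRUSH code):
# hash objects into PGs, map each PG onto SIZE distinct OSDs with a simple
# rendezvous hash, then report how evenly PGs and objects spread over OSDs.
import hashlib

NUM_OSDS = 72
PG_NUM = 4096
SIZE = 2            # replica count
NUM_OBJECTS = 100000

def h(*parts):
    # Stable pseudo-random 64-bit value derived from the given parts.
    data = ":".join(str(p) for p in parts).encode()
    return int(hashlib.md5(data).hexdigest()[:16], 16)

def pg_of_object(name):
    return h("obj", name) % PG_NUM

def osds_of_pg(pgid):
    # Rendezvous (highest-random-weight) hashing: score every OSD for this PG
    # and keep the top SIZE of them.
    ranked = sorted(range(NUM_OSDS), key=lambda osd: h("pg", pgid, osd), reverse=True)
    return ranked[:SIZE]

objs_per_pg = [0] * PG_NUM
for i in range(NUM_OBJECTS):
    objs_per_pg[pg_of_object("object_%d" % i)] += 1

pgs_per_osd = [0] * NUM_OSDS
objs_per_osd = [0] * NUM_OSDS
for pgid in range(PG_NUM):
    for osd in osds_of_pg(pgid):
        pgs_per_osd[osd] += 1
        objs_per_osd[osd] += objs_per_pg[pgid]

for label, counts in (("PGs", pgs_per_osd), ("objects", objs_per_osd)):
    avg = sum(counts) / float(NUM_OSDS)
    print("%s per OSD: avg %.1f  max %d (%+.3f)  min %d (%+.3f)"
          % (label, avg, max(counts), max(counts) / avg - 1,
             min(counts), min(counts) / avg - 1))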



2015-09-01 22:58 GMT+08:00 huang jun <hjwsm1989@xxxxxxxxx>:
> hi all,
>
> Recently I did some experiments on OSD data distribution.
> We set up a cluster with 72 OSDs, all 2 TB SATA disks,
> running Ceph v0.94.3 on Linux kernel 3.18,
> with "ceph osd crush tunables optimal" set.
> There are 3 pools:
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 4096 pgp_num 4096 last_change 832
> crash_replay_interval 45 stripe_width 0
> pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
> stripe_width 0
>
> The PG count on each OSD, per pool:
> pool  : 0      1      2      | SUM
> ----------------------------------------
> osd.0   13     105    18     | 136
> osd.1   17     110    26     | 153
> osd.2   15     114    20     | 149
> osd.3   11     101    17     | 129
> osd.4   8      106    17     | 131
> osd.5   12     102    19     | 133
> osd.6   19     114    29     | 162
> osd.7   16     115    21     | 152
> osd.8   15     117    25     | 157
> osd.9   13     117    23     | 153
> osd.10  13     133    16     | 162
> osd.11  14     105    21     | 140
> osd.12  11     94     16     | 121
> osd.13  12     110    21     | 143
> osd.14  20     119    26     | 165
> osd.15  12     125    19     | 156
> osd.16  15     126    22     | 163
> osd.17  13     109    19     | 141
> osd.18  8      119    19     | 146
> osd.19  14     114    19     | 147
> osd.20  17     113    29     | 159
> osd.21  17     111    27     | 155
> osd.22  13     121    20     | 154
> osd.23  14     95     23     | 132
> osd.24  17     110    26     | 153
> osd.25  13     133    15     | 161
> osd.26  17     124    24     | 165
> osd.27  16     119    20     | 155
> osd.28  19     134    30     | 183
> osd.29  13     121    20     | 154
> osd.30  11     97     20     | 128
> osd.31  12     109    18     | 139
> osd.32  10     112    15     | 137
> osd.33  18     114    28     | 160
> osd.34  19     112    29     | 160
> osd.35  16     121    32     | 169
> osd.36  13     111    18     | 142
> osd.37  15     107    22     | 144
> osd.38  21     129    24     | 174
> osd.39  9      121    17     | 147
> osd.40  11     102    18     | 131
> osd.41  14     101    19     | 134
> osd.42  16     119    25     | 160
> osd.43  12     118    13     | 143
> osd.44  17     114    25     | 156
> osd.45  11     114    15     | 140
> osd.46  12     107    16     | 135
> osd.47  15     111    23     | 149
> osd.48  14     115    20     | 149
> osd.49  9      94     13     | 116
> osd.50  14     117    18     | 149
> osd.51  13     112    19     | 144
> osd.52  11     126    22     | 159
> osd.53  12     122    18     | 152
> osd.54  13     121    20     | 154
> osd.55  17     114    25     | 156
> osd.56  11     118    18     | 147
> osd.57  22     137    25     | 184
> osd.58  15     105    22     | 142
> osd.59  13     120    18     | 151
> osd.60  12     110    19     | 141
> osd.61  21     114    28     | 163
> osd.62  12     97     18     | 127
> osd.63  19     109    31     | 159
> osd.64  10     132    21     | 163
> osd.65  19     137    21     | 177
> osd.66  22     107    32     | 161
> osd.67  12     107    20     | 139
> osd.68  14     100    22     | 136
> osd.69  16     110    24     | 150
> osd.70  9      101    14     | 124
> osd.71  15     112    24     | 151
>
> ----------------------------------------
> SUM   : 1024   8192   1536   |
>
> We can see that for pool 1 (the data pool),
> osd.57 and osd.65 both have 137 PGs while osd.12 and osd.49 have only 94,
> which may cause data distribution imbalance and reduce the space
> utilization of the cluster.
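
A per-pool, per-OSD table like the one above can be regenerated from
"ceph pg dump". A rough sketch, assuming each PG line starts with the pgid
(pool.seq) and that the first bracketed list on the line is the up set; the
column layout differs between releases, so treat this as a starting point:

#!/usr/bin/env python
# Build a per-pool, per-OSD PG count table from "ceph pg dump" on stdin:
#   ceph pg dump | python pg_table.py
# Assumes PG lines start with "<pool>.<seq>" and that the first bracketed
# list on each line is the up set (column layout varies between releases).
import re
import sys
from collections import defaultdict

pg_line = re.compile(r'^(\d+)\.[0-9a-f]+\s')
osd_set = re.compile(r'\[([\d,]+)\]')

counts = defaultdict(lambda: defaultdict(int))   # counts[osd][pool] -> PG count
pools = set()

for line in sys.stdin:
    m = pg_line.match(line)
    if not m:
        continue
    pool = int(m.group(1))
    pools.add(pool)
    s = osd_set.search(line)
    if not s:
        continue
    for osd in s.group(1).split(','):
        counts[int(osd)][pool] += 1

pools = sorted(pools)
print("pool  : " + "".join("%-7d" % p for p in pools) + "| SUM")
for osd in sorted(counts):
    row = [counts[osd][p] for p in pools]
    print("osd.%-3d " % osd + "".join("%-7d" % c for c in row) + "| %d" % sum(row))
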
>
> Use "crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep
> 2 --min-x 1 --max-x %s"
> we tested different pool pg_num:
>
> Total PG num PG num stats
> ------------ -------------------
> 4096 avg: 113.777778 (avg stands for avg PG num of every osd)
> total: 8192  (total stands for total PG num, include replica PG)
> max: 139 +0.221680 (max stands for max PG num on OSD, +0.221680 stands
> for percent above avage PG num )
> min: 113 -0.226562 (min stands for min PG num on OSD, -0.226562 stands
> for ratio below avage PG num )
>
> 8192 avg: 227.555556
> total: 16384
> max: 267 0.173340
> min: 226 -0.129883
>
> 16384 avg: 455.111111
> total: 32768
> max: 502 0.103027
> min: 455 -0.127686
>
> 32768 avg: 910.222222
> total: 65536
> max: 966 0.061279
> min: 910 -0.076050
>
> With a bigger pg_num, the gap between the maximum and the minimum shrinks,
> but it is unreasonable to set such a large pg_num, since that would increase
> the OSD and MON load.
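
The avg/max/min stats above can be produced with a small script around
crushtool. A rough sketch, assuming --show-mappings prints mapping lines of
the form "CRUSH rule 0 x 17 [57,65]" (adjust the regex if your version's
output differs):

#!/usr/bin/env python
# Summarize PG-per-OSD spread from "crushtool --test --show-mappings", e.g.:
#   crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep 2 \
#       --min-x 1 --max-x 4096 | python crush_stats.py
import re
import sys
from collections import defaultdict

pattern = re.compile(r'^CRUSH rule \d+ x \d+ \[([\d,]*)\]')
pgs_per_osd = defaultdict(int)
mappings = 0

for line in sys.stdin:
    m = pattern.match(line)
    if not m or not m.group(1):
        continue
    mappings += 1
    for osd in m.group(1).split(','):
        pgs_per_osd[int(osd)] += 1

if not pgs_per_osd:
    sys.exit("no CRUSH mapping lines found on stdin")

counts = list(pgs_per_osd.values())
avg = sum(counts) / float(len(counts))
print("osds: %d  mappings: %d  total placements: %d"
      % (len(counts), mappings, sum(counts)))
print("avg: %.2f  max: %d (%+.4f)  min: %d (%+.4f)"
      % (avg, max(counts), max(counts) / avg - 1,
         min(counts), min(counts) / avg - 1))
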
>
> Is there any way to get a more balanced PG distribution across the cluster?
> We tried "ceph osd reweight-by-pg 110 data" many times, but that did
> not resolve the problem.
>
> Another question: even if we can ensure that the PGs are distributed
> evenly, can we also ensure that the data
> is distributed as evenly as the PGs?
>
> Btw, we will write data to this cluster until one or more OSDs get
> full. We set the full ratio to 0.98,
> and we expect to be able to use 0.9 of the total capacity.
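
A back-of-the-envelope check on that 0.9 target (my own estimate, assuming
data per OSD tracks its PG count and all disks are the same size): the
fullest OSD trips the full ratio first, so average utilization tops out near
full_ratio / (1 + max deviation above the average PG count). With the +0.22
deviation seen at pg_num 4096, that is only about 0.80, so the distribution
would have to be flattened quite a bit before 0.9 is reachable.

# Back-of-the-envelope: usable fraction of raw capacity before the fullest
# OSD hits the full ratio, assuming data per OSD tracks its PG count and
# all OSDs have the same size (both assumptions, not guarantees).
def usable_fraction(full_ratio, max_pg_over_avg):
    return full_ratio / (1.0 + max_pg_over_avg)

for dev in (0.22, 0.10, 0.05):
    print("max PG deviation %+.0f%% -> average utilization tops out at ~%.2f"
          % (dev * 100, usable_fraction(0.98, dev)))
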
>
> Any tips are welcome.
>
> --
> thanks
> huangjun



-- 
thanks
huangjun