On Tue, Sep 1, 2015 at 3:58 PM, huang jun <hjwsm1989@xxxxxxxxx> wrote:
> hi, all
>
> Recently, I did some experiments on OSD data distribution.
> We set up a cluster with 72 OSDs, all 2 TB SATA disks,
> running Ceph v0.94.3 on Linux kernel 3.18,
> with "ceph osd crush tunables optimal" set.
> There are 3 pools:
>
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>   rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>   rjenkins pg_num 4096 pgp_num 4096 last_change 832
>   crash_replay_interval 45 stripe_width 0
> pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
>   object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
>   stripe_width 0
>
> The PG count on each OSD, per pool:
>
> pool :     0     1     2  |  SUM
> ----------------------------------------
> osd.0     13   105    18  |  136
> osd.1     17   110    26  |  153
> osd.2     15   114    20  |  149
> osd.3     11   101    17  |  129
> osd.4      8   106    17  |  131
> osd.5     12   102    19  |  133
> osd.6     19   114    29  |  162
> osd.7     16   115    21  |  152
> osd.8     15   117    25  |  157
> osd.9     13   117    23  |  153
> osd.10    13   133    16  |  162
> osd.11    14   105    21  |  140
> osd.12    11    94    16  |  121
> osd.13    12   110    21  |  143
> osd.14    20   119    26  |  165
> osd.15    12   125    19  |  156
> osd.16    15   126    22  |  163
> osd.17    13   109    19  |  141
> osd.18     8   119    19  |  146
> osd.19    14   114    19  |  147
> osd.20    17   113    29  |  159
> osd.21    17   111    27  |  155
> osd.22    13   121    20  |  154
> osd.23    14    95    23  |  132
> osd.24    17   110    26  |  153
> osd.25    13   133    15  |  161
> osd.26    17   124    24  |  165
> osd.27    16   119    20  |  155
> osd.28    19   134    30  |  183
> osd.29    13   121    20  |  154
> osd.30    11    97    20  |  128
> osd.31    12   109    18  |  139
> osd.32    10   112    15  |  137
> osd.33    18   114    28  |  160
> osd.34    19   112    29  |  160
> osd.35    16   121    32  |  169
> osd.36    13   111    18  |  142
> osd.37    15   107    22  |  144
> osd.38    21   129    24  |  174
> osd.39     9   121    17  |  147
> osd.40    11   102    18  |  131
> osd.41    14   101    19  |  134
> osd.42    16   119    25  |  160
> osd.43    12   118    13  |  143
> osd.44    17   114    25  |  156
> osd.45    11   114    15  |  140
> osd.46    12   107    16  |  135
> osd.47    15   111    23  |  149
> osd.48    14   115    20  |  149
> osd.49     9    94    13  |  116
> osd.50    14   117    18  |  149
> osd.51    13   112    19  |  144
> osd.52    11   126    22  |  159
> osd.53    12   122    18  |  152
> osd.54    13   121    20  |  154
> osd.55    17   114    25  |  156
> osd.56    11   118    18  |  147
> osd.57    22   137    25  |  184
> osd.58    15   105    22  |  142
> osd.59    13   120    18  |  151
> osd.60    12   110    19  |  141
> osd.61    21   114    28  |  163
> osd.62    12    97    18  |  127
> osd.63    19   109    31  |  159
> osd.64    10   132    21  |  163
> osd.65    19   137    21  |  177
> osd.66    22   107    32  |  161
> osd.67    12   107    20  |  139
> osd.68    14   100    22  |  136
> osd.69    16   110    24  |  150
> osd.70     9   101    14  |  124
> osd.71    15   112    24  |  151
>
> ----------------------------------------
> SUM :   1024  8192  1536  |
>
> We can see that for pool 1 (the data pool),
> osd.57 and osd.65 both have 137 PGs, while osd.12 and osd.49 have only 94 PGs,
> which may cause data distribution imbalance and reduce the space
> utilization of the cluster.
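(As an aside: a per-pool, per-OSD tally like the table above can be
regenerated by parsing the PG-to-OSD mappings out of "ceph pg dump".
Here is a minimal sketch; it assumes each PG line starts with a pgid
such as "1.2f" and carries a bracketed up set such as "[57,12]". The
exact column layout differs between Ceph releases, so treat it as
illustrative rather than authoritative.)

#!/usr/bin/env python
# Tally PGs per OSD, per pool, from "ceph pg dump" output read on stdin.
# Assumes each PG line starts with a pgid like "1.2f" and contains a
# bracketed up/acting set like "[57,12]"; column layout varies by release.
import re
import sys
from collections import defaultdict

pg_line = re.compile(r'^(\d+)\.[0-9a-f]+\s')   # pgid: <pool>.<pg>
osd_set = re.compile(r'\[([\d,]+)\]')          # e.g. [57,12]

counts = defaultdict(lambda: defaultdict(int))  # counts[osd][pool]
pools = set()

for line in sys.stdin:
    m = pg_line.match(line)
    if not m:
        continue
    pool = int(m.group(1))
    pools.add(pool)
    sets = osd_set.findall(line)
    if not sets:
        continue
    # use the first bracketed set on the line (the "up" set in most releases)
    for osd in sets[0].split(','):
        counts[int(osd)][pool] += 1

pool_list = sorted(pools)
print('osd      ' + ' '.join('%5d' % p for p in pool_list) + '    SUM')
for osd in sorted(counts):
    row = [counts[osd][p] for p in pool_list]
    print('osd.%-4d ' % osd + ' '.join('%5d' % c for c in row) + '  %5d' % sum(row))

Saving that as, say, pg_tally.py, something like
"ceph pg dump pgs_brief 2>/dev/null | python pg_tally.py"
should print a table in the same shape as the one above.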
> > Use "crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep > 2 --min-x 1 --max-x %s" > we tested different pool pg_num: > > Total PG num PG num stats > ------------ ------------------- > 4096 avg: 113.777778 (avg stands for avg PG num of every osd) > total: 8192 (total stands for total PG num, include replica PG) > max: 139 +0.221680 (max stands for max PG num on OSD, +0.221680 stands > for percent above avage PG num ) > min: 113 -0.226562 (min stands for min PG num on OSD, -0.226562 stands > for ratio below avage PG num ) > > 8192 avg: 227.555556 > total: 16384 > max: 267 0.173340 > min: 226 -0.129883 > > 16384 avg: 455.111111 > total: 32768 > max: 502 0.103027 > min: 455 -0.127686 > > 32768 avg: 910.222222 > total: 65536 > max: 966 0.061279 > min: 910 -0.076050 > > With bigger pg_num, the gap between the maximum and the minimum decreased. > But it's unreasonable to set such large pg_num, which will increase > OSD and MON load. > > Is there any way to get a more balanced PG distribution of the cluster? > We tried "ceph osd reweight-by-pg 110 data" many times, but that can > not resolve the problem. The numbers you're seeing here look broadly typical to me. We've explored a few ideas but have not found anything very satisfactory for providing more even distribution. So far it's just the cost of doing business with a pseudorandom placement algorithm. (And it's common to all storage systems using this mechanism, as far as I can tell.) > > Another problem is that if we can ensure the PG is distributed > balanced, can we ensure the data > distribution is balanced like PG? We haven't found this to be an issue the same way PG distribution is. Due to how objects are placed within PGs, your PGs will all tend to be of size X or of size 2*X, and the numbers of both will be large. (You can keep it to only size X by setting pgnum to be a power of 2, but I don't think it's worth worrying about much.) -Greg > > Btw, we will write data to this cluster until one or more osd get > full, we set full ratio to 0.98, > and we expect the cluster can use 0.9 total capacity. > > Any tips are welcome. > > -- > thanks > huangjun > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html