After searching the source code, I found the ceph_psim tool, which can simulate object distribution, but it seems a bit too simple.

2015-09-01 22:58 GMT+08:00 huang jun <hjwsm1989@xxxxxxxxx>:
> hi, all
>
> Recently I did some experiments on OSD data distribution.
> We set up a cluster with 72 OSDs, all 2TB SATA disks;
> the ceph version is v0.94.3 and the Linux kernel version is 3.18,
> and we set "ceph osd crush tunables optimal".
> There are 3 pools:
>
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 4096 pgp_num 4096 last_change 832
> crash_replay_interval 45 stripe_width 0
> pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
> stripe_width 0
>
> The PG num of each OSD, per pool:
>
> pool :     0     1     2   |   SUM
> ----------------------------------------
> osd.0     13   105    18   |   136
> osd.1     17   110    26   |   153
> osd.2     15   114    20   |   149
> osd.3     11   101    17   |   129
> osd.4      8   106    17   |   131
> osd.5     12   102    19   |   133
> osd.6     19   114    29   |   162
> osd.7     16   115    21   |   152
> osd.8     15   117    25   |   157
> osd.9     13   117    23   |   153
> osd.10    13   133    16   |   162
> osd.11    14   105    21   |   140
> osd.12    11    94    16   |   121
> osd.13    12   110    21   |   143
> osd.14    20   119    26   |   165
> osd.15    12   125    19   |   156
> osd.16    15   126    22   |   163
> osd.17    13   109    19   |   141
> osd.18     8   119    19   |   146
> osd.19    14   114    19   |   147
> osd.20    17   113    29   |   159
> osd.21    17   111    27   |   155
> osd.22    13   121    20   |   154
> osd.23    14    95    23   |   132
> osd.24    17   110    26   |   153
> osd.25    13   133    15   |   161
> osd.26    17   124    24   |   165
> osd.27    16   119    20   |   155
> osd.28    19   134    30   |   183
> osd.29    13   121    20   |   154
> osd.30    11    97    20   |   128
> osd.31    12   109    18   |   139
> osd.32    10   112    15   |   137
> osd.33    18   114    28   |   160
> osd.34    19   112    29   |   160
> osd.35    16   121    32   |   169
> osd.36    13   111    18   |   142
> osd.37    15   107    22   |   144
> osd.38    21   129    24   |   174
> osd.39     9   121    17   |   147
> osd.40    11   102    18   |   131
> osd.41    14   101    19   |   134
> osd.42    16   119    25   |   160
> osd.43    12   118    13   |   143
> osd.44    17   114    25   |   156
> osd.45    11   114    15   |   140
> osd.46    12   107    16   |   135
> osd.47    15   111    23   |   149
> osd.48    14   115    20   |   149
> osd.49     9    94    13   |   116
> osd.50    14   117    18   |   149
> osd.51    13   112    19   |   144
> osd.52    11   126    22   |   159
> osd.53    12   122    18   |   152
> osd.54    13   121    20   |   154
> osd.55    17   114    25   |   156
> osd.56    11   118    18   |   147
> osd.57    22   137    25   |   184
> osd.58    15   105    22   |   142
> osd.59    13   120    18   |   151
> osd.60    12   110    19   |   141
> osd.61    21   114    28   |   163
> osd.62    12    97    18   |   127
> osd.63    19   109    31   |   159
> osd.64    10   132    21   |   163
> osd.65    19   137    21   |   177
> osd.66    22   107    32   |   161
> osd.67    12   107    20   |   139
> osd.68    14   100    22   |   136
> osd.69    16   110    24   |   150
> osd.70     9   101    14   |   124
> osd.71    15   112    24   |   151
> ----------------------------------------
> SUM :   1024  8192  1536   |
>
> We can see that for pool 1 (the data pool), osd.57 and osd.65 both
> have 137 PGs while osd.12 and osd.49 have only 94 PGs, which may
> cause data distribution imbalance and reduce the space utilization
> of the cluster.
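
Side note: the per-pool table above can also be rebuilt from "ceph pg dump --format json" instead of being counted by hand. Below is a rough sketch in Python; it assumes the hammer-era JSON layout, i.e. a top-level "pg_stats" list whose entries carry "pgid" and "up" (the key names may differ in other releases), and "pg_table.py" is just a placeholder file name.

# Rough sketch: rebuild a per-pool / per-OSD PG count table (like the one
# above) from "ceph pg dump --format json".  The JSON key names used here
# ("pg_stats", "pgid", "up") are an assumption based on hammer-era output
# and may need adjusting for other versions.
import json
import sys
from collections import defaultdict

dump = json.load(sys.stdin)

# counts[pool_id][osd_id] = number of PG copies mapped to that OSD
counts = defaultdict(lambda: defaultdict(int))
for pg in dump['pg_stats']:
    pool_id = int(pg['pgid'].split('.')[0])
    for osd in pg['up']:
        counts[pool_id][osd] += 1

pools = sorted(counts)
osds = sorted(set(osd for per_pool in counts.values() for osd in per_pool))
print('pool :  ' + ''.join('%6d' % p for p in pools) + ' |   SUM')
for osd in osds:
    row = [counts[p].get(osd, 0) for p in pools]
    print('osd.%-4d' % osd + ''.join('%6d' % c for c in row) + ' | %5d' % sum(row))

Run as, e.g.: ceph pg dump --format json | python pg_table.py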
>
> Using "crushtool -i crush.raw --test --show-mappings --rule 0
> --num-rep 2 --min-x 1 --max-x %s"
> we tested different pool pg_num values:
>
> Total PG num   PG num stats
> ------------   -----------------------------
> 4096           avg: 113.777778   (avg is the average PG num per OSD)
>                total: 8192       (total is the total PG num, including replica PGs)
>                max: 139  +0.221680   (max is the max PG num on any OSD;
>                                       +0.221680 is the ratio above the average)
>                min: 113  -0.226562   (min is the min PG num on any OSD;
>                                       -0.226562 is the ratio below the average)
>
> 8192           avg: 227.555556
>                total: 16384
>                max: 267  0.173340
>                min: 226  -0.129883
>
> 16384          avg: 455.111111
>                total: 32768
>                max: 502  0.103027
>                min: 455  -0.127686
>
> 32768          avg: 910.222222
>                total: 65536
>                max: 966  0.061279
>                min: 910  -0.076050
>
> With a bigger pg_num, the gap between the maximum and the minimum
> decreases. But it is unreasonable to set such a large pg_num, since
> that would increase the OSD and MON load.
>
> Is there any way to get a more balanced PG distribution across the cluster?
> We tried "ceph osd reweight-by-pg 110 data" many times, but that could
> not resolve the problem.
>
> Another question: if we can ensure the PGs are distributed evenly, can
> we also ensure the data distribution is as even as the PG distribution?
>
> Btw, we will write data to this cluster until one or more OSDs get full;
> we set the full ratio to 0.98, and we expect the cluster to be able to
> use 0.9 of its total capacity.
>
> Any tips are welcome.
>
> --
> thanks
> huangjun

--
thanks
huangjun
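
P.S. The avg/max/min figures from the crushtool runs above can be reproduced by piping the --show-mappings output through a small tally script. A rough sketch, assuming the mapping lines look like "CRUSH rule 0 x 1 [19,34]" (the exact format may vary between crushtool versions), with "count_pgs.py" as a placeholder file name:

# Rough sketch: count how many PGs (x values) map to each OSD in the
# output of "crushtool --test --show-mappings", then print the average
# and the relative deviation of the most and least loaded OSDs.
import re
import sys
from collections import defaultdict

counts = defaultdict(int)
mapping_re = re.compile(r'rule \d+ x \d+ \[([\d,]+)\]')

for line in sys.stdin:
    m = mapping_re.search(line)
    if not m:
        continue
    for osd in m.group(1).split(','):
        counts[int(osd)] += 1

if not counts:
    sys.exit('no mappings found; check the expected line format')

avg = sum(counts.values()) / float(len(counts))
print('avg: %f  total: %d' % (avg, sum(counts.values())))
print('max: %d %+f' % (max(counts.values()), max(counts.values()) / avg - 1))
print('min: %d %+f' % (min(counts.values()), min(counts.values()) / avg - 1))

Run as, e.g.:

crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep 2 --min-x 1 --max-x 4096 | python count_pgs.py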