RE: Uneven data distribution across OSDs

Межов Игорь Александрович <megov@xxxxxxxxxx> · Tue, 22 Sep 2015 07:35:34 +0000

Hi!

It will be difficult to distribute data evenly with such a difference in disk sizes.
You can adjust the weight of the most filled-up OSDs with the command

#ceph osd reweight <osd_num> <new_weight>

where the new weight is a float in the range 0.0-1.0. When you lower the weight
of an OSD, some PGs will move from it to another location, so the cluster will
rebalance. We had the same problem: in a cluster with 1TB and 2TB disks
we ran out of space at some point, so we had to add 9x4TB
drives. After adding them, the new distribution left the
smaller (1TB) disks more than 85% full. Manually reweighting some of them to
0.8-0.9 helped us bring utilization down to safer values below 85%.
Note that manually assigned weights are not preserved if you remove
an OSD and re-add it later.
You can also read the docs on the 'ceph osd reweight-by-utilization' command
for more-or-less automatic reweighting. We don't use it, though.
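
To make the manual approach a bit more concrete, here is a rough sketch that
turns utilization figures into 'ceph osd reweight' commands. The function name
suggest_reweights and the scaling rule (target / current use) are my own
assumptions for illustration; this is not the algorithm Ceph itself applies,
and the numbers are taken from the 'ceph osd df' output quoted below.

```python
# Rough sketch: suggest override weights for OSDs above a utilization target.
# The scaling rule (target / current use) is an assumption for illustration,
# not the algorithm that Ceph itself uses.

def suggest_reweights(osds, target_pct=85.0):
    """osds: iterable of (osd_id, current_reweight, pct_used) tuples."""
    suggestions = {}
    for osd_id, reweight, pct_used in osds:
        if pct_used <= target_pct:
            continue  # already at or below the target, leave it alone
        new_weight = reweight * target_pct / pct_used
        # 'ceph osd reweight' accepts a float between 0.0 and 1.0
        suggestions[osd_id] = round(max(0.0, min(1.0, new_weight)), 2)
    return suggestions


if __name__ == "__main__":
    # (id, REWEIGHT, %USE) copied from the 'ceph osd df' output in this thread
    sample = [(84, 1.0, 87.73), (9, 1.0, 87.94), (20, 1.0, 34.69)]
    suggested = suggest_reweights(sample, target_pct=80.0)
    for osd_id, weight in sorted(suggested.items()):
        print(f"ceph osd reweight {osd_id} {weight}")
```

Review the suggested values before applying them, and change only a few OSDs
at a time, since every change triggers data movement.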

There is another issue when you use different-sized disks in a cluster:
larger disks get higher weights in the crushmap, a higher weight
maps more PGs to those OSDs, and that leads to higher load.
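
As a back-of-the-envelope illustration of that effect, the sketch below
estimates how many PG copies an OSD would hold if placement followed the CRUSH
weights exactly. The helper expected_pgs is hypothetical; the weights, pg_num
and replica count come from the 'ceph osd tree' and pool listing quoted below,
and real placement will deviate from these averages.

```python
# Sketch: expected PG copies per OSD if placement were exactly proportional
# to CRUSH weight. Real CRUSH placement is pseudo-random and will deviate.

def expected_pgs(osd_weight, total_weight, pg_num, replicas):
    """Average number of PG copies landing on one OSD of the given weight."""
    return pg_num * replicas * osd_weight / total_weight


TOTAL_WEIGHT = 295.56  # weight of 'root default' in the quoted OSD tree
PG_NUM = 4096          # pg_num of the cephfs_data pool
REPLICAS = 2           # replicated size of that pool

if __name__ == "__main__":
    for label, weight in [("1TB", 0.91), ("2TB", 1.82), ("6TB", 5.46)]:
        pgs = expected_pgs(weight, TOTAL_WEIGHT, PG_NUM, REPLICAS)
        print(f"{label} disk (weight {weight}): about {pgs:.0f} PG copies")
```

A 6TB disk is expected to serve roughly six times as many PGs as a 1TB disk,
while a single spindle does not offer six times the IOPS.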

Our 4TB disks were the most loaded disks in the cluster, almost 100% busy
all the time, and they limited cluster performance. So we took them out and
replaced the 1TB drives with 2TB ones, which more or less evened out the
weights, distribution and I/O load.

Megov Igor
CIO, Yuterra

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxx>
Sent: 21 September 2015, 22:23
To: Michael Hackett
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Uneven data distribution across OSDs

Hi Michael,

I could certainly double the total PG count, and it would probably
reduce the discrepancies somewhat, but I wonder if it would be all that
different.  I could of course be very wrong.
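
For reference, the sizing rule quoted further down in this thread,
Total PGs = (OSDs * 100) / pool_size rounded up to a power of two, can be
sketched as follows. The function name recommended_pg_count and the
pgs_per_osd parameter are made up for illustration only.

```python
# Sketch of the PG sizing rule quoted in this thread:
#   Total PGs = (OSDs * 100) / pool_size, rounded up to a power of two.

def recommended_pg_count(num_osds, pool_size, pgs_per_osd=100):
    """Return the raw formula result rounded up to the next power of two."""
    raw = num_osds * pgs_per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    # 90 OSDs, 2 replicas: the raw value is 4500, the next power of two is 8192
    print(recommended_pg_count(90, 2))
```

The current total across the four pools is 64 + 512 + 4096 + 512 = 5184 PGs,
which is above the raw value of 4500 but below the rounded-up 8192.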

ceph osd dump |grep pool output:

pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 3347 flags hashpspool
stripe_width 0
pool 2 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4288 flags
hashpspool crash_replay_interval 45 stripe_width 0
pool 3 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 512 pgp_num 512 last_change 3349 flags
hashpspool stripe_width 0

Only pool 2 has a significant amount of data in it (99.9% of the data is
there), according to ceph df:

POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd                 0           0         0        18234G            0
    data                1           0         0        18234G            0
    cephfs_data         2      92083G     29.70        18234G     24155210
    cephfs_metadata     3      60839k         0        18234G        36448

As for disk sizes, yes, there are discrepancies: we have 1TB, 2TB and 6TB
disks on various hosts (7 hosts, not 8 as I said before).  There are two
exceptions: osd.84, whose weight I reduced because it filled up, and osd.57,
which is a 5TB partition of a 6TB disk.  All others are just the three disk
sizes.  The weights were set automatically at installation.  The OSD tree:

 -1 295.55994 root default
 -6  21.84000     host scda002
  0   0.90999         osd.0               up  1.00000          1.00000
 10   0.90999         osd.10              up  1.00000          1.00000
 11   0.90999         osd.11              up  1.00000          1.00000
 12   0.90999         osd.12              up  1.00000          1.00000
 13   0.90999         osd.13              up  1.00000          1.00000
 14   0.90999         osd.14              up  1.00000          1.00000
 15   0.90999         osd.15              up  1.00000          1.00000
 16   0.90999         osd.16              up  1.00000          1.00000
 32   1.81999         osd.32              up  1.00000          1.00000
 33   1.81999         osd.33              up  1.00000          1.00000
 34   1.81999         osd.34              up  1.00000          1.00000
 35   1.81999         osd.35              up  1.00000          1.00000
 36   1.81999         osd.36              up  1.00000          1.00000
 37   1.81999         osd.37              up  1.00000          1.00000
 38   1.81999         osd.38              up  1.00000          1.00000
 39   1.81999         osd.39              up  1.00000          1.00000
 -3  29.01999     host scda006
 84   0.81000         osd.84              up  1.00000          1.00000
 85   0.90999         osd.85              up  1.00000          1.00000
 86   0.90999         osd.86              up  1.00000          1.00000
 87   0.90999         osd.87              up  1.00000          1.00000
 88   0.90999         osd.88              up  1.00000          1.00000
 89   0.90999         osd.89              up  1.00000          1.00000
 90   0.90999         osd.90              up  1.00000          1.00000
 91   0.90999         osd.91              up  1.00000          1.00000
  9   0.90999         osd.9               up  1.00000          1.00000
 17   0.90999         osd.17              up  1.00000          1.00000
 18   0.90999         osd.18              up  1.00000          1.00000
 19   0.90999         osd.19              up  1.00000          1.00000
 20   0.90999         osd.20              up  1.00000          1.00000
 21   0.90999         osd.21              up  1.00000          1.00000
 22   0.90999         osd.22              up  1.00000          1.00000
 23   0.90999         osd.23              up  1.00000          1.00000
 49   1.81999         osd.49              up  1.00000          1.00000
 50   1.81999         osd.50              up  1.00000          1.00000
 51   1.81999         osd.51              up  1.00000          1.00000
 52   1.81999         osd.52              up  1.00000          1.00000
 53   1.81999         osd.53              up  1.00000          1.00000
 54   1.81999         osd.54              up  1.00000          1.00000
 55   1.81999         osd.55              up  1.00000          1.00000
 56   1.81999         osd.56              up  1.00000          1.00000
 -2  70.98000     host scda005
 79   5.45999         osd.79              up  1.00000          1.00000
 80   5.45999         osd.80              up  1.00000          1.00000
 81   5.45999         osd.81              up  1.00000          1.00000
 82   5.45999         osd.82              up  1.00000          1.00000
 83   5.45999         osd.83              up  1.00000          1.00000
 40   5.45999         osd.40              up  1.00000          1.00000
 41   5.45999         osd.41              up  1.00000          1.00000
 42   5.45999         osd.42              up  1.00000          1.00000
 43   5.45999         osd.43              up  1.00000          1.00000
 44   5.45999         osd.44              up  1.00000          1.00000
 45   5.45999         osd.45              up  1.00000          1.00000
 46   5.45999         osd.46              up  1.00000          1.00000
 47   5.45999         osd.47              up  1.00000          1.00000
 -4  70.98000     host scda007
 74   5.45999         osd.74              up  1.00000          1.00000
 75   5.45999         osd.75              up  1.00000          1.00000
 76   5.45999         osd.76              up  1.00000          1.00000
 77   5.45999         osd.77              up  1.00000          1.00000
 78   5.45999         osd.78              up  1.00000          1.00000
  1   5.45999         osd.1               up  1.00000          1.00000
  2   5.45999         osd.2               up  1.00000          1.00000
  3   5.45999         osd.3               up  1.00000          1.00000
  4   5.45999         osd.4               up  1.00000          1.00000
  5   5.45999         osd.5               up  1.00000          1.00000
  6   5.45999         osd.6               up  1.00000          1.00000
  7   5.45999         osd.7               up  1.00000          1.00000
  8   5.45999         osd.8               up  1.00000          1.00000
 -5  81.89999     host scda008
 67   5.45999         osd.67              up  1.00000          1.00000
 68   5.45999         osd.68              up  1.00000          1.00000
 69   5.45999         osd.69              up  1.00000          1.00000
 70   5.45999         osd.70              up  1.00000          1.00000
 71   5.45999         osd.71              up  1.00000          1.00000
 72   5.45999         osd.72              up  1.00000          1.00000
 73   5.45999         osd.73              up  1.00000          1.00000
 24   5.45999         osd.24              up  1.00000          1.00000
 25   5.45999         osd.25              up  1.00000          1.00000
 26   5.45999         osd.26              up  1.00000          1.00000
 27   5.45999         osd.27              up  1.00000          1.00000
 28   5.45999         osd.28              up  1.00000          1.00000
 29   5.45999         osd.29              up  1.00000          1.00000
 30   5.45999         osd.30              up  1.00000          1.00000
 31   5.45999         osd.31              up  1.00000          1.00000
 -7   4.45999     host scda004
 57   4.45999         osd.57              up  1.00000          1.00000
 -8  16.37999     host scda011
 58   1.81999         osd.58              up  1.00000          1.00000
 59   1.81999         osd.59              up  1.00000          1.00000
 60   1.81999         osd.60              up  1.00000          1.00000
 61   1.81999         osd.61              up  1.00000          1.00000
 62   1.81999         osd.62              up  1.00000          1.00000
 63   1.81999         osd.63              up  1.00000          1.00000
 64   1.81999         osd.64              up  1.00000          1.00000
 65   1.81999         osd.65              up  1.00000          1.00000
 66   1.81999         osd.66              up  1.00000          1.00000

When I mentioned the uneven distribution of data, I was of course dividing
the number of PGs per OSD by the weight of the OSD.

I have not touched tunables, they are presumably set to the defaults at
installation.  Here is what I get for the show-tunables command:

{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 0,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 22,
    "profile": "unknown",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "require_feature_tunables3": 0,
    "has_v2_rules": 0,
    "has_v3_rules": 0,
    "has_v4_buckets": 0
}

ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

Here are the disk utilizations - ranging from 34% for osd.20 to 87% for
osd.84 and osd.9

ceph osd df

ID WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR
0 0.90999  1.00000  931G   545G  385G 58.64 0.98
10 0.90999  1.00000  931G   725G  205G 77.95 1.31
11 0.90999  1.00000  931G   432G  498G 46.47 0.78
12 0.90999  1.00000  931G   432G  498G 46.49 0.78
13 0.90999  1.00000  931G   660G  270G 70.93 1.19
14 0.90999  1.00000  931G   455G  475G 48.94 0.82
15 0.90999  1.00000  931G   660G  270G 70.93 1.19
16 0.90999  1.00000  931G   680G  250G 73.06 1.22
32 1.81999  1.00000 1862G  1352G  509G 72.65 1.22
33 1.81999  1.00000 1862G  1065G  796G 57.22 0.96
34 1.81999  1.00000 1862G  1128G  733G 60.61 1.02
35 1.81999  1.00000 1862G  1269G  592G 68.18 1.14
36 1.81999  1.00000 1862G  1398G  464G 75.08 1.26
37 1.81999  1.00000 1862G  1172G  689G 62.98 1.06
38 1.81999  1.00000 1862G  1176G  685G 63.16 1.06
39 1.81999  1.00000 1862G  1220G  641G 65.55 1.10
84 0.81000  1.00000  931G   816G  114G 87.73 1.47
85 0.90999  1.00000  931G   769G  161G 82.67 1.39
86 0.90999  1.00000  931G   704G  226G 75.63 1.27
87 0.90999  1.00000  931G   638G  292G 68.55 1.15
88 0.90999  1.00000  931G   523G  407G 56.28 0.94
89 0.90999  1.00000  931G   502G  428G 53.96 0.90
90 0.90999  1.00000  931G   729G  201G 78.33 1.31
91 0.90999  1.00000  931G   548G  383G 58.86 0.99
 9 0.90999  1.00000  931G   818G  112G 87.94 1.47
17 0.90999  1.00000  931G   479G  451G 51.50 0.86
18 0.90999  1.00000  931G   547G  383G 58.78 0.99
19 0.90999  1.00000  931G   637G  293G 68.46 1.15
20 0.90999  1.00000  931G   322G  608G 34.69 0.58
21 0.90999  1.00000  931G   523G  407G 56.20 0.94
22 0.90999  1.00000  931G   615G  315G 66.12 1.11
23 0.90999  1.00000  931G   480G  450G 51.56 0.86
49 1.81999  1.00000 1862G  1467G  394G 78.83 1.32
50 1.81999  1.00000 1862G  1198G  663G 64.38 1.08
51 1.81999  1.00000 1862G  1087G  774G 58.41 0.98
52 1.81999  1.00000 1862G  1174G  687G 63.09 1.06
53 1.81999  1.00000 1862G  1246G  615G 66.96 1.12
54 1.81999  1.00000 1862G   771G 1090G 41.43 0.69
55 1.81999  1.00000 1862G   885G  976G 47.58 0.80
56 1.81999  1.00000 1862G  1489G  373G 79.96 1.34
79 5.45999  1.00000 5588G  3441G 2146G 61.59 1.03
80 5.45999  1.00000 5588G  3427G 2160G 61.33 1.03
81 5.45999  1.00000 5588G  3607G 1980G 64.55 1.08
82 5.45999  1.00000 5588G  3311G 2276G 59.26 0.99
83 5.45999  1.00000 5588G  3295G 2292G 58.98 0.99
40 5.45999  1.00000 5587G  3548G 2038G 63.51 1.06
41 5.45999  1.00000 5587G  3471G 2115G 62.13 1.04
42 5.45999  1.00000 5587G  3540G 2046G 63.37 1.06
43 5.45999  1.00000 5587G  3356G 2230G 60.07 1.01
44 5.45999  1.00000 5587G  3113G 2473G 55.72 0.93
45 5.45999  1.00000 5587G  3426G 2160G 61.33 1.03
46 5.45999  1.00000 5587G  3136G 2451G 56.13 0.94
47 5.45999  1.00000 5587G  3222G 2364G 57.67 0.97
74 5.45999  1.00000 5588G  3536G 2051G 63.28 1.06
75 5.45999  1.00000 5588G  3672G 1915G 65.72 1.10
76 5.45999  1.00000 5588G  3784G 1803G 67.73 1.14
77 5.45999  1.00000 5588G  3652G 1935G 65.36 1.10
78 5.45999  1.00000 5588G  3291G 2297G 58.89 0.99
 1 5.45999  1.00000 5587G  3200G 2386G 57.28 0.96
 2 5.45999  1.00000 5587G  2680G 2906G 47.98 0.80
 3 5.45999  1.00000 5587G  3382G 2204G 60.54 1.01
 4 5.45999  1.00000 5587G  3095G 2491G 55.41 0.93
 5 5.45999  1.00000 5587G  3851G 1735G 68.94 1.16
 6 5.45999  1.00000 5587G  3312G 2274G 59.29 0.99
 7 5.45999  1.00000 5587G  2884G 2702G 51.63 0.87
 8 5.45999  1.00000 5587G  3407G 2179G 60.98 1.02
67 5.45999  1.00000 5587G  3452G 2134G 61.80 1.04
68 5.45999  1.00000 5587G  2780G 2806G 49.76 0.83
69 5.45999  1.00000 5587G  3337G 2249G 59.74 1.00
70 5.45999  1.00000 5587G  3578G 2008G 64.06 1.07
71 5.45999  1.00000 5587G  3358G 2228G 60.12 1.01
72 5.45999  1.00000 5587G  3021G 2565G 54.08 0.91
73 5.45999  1.00000 5587G  3160G 2426G 56.57 0.95
24 5.45999  1.00000 5587G  3085G 2501G 55.22 0.93
25 5.45999  1.00000 5587G  3495G 2091G 62.56 1.05
26 5.45999  1.00000 5587G  3141G 2445G 56.22 0.94
27 5.45999  1.00000 5587G  3897G 1689G 69.76 1.17
28 5.45999  1.00000 5587G  3243G 2343G 58.05 0.97
29 5.45999  1.00000 5587G  2907G 2679G 52.05 0.87
30 5.45999  1.00000 5587G  3788G 1798G 67.81 1.14
31 5.45999  1.00000 5587G  3289G 2297G 58.88 0.99
57 4.45999  1.00000 4563G  2824G 1738G 61.90 1.04
58 1.81999  1.00000 1862G  1267G  594G 68.09 1.14
59 1.81999  1.00000 1862G  1064G  798G 57.14 0.96
60 1.81999  1.00000 1862G  1468G  393G 78.86 1.32
61 1.81999  1.00000 1862G  1219G  642G 65.50 1.10
62 1.81999  1.00000 1862G  1175G  686G 63.13 1.06
63 1.81999  1.00000 1862G  1290G  571G 69.32 1.16
64 1.81999  1.00000 1862G  1358G  503G 72.96 1.22
65 1.81999  1.00000 1862G  1401G  460G 75.28 1.26
66 1.81999  1.00000 1862G  1309G  552G 70.31 1.18
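
A small sketch for pulling the extremes out of a listing like the one above.
It assumes the eight-column plain-text layout shown here (ID WEIGHT REWEIGHT
SIZE USE AVAIL %USE VAR); other Ceph releases may print different columns, and
the helper name parse_osd_df is hypothetical.

```python
# Sketch: find the least and most utilized OSDs in plain 'ceph osd df' output.
# Assumes the eight-column layout shown in this thread.

SAMPLE = """\
ID WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR
84 0.81000  1.00000  931G   816G  114G 87.73 1.47
 9 0.90999  1.00000  931G   818G  112G 87.94 1.47
20 0.90999  1.00000  931G   322G  608G 34.69 0.58
"""


def parse_osd_df(text):
    """Return {osd_id: pct_used} for every data row in the listing."""
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have eight columns and start with a numeric OSD id;
        # this skips the header and any blank lines.
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        usage[int(fields[0])] = float(fields[6])  # column 7 is %USE
    return usage


if __name__ == "__main__":
    usage = parse_osd_df(SAMPLE)
    fullest = max(usage, key=usage.get)
    emptiest = min(usage, key=usage.get)
    print(f"fullest:  osd.{fullest} at {usage[fullest]}%")
    print(f"emptiest: osd.{emptiest} at {usage[emptiest]}%")
```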

Thanks for the help :)

Andras

On 9/21/15, 2:55 PM, "Michael Hackett" <mhackett@xxxxxxxxxx> wrote:

>Hello Andras,
>
>Some initial observations and questions:
>
>The total PG recommendation for this cluster would actually be 8192 PGs
>per the formula.
>
>Total PG's = (90 * 100) / 2 = 4500
>
>Next power of 2 = 8192.
>
>The result should be rounded up to the nearest power of two. Rounding up
>is optional, but recommended for CRUSH to evenly balance the number of
>objects among placement groups.
>
>How many data pools are being used for storing objects?
>
>'ceph osd dump |grep pool'
>
>Also how are these 90 OSD's laid out across the 8 hosts and is there any
>discrepancy between disk sizes and weight?
>
>'ceph osd tree'
>
>Also what are you using for CRUSH tunables and what Ceph release?
>
>'ceph osd crush show-tunables'
>'ceph -v'
>
>Thanks,
>
>----- Original Message -----
>From: "Andras Pataki" <apataki@xxxxxxxxxxxxxxxxxxxx>
>To: ceph-users@xxxxxxxxxxxxxx
>Sent: Monday, September 21, 2015 2:00:29 PM
>Subject:  Uneven data distribution across OSDs
>
>Hi ceph users,
>
>I am using CephFS for file storage and I have noticed that the data gets
>distributed very unevenly across OSDs.
>
>
>    * I have about 90 OSDs across 8 hosts, and 4096 PGs for the
>cephfs_data pool with 2 replicas, which is in line with the total PG
>recommendation of "Total PGs = (OSDs * 100) / pool_size" from the docs.
>    * CephFS distributes the data pretty much evenly across the PGs as
>shown by 'ceph pg dump'
>    * However  the number of PGs assigned to various OSDs (per weight
>unit/terabyte) varies quite a lot. The fullest OSD has as many as 44 PGs
>per terabyte (weight unit), while the emptier ones have as few as 19 or
>20.
>    * Even if I consider the total number of PGs for all pools per OSD,
>the number varies similarly wildly (as with the cephfs_data pool only).
>As a result, when the whole CephFS file system is at 60% full, some of
>the OSDs already reach the 95% full condition, and no more data can be
>written to the system.
>Is there any way to force a more even distribution of PGs to OSDs? I am
>using the default crush map, with two levels (root/host). Can any changes
>to the crush map help? I would really like to get higher disk
>utilization than 60% without 1 of 90 disks filling up so early.
>
>Thanks,
>
>Andras
>
>
>_______________________________________________
>ceph-users mailing list
>ceph-users@xxxxxxxxxxxxxx
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>--

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com