Hi!

It will be difficult to distribute data evenly with such a difference in disk sizes. You can adjust the weight of the most filled-up OSDs with the command

    # ceph osd reweight <osd_num> <new_weight>

where the new weight is a float in the range 0.0-1.0. When you lower the weight of an OSD, some PGs will move from it to other locations, so the cluster will rebalance.

We had the same problem: in a cluster with 1 TB and 2 TB disks we ran out of space at some point, so we had to add nine 4 TB drives. After adding them, the new distribution pushed some of the smaller (1 TB) disks above 85% full. Manually reweighting some of them to 0.8-0.9 helped us bring their utilization back below a safer 85%. Note that manually assigned weights are not preserved if you remove an OSD and re-add it later. You can also read the docs on the 'ceph osd reweight-by-utilization' command for more-or-less automatic reweighting; we don't use it, though.

There is another issue when you use different-sized disks in a cluster: larger disks get higher weights in the crushmap, a higher weight leads to more PGs mapped to those OSDs, and more PGs lead to higher load. Our 4 TB disks were the most loaded disks in the cluster, almost 100% busy at all times, and they limited cluster performance. So we took them out and exchanged the 1 TB drives for 2 TB ones, which more or less flattened the weights, the distribution and the I/O load.

Megov Igor
CIO, Yuterra
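(A minimal sketch of the reweighting described above, using osd.84, the fullest OSD in the quoted output below, as an illustrative target; the 0.90 weight and the 120 threshold are example values, not recommendations:)

    # Temporarily down-weight an over-full OSD (range 0.0-1.0, 1.0 = unchanged):
    ceph osd reweight 84 0.90

    # Check the effect on per-OSD utilization once backfilling settles:
    ceph osd df

    # Or let Ceph pick candidates itself, reweighting OSDs whose utilization
    # exceeds the given percentage of the cluster average (120 is the default):
    ceph osd reweight-by-utilization 120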
________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxx>
Sent: 21 September 2015 22:23
To: Michael Hackett
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Uneven data distribution across OSDs

Hi Michael,

I could certainly double the total PG count, and it probably would reduce the discrepancies somewhat, but I wonder if it would be all that different. I could of course be very wrong.

ceph osd dump |grep pool output:

pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 3347 flags hashpspool stripe_width 0
pool 2 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4288 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 3 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 3349 flags hashpspool stripe_width 0

Only pool 2 has a significant amount of data in it (99.9% of the data is there). From ceph df:

POOLS:
    NAME              ID   USED     %USED   MAX AVAIL   OBJECTS
    rbd               0    0        0       18234G      0
    data              1    0        0       18234G      0
    cephfs_data       2    92083G   29.70   18234G      24155210
    cephfs_metadata   3    60839k   0       18234G      36448

As for disk sizes, yes, there are discrepancies: we have 1TB, 2TB and 6TB disks on various hosts (7 hosts, not 8 as I said before). There are two exceptions (osd.84, whose weight I reduced because it filled up, and osd.57, which is a 5TB partition of a 6TB disk); all the others are just the three disk sizes. The weights were set automatically at installation, in line with the sizes.

The OSD tree:

ID  WEIGHT     TYPE NAME         UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-1  295.55994  root default
-6   21.84000      host scda002
 0    0.90999          osd.0          up  1.00000   1.00000
10    0.90999          osd.10         up  1.00000   1.00000
11    0.90999          osd.11         up  1.00000   1.00000
12    0.90999          osd.12         up  1.00000   1.00000
13    0.90999          osd.13         up  1.00000   1.00000
14    0.90999          osd.14         up  1.00000   1.00000
15    0.90999          osd.15         up  1.00000   1.00000
16    0.90999          osd.16         up  1.00000   1.00000
32    1.81999          osd.32         up  1.00000   1.00000
33    1.81999          osd.33         up  1.00000   1.00000
34    1.81999          osd.34         up  1.00000   1.00000
35    1.81999          osd.35         up  1.00000   1.00000
36    1.81999          osd.36         up  1.00000   1.00000
37    1.81999          osd.37         up  1.00000   1.00000
38    1.81999          osd.38         up  1.00000   1.00000
39    1.81999          osd.39         up  1.00000   1.00000
-3   29.01999      host scda006
84    0.81000          osd.84         up  1.00000   1.00000
85    0.90999          osd.85         up  1.00000   1.00000
86    0.90999          osd.86         up  1.00000   1.00000
87    0.90999          osd.87         up  1.00000   1.00000
88    0.90999          osd.88         up  1.00000   1.00000
89    0.90999          osd.89         up  1.00000   1.00000
90    0.90999          osd.90         up  1.00000   1.00000
91    0.90999          osd.91         up  1.00000   1.00000
 9    0.90999          osd.9          up  1.00000   1.00000
17    0.90999          osd.17         up  1.00000   1.00000
18    0.90999          osd.18         up  1.00000   1.00000
19    0.90999          osd.19         up  1.00000   1.00000
20    0.90999          osd.20         up  1.00000   1.00000
21    0.90999          osd.21         up  1.00000   1.00000
22    0.90999          osd.22         up  1.00000   1.00000
23    0.90999          osd.23         up  1.00000   1.00000
49    1.81999          osd.49         up  1.00000   1.00000
50    1.81999          osd.50         up  1.00000   1.00000
51    1.81999          osd.51         up  1.00000   1.00000
52    1.81999          osd.52         up  1.00000   1.00000
53    1.81999          osd.53         up  1.00000   1.00000
54    1.81999          osd.54         up  1.00000   1.00000
55    1.81999          osd.55         up  1.00000   1.00000
56    1.81999          osd.56         up  1.00000   1.00000
-2   70.98000      host scda005
79    5.45999          osd.79         up  1.00000   1.00000
80    5.45999          osd.80         up  1.00000   1.00000
81    5.45999          osd.81         up  1.00000   1.00000
82    5.45999          osd.82         up  1.00000   1.00000
83    5.45999          osd.83         up  1.00000   1.00000
40    5.45999          osd.40         up  1.00000   1.00000
41    5.45999          osd.41         up  1.00000   1.00000
42    5.45999          osd.42         up  1.00000   1.00000
43    5.45999          osd.43         up  1.00000   1.00000
44    5.45999          osd.44         up  1.00000   1.00000
45    5.45999          osd.45         up  1.00000   1.00000
46    5.45999          osd.46         up  1.00000   1.00000
47    5.45999          osd.47         up  1.00000   1.00000
-4   70.98000      host scda007
74    5.45999          osd.74         up  1.00000   1.00000
75    5.45999          osd.75         up  1.00000   1.00000
76    5.45999          osd.76         up  1.00000   1.00000
77    5.45999          osd.77         up  1.00000   1.00000
78    5.45999          osd.78         up  1.00000   1.00000
 1    5.45999          osd.1          up  1.00000   1.00000
 2    5.45999          osd.2          up  1.00000   1.00000
 3    5.45999          osd.3          up  1.00000   1.00000
 4    5.45999          osd.4          up  1.00000   1.00000
 5    5.45999          osd.5          up  1.00000   1.00000
 6    5.45999          osd.6          up  1.00000   1.00000
 7    5.45999          osd.7          up  1.00000   1.00000
 8    5.45999          osd.8          up  1.00000   1.00000
-5   81.89999      host scda008
67    5.45999          osd.67         up  1.00000   1.00000
68    5.45999          osd.68         up  1.00000   1.00000
69    5.45999          osd.69         up  1.00000   1.00000
70    5.45999          osd.70         up  1.00000   1.00000
71    5.45999          osd.71         up  1.00000   1.00000
72    5.45999          osd.72         up  1.00000   1.00000
73    5.45999          osd.73         up  1.00000   1.00000
24    5.45999          osd.24         up  1.00000   1.00000
25    5.45999          osd.25         up  1.00000   1.00000
26    5.45999          osd.26         up  1.00000   1.00000
27    5.45999          osd.27         up  1.00000   1.00000
28    5.45999          osd.28         up  1.00000   1.00000
29    5.45999          osd.29         up  1.00000   1.00000
30    5.45999          osd.30         up  1.00000   1.00000
31    5.45999          osd.31         up  1.00000   1.00000
-7    4.45999      host scda004
57    4.45999          osd.57         up  1.00000   1.00000
-8   16.37999      host scda011
58    1.81999          osd.58         up  1.00000   1.00000
59    1.81999          osd.59         up  1.00000   1.00000
60    1.81999          osd.60         up  1.00000   1.00000
61    1.81999          osd.61         up  1.00000   1.00000
62    1.81999          osd.62         up  1.00000   1.00000
63    1.81999          osd.63         up  1.00000   1.00000
64    1.81999          osd.64         up  1.00000   1.00000
65    1.81999          osd.65         up  1.00000   1.00000
66    1.81999          osd.66         up  1.00000   1.00000
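(For reference, the WEIGHT column in the tree above is the CRUSH weight stored in the crushmap, by convention roughly the disk's capacity in TiB; osd.84's 0.81 reflects the reduction mentioned above. The REWEIGHT column is the temporary 0.0-1.0 override that 'ceph osd reweight' changes, and it is this override that is not preserved if an OSD is removed and re-created. A sketch of how the crushmap weight itself would be changed, using only the values already visible in the tree:)

    # Permanently change the CRUSH weight of an OSD (the WEIGHT column above);
    # unlike the temporary reweight override, this is stored in the crushmap:
    ceph osd crush reweight osd.84 0.81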
When I was mentioning the uneven distribution of data, I was of course dividing the number of PGs per OSD by the weight of the OSD.

I have not touched the tunables; they are presumably set to the defaults from installation. Here is what I get for the show-tunables command:

{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 0,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 22,
    "profile": "unknown",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "require_feature_tunables3": 0,
    "has_v2_rules": 0,
    "has_v3_rules": 0,
    "has_v4_buckets": 0
}

ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

Here are the disk utilizations, ranging from 34% for osd.20 to 87% for osd.84 and osd.9:

ceph osd df
ID  WEIGHT   REWEIGHT  SIZE   USE    AVAIL  %USE   VAR
 0  0.90999  1.00000    931G   545G   385G  58.64  0.98
10  0.90999  1.00000    931G   725G   205G  77.95  1.31
11  0.90999  1.00000    931G   432G   498G  46.47  0.78
12  0.90999  1.00000    931G   432G   498G  46.49  0.78
13  0.90999  1.00000    931G   660G   270G  70.93  1.19
14  0.90999  1.00000    931G   455G   475G  48.94  0.82
15  0.90999  1.00000    931G   660G   270G  70.93  1.19
16  0.90999  1.00000    931G   680G   250G  73.06  1.22
32  1.81999  1.00000   1862G  1352G   509G  72.65  1.22
33  1.81999  1.00000   1862G  1065G   796G  57.22  0.96
34  1.81999  1.00000   1862G  1128G   733G  60.61  1.02
35  1.81999  1.00000   1862G  1269G   592G  68.18  1.14
36  1.81999  1.00000   1862G  1398G   464G  75.08  1.26
37  1.81999  1.00000   1862G  1172G   689G  62.98  1.06
38  1.81999  1.00000   1862G  1176G   685G  63.16  1.06
39  1.81999  1.00000   1862G  1220G   641G  65.55  1.10
84  0.81000  1.00000    931G   816G   114G  87.73  1.47
85  0.90999  1.00000    931G   769G   161G  82.67  1.39
86  0.90999  1.00000    931G   704G   226G  75.63  1.27
87  0.90999  1.00000    931G   638G   292G  68.55  1.15
88  0.90999  1.00000    931G   523G   407G  56.28  0.94
89  0.90999  1.00000    931G   502G   428G  53.96  0.90
90  0.90999  1.00000    931G   729G   201G  78.33  1.31
91  0.90999  1.00000    931G   548G   383G  58.86  0.99
 9  0.90999  1.00000    931G   818G   112G  87.94  1.47
17  0.90999  1.00000    931G   479G   451G  51.50  0.86
18  0.90999  1.00000    931G   547G   383G  58.78  0.99
19  0.90999  1.00000    931G   637G   293G  68.46  1.15
20  0.90999  1.00000    931G   322G   608G  34.69  0.58
21  0.90999  1.00000    931G   523G   407G  56.20  0.94
22  0.90999  1.00000    931G   615G   315G  66.12  1.11
23  0.90999  1.00000    931G   480G   450G  51.56  0.86
49  1.81999  1.00000   1862G  1467G   394G  78.83  1.32
50  1.81999  1.00000   1862G  1198G   663G  64.38  1.08
51  1.81999  1.00000   1862G  1087G   774G  58.41  0.98
52  1.81999  1.00000   1862G  1174G   687G  63.09  1.06
53  1.81999  1.00000   1862G  1246G   615G  66.96  1.12
54  1.81999  1.00000   1862G   771G  1090G  41.43  0.69
55  1.81999  1.00000   1862G   885G   976G  47.58  0.80
56  1.81999  1.00000   1862G  1489G   373G  79.96  1.34
79  5.45999  1.00000   5588G  3441G  2146G  61.59  1.03
80  5.45999  1.00000   5588G  3427G  2160G  61.33  1.03
81  5.45999  1.00000   5588G  3607G  1980G  64.55  1.08
82  5.45999  1.00000   5588G  3311G  2276G  59.26  0.99
83  5.45999  1.00000   5588G  3295G  2292G  58.98  0.99
40  5.45999  1.00000   5587G  3548G  2038G  63.51  1.06
41  5.45999  1.00000   5587G  3471G  2115G  62.13  1.04
42  5.45999  1.00000   5587G  3540G  2046G  63.37  1.06
43  5.45999  1.00000   5587G  3356G  2230G  60.07  1.01
44  5.45999  1.00000   5587G  3113G  2473G  55.72  0.93
45  5.45999  1.00000   5587G  3426G  2160G  61.33  1.03
46  5.45999  1.00000   5587G  3136G  2451G  56.13  0.94
47  5.45999  1.00000   5587G  3222G  2364G  57.67  0.97
74  5.45999  1.00000   5588G  3536G  2051G  63.28  1.06
75  5.45999  1.00000   5588G  3672G  1915G  65.72  1.10
76  5.45999  1.00000   5588G  3784G  1803G  67.73  1.14
77  5.45999  1.00000   5588G  3652G  1935G  65.36  1.10
78  5.45999  1.00000   5588G  3291G  2297G  58.89  0.99
 1  5.45999  1.00000   5587G  3200G  2386G  57.28  0.96
 2  5.45999  1.00000   5587G  2680G  2906G  47.98  0.80
 3  5.45999  1.00000   5587G  3382G  2204G  60.54  1.01
 4  5.45999  1.00000   5587G  3095G  2491G  55.41  0.93
 5  5.45999  1.00000   5587G  3851G  1735G  68.94  1.16
 6  5.45999  1.00000   5587G  3312G  2274G  59.29  0.99
 7  5.45999  1.00000   5587G  2884G  2702G  51.63  0.87
 8  5.45999  1.00000   5587G  3407G  2179G  60.98  1.02
67  5.45999  1.00000   5587G  3452G  2134G  61.80  1.04
68  5.45999  1.00000   5587G  2780G  2806G  49.76  0.83
69  5.45999  1.00000   5587G  3337G  2249G  59.74  1.00
70  5.45999  1.00000   5587G  3578G  2008G  64.06  1.07
71  5.45999  1.00000   5587G  3358G  2228G  60.12  1.01
72  5.45999  1.00000   5587G  3021G  2565G  54.08  0.91
73  5.45999  1.00000   5587G  3160G  2426G  56.57  0.95
24  5.45999  1.00000   5587G  3085G  2501G  55.22  0.93
25  5.45999  1.00000   5587G  3495G  2091G  62.56  1.05
26  5.45999  1.00000   5587G  3141G  2445G  56.22  0.94
27  5.45999  1.00000   5587G  3897G  1689G  69.76  1.17
28  5.45999  1.00000   5587G  3243G  2343G  58.05  0.97
29  5.45999  1.00000   5587G  2907G  2679G  52.05  0.87
30  5.45999  1.00000   5587G  3788G  1798G  67.81  1.14
31  5.45999  1.00000   5587G  3289G  2297G  58.88  0.99
57  4.45999  1.00000   4563G  2824G  1738G  61.90  1.04
58  1.81999  1.00000   1862G  1267G   594G  68.09  1.14
59  1.81999  1.00000   1862G  1064G   798G  57.14  0.96
60  1.81999  1.00000   1862G  1468G   393G  78.86  1.32
61  1.81999  1.00000   1862G  1219G   642G  65.50  1.10
62  1.81999  1.00000   1862G  1175G   686G  63.13  1.06
63  1.81999  1.00000   1862G  1290G   571G  69.32  1.16
64  1.81999  1.00000   1862G  1358G   503G  72.96  1.22
65  1.81999  1.00000   1862G  1401G   460G  75.28  1.26
66  1.81999  1.00000   1862G  1309G   552G  70.31  1.18

Thanks for the help :)

Andras


On 9/21/15, 2:55 PM, "Michael Hackett" <mhackett@xxxxxxxxxx> wrote:

>Hello Andras,
>
>Some initial observations and questions:
>
>The total PG recommendation for this cluster would actually be 8192 PGs
>per the formula.
>
>Total PGs = (90 * 100) / 2 = 4500
>
>Next power of 2 = 8192.
>
>The result should be rounded up to the nearest power of two. Rounding up
>is optional, but recommended for CRUSH to evenly balance the number of
>objects among placement groups.
>
>How many data pools are being used for storing objects?
>
>'ceph osd dump |grep pool'
>
>Also, how are these 90 OSDs laid out across the 8 hosts, and is there any
>discrepancy between disk sizes and weight?
>
>'ceph osd tree'
>
>Also, what are you using for CRUSH tunables, and what Ceph release?
>
>'ceph osd crush show-tunables'
>'ceph -v'
>
>Thanks,
>
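(Applying the 8192-PG suggestion above would look roughly like the sketch below, using the cephfs_data pool from the 'ceph osd dump' output earlier in the thread. Increasing pg_num splits existing PGs, and raising pgp_num afterwards is what actually triggers the rebalancing, so this moves a lot of data; depending on the monitor's split limits, the increase may also have to be done in smaller steps rather than one jump.)

    # Grow the PG count of the busy data pool to the recommended power of two:
    ceph osd pool set cephfs_data pg_num 8192

    # Then raise pgp_num so the new PGs are placed independently; until this
    # is done they remain on the same OSDs as the PGs they were split from:
    ceph osd pool set cephfs_data pgp_num 8192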
>----- Original Message -----
>From: "Andras Pataki" <apataki@xxxxxxxxxxxxxxxxxxxx>
>To: ceph-users@xxxxxxxxxxxxxx
>Sent: Monday, September 21, 2015 2:00:29 PM
>Subject: Uneven data distribution across OSDs
>
>Hi ceph users,
>
>I am using CephFS for file storage and I have noticed that the data gets
>distributed very unevenly across OSDs.
>
> * I have about 90 OSDs across 8 hosts, and 4096 PGs for the cephfs_data
>pool with 2 replicas, which is in line with the total PG recommendation
>of "Total PGs = (OSDs * 100) / pool_size" from the docs.
> * CephFS distributes the data pretty much evenly across the PGs, as
>shown by 'ceph pg dump'.
> * However, the number of PGs assigned to the various OSDs (per weight
>unit/terabyte) varies quite a lot. The fullest OSD has as many as 44 PGs
>per terabyte (weight unit), while the emptier ones have as few as 19 or 20.
> * Even if I consider the total number of PGs for all pools per OSD, the
>number varies similarly wildly (as with the cephfs_data pool only).
>
>As a result, when the whole CephFS file system is at 60% full, some of
>the OSDs already reach the 95% full condition, and no more data can be
>written to the system.
>
>Is there any way to force a more even distribution of PGs to OSDs? I am
>using the default crush map, with two levels (root/host). Can any changes
>to the crush map help? I would really like to be able to get higher disk
>utilization than 60% without 1 of 90 disks filling up so early.
>
>Thanks,
>
>Andras

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com