Re: OSDs are not utilized evenly

Hi Denis,

can you share the following data points?

ceph osd df tree (to see how the OSDs are distributed)
ceph osd crush rule dump (to see what your EC rule looks like)
ceph osd pool ls detail (to see the pools, the pool-to-crush-rule mapping, and the pg nums)

Also, regarding this part of the status:

    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect"

Is the autoscaler currently adjusting your pg counts?
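
A quick way to check (assuming the pg_autoscaler mgr module is on; <pool-name> is just a placeholder):

    # show current vs. target pg_num per pool
    ceph osd pool autoscale-status
    # check whether autoscaling is on/off/warn for a given pool
    ceph osd pool get <pool-name> pg_autoscale_mode

If NEW PG_NUM differs from PG_NUM in that output, the autoscaler is still adjusting your pg counts.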

-Joseph

On Wed, Nov 2, 2022 at 5:01 PM Denis Polom <denispolom@xxxxxxxxx> wrote:

> Hi Joseph,
>
> thank you for the answer. But if I'm reading the 'ceph osd df' output I
> posted correctly, I see there are about 195 PGs per OSD.
>
> There are 608 OSDs in the pool, which is the only data pool. From what I
> have calculated, the PG calc says the PG number is fine.
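>
> (As a rough sanity check: for a single EC data pool the expected average is
> pgs_per_osd ≈ pg_num * (k + m) / num_osds, since every PG places one shard
> per EC chunk; k and m here are whatever the pool's EC profile uses.)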
>
>
> On 11/1/22 14:03, Joseph Mundackal wrote:
>
> If the GB per pg is high, the balancer module won't be able to help.
>
> Your pg count per OSD also looks low (30s), so increasing pgs per pool
> would help with both problems.
>
> You can use the pg calculator to determine which pools need what; a rough
> sketch of the resulting pg_num change is below.
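>
> A minimal sketch, with a placeholder pool name and target value (not a
> recommendation for this cluster):
>
>     # raise pg_num to the value suggested by the pg calculator
>     ceph osd pool set <pool-name> pg_num <target-pg-num>
>     # on Nautilus and later, pgp_num follows pg_num automatically,
>     # but it can also be raised explicitly:
>     ceph osd pool set <pool-name> pgp_num <target-pg-num>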
>
> On Tue, Nov 1, 2022, 08:46 Denis Polom <denispolom@xxxxxxxxx> wrote:
>
>> Hi
>>
>> I observed on my Ceph cluster, running the latest Pacific, that same-size
>> OSDs are utilized differently even though the balancer is running and
>> reports its status (ceph balancer status) as perfectly balanced:
>>
>> {
>>      "active": true,
>>      "last_optimize_duration": "0:00:00.622467",
>>      "last_optimize_started": "Tue Nov  1 12:49:36 2022",
>>      "mode": "upmap",
>>      "optimize_result": "Unable to find further optimization, or pool(s)
>> pg_num is decreasing, or distribution is already perfect",
>>      "plans": []
>> }
>>
>> balancer settings for upmap are:
>>
>>   mgr  advanced  mgr/balancer/mode                      upmap
>>   mgr  advanced  mgr/balancer/upmap_max_deviation       1
>>   mgr  advanced  mgr/balancer/upmap_max_optimizations   20
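>>
>> For reference, setting these is roughly equivalent to running something
>> like the following (shown just for completeness):
>>
>>     ceph balancer mode upmap
>>     ceph config set mgr mgr/balancer/upmap_max_deviation 1
>>     ceph config set mgr mgr/balancer/upmap_max_optimizations 20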
>>
>> From the output of `ceph osd df` it's obvious that utilization is not the
>> same (the difference is about 1 TB). The following is just a partial output:
>>
>> ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA    OMAP     META    AVAIL    %USE   VAR   PGS  STATUS
>>   0   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  3.0 MiB  37 GiB  3.6 TiB  78.09  1.05  196  up
>> 124   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  32 GiB  4.7 TiB  71.20  0.96  195  up
>> 157   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  5.3 MiB  35 GiB  3.7 TiB  77.67  1.05  195  up
>>   1   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  2.0 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
>> 243   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  31 GiB  4.7 TiB  71.16  0.96  195  up
>> 244   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  31 GiB  4.7 TiB  71.19  0.96  195  up
>> 245   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  32 GiB  4.7 TiB  71.55  0.96  196  up
>> 246   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  31 GiB  4.7 TiB  71.17  0.96  195  up
>> 249   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  30 GiB  4.7 TiB  71.18  0.96  195  up
>> 500   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  30 GiB  4.7 TiB  71.19  0.96  195  up
>> 501   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  31 GiB  4.7 TiB  71.57  0.96  196  up
>> 502   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  31 GiB  4.7 TiB  71.18  0.96  195  up
>> 532   hdd   18.00020   1.00000  16 TiB  12 TiB   12 TiB      0 B  31 GiB  4.7 TiB  71.16  0.96  195  up
>> 549   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  576 KiB  36 GiB  3.7 TiB  77.70  1.05  195  up
>> 550   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  3.8 MiB  36 GiB  3.7 TiB  77.67  1.05  195  up
>> 551   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  2.4 MiB  35 GiB  3.7 TiB  77.68  1.05  195  up
>> 552   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  5.5 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
>> 553   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  5.1 MiB  37 GiB  3.6 TiB  77.71  1.05  195  up
>> 554   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  967 KiB  36 GiB  3.6 TiB  77.71  1.05  195  up
>> 555   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  1.3 MiB  36 GiB  3.6 TiB  78.08  1.05  196  up
>> 556   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  4.7 MiB  36 GiB  3.6 TiB  78.10  1.05  196  up
>> 557   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  2.4 MiB  36 GiB  3.7 TiB  77.69  1.05  195  up
>> 558   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  4.5 MiB  36 GiB  3.6 TiB  77.72  1.05  195  up
>> 559   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  1.5 MiB  35 GiB  3.6 TiB  78.09  1.05  196  up
>> 560   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  5.2 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
>> 561   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  2.8 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
>> 562   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  1.0 MiB  36 GiB  3.7 TiB  77.68  1.05  195  up
>> 563   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  2.6 MiB  36 GiB  3.7 TiB  77.68  1.05  195  up
>> 564   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  5.1 MiB  36 GiB  3.6 TiB  78.09  1.05  196  up
>> 567   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  4.8 MiB  36 GiB  3.6 TiB  78.11  1.05  196  up
>> 568   hdd   18.00020   1.00000  16 TiB  13 TiB   13 TiB  5.2 MiB  35 GiB  3.7 TiB  77.68  1.05  195  up
>>
>> All OSDs are used by the same (EC) pool.
>>
>> I have the same issue on another Ceph cluster with the same setup. There
>> I was able to even out the OSD utilization by lowering the reweight from
>> 1.00000 on the OSDs with higher utilization, and I gained a lot of free
>> space:
>>
>> before changing reweight:
>>
>> --- RAW STORAGE ---
>> CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
>> hdd    3.1 PiB  510 TiB  2.6 PiB   2.6 PiB      83.77
>> ssd    2.6 TiB  2.6 TiB   46 GiB    46 GiB       1.70
>> TOTAL  3.1 PiB  513 TiB  2.6 PiB   2.6 PiB      83.70
>>
>> --- POOLS ---
>> POOL                   ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>> cephfs_data             3  8192  2.1 PiB  555.63M  2.6 PiB  91.02    216 TiB
>> cephfs_metadata         4   128  7.5 GiB  140.22k   22 GiB   0.87    851 GiB
>> device_health_metrics   5     1  4.1 GiB    1.15k  8.3 GiB      0    130 TiB
>>
>>
>> after changing reweight:
>> --- RAW STORAGE ---
>> CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
>> hdd    3.1 PiB  522 TiB  2.6 PiB   2.6 PiB      83.38
>> ssd    2.6 TiB  2.6 TiB   63 GiB    63 GiB       2.36
>> TOTAL  3.1 PiB  525 TiB  2.6 PiB   2.6 PiB      83.31
>>
>> --- POOLS ---
>> POOL                   ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>> cephfs_data             3  8192  2.1 PiB  555.63M  2.5 PiB  86.83    330 TiB
>> cephfs_metadata         4   128  7.4 GiB  140.22k   22 GiB   0.87    846 GiB
>> device_health_metrics   5     1  4.2 GiB    1.15k  8.4 GiB      0    198 TiB
>>
>> The free space I gained is almost 5%, which is about 100 TB!
>>
>> This is just a workaround, though, and I'm not happy keeping the reweight
>> at a non-default value permanently.
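>>
>> For reference, what I did was roughly the following per over-utilized OSD
>> (the OSD id and value are just an illustration, not the exact numbers I
>> used):
>>
>>     # push the reweight of an over-full OSD slightly below 1.0
>>     ceph osd reweight <osd-id> 0.95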
>>
>> Do you have any advice, please, on what settings can or should be
>> adjusted to keep the OSD utilization the same? Because obviously neither
>> the upmap balancer nor crush-compat is working correctly, at least in my
>> case.
>>
>> Many thanks!
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


