Hi
I have observed on my Ceph cluster, running the latest Pacific release, that same-sized OSDs are utilized quite differently, even though the balancer is running and reports the distribution as perfectly balanced (output of `ceph balancer status`):
{
"active": true,
"last_optimize_duration": "0:00:00.622467",
"last_optimize_started": "Tue Nov 1 12:49:36 2022",
"mode": "upmap",
"optimize_result": "Unable to find further optimization, or pool(s)
pg_num is decreasing, or distribution is already perfect",
"plans": []
}
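For reference, the numeric score the balancer module optimizes can also be checked cluster-wide and per pool (the pool name below is just an example):

ceph balancer eval
ceph balancer eval cephfs_data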
The balancer settings for upmap are:
mgr  advanced  mgr/balancer/mode                     upmap
mgr  advanced  mgr/balancer/upmap_max_deviation      1
mgr  advanced  mgr/balancer/upmap_max_optimizations  20
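For completeness, the equivalent commands to configure this are roughly:

ceph balancer mode upmap
ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph config set mgr mgr/balancer/upmap_max_optimizations 20
ceph balancer on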
The output of `ceph osd df` makes it obvious that utilization is not the same (the difference is roughly 1 TB between OSDs). The following is just a partial output; a quick way to summarize the spread is shown after the listing:
ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA    OMAP     META    AVAIL    %USE   VAR   PGS  STATUS
  0  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  3.0 MiB  37 GiB  3.6 TiB  78.09  1.05  196  up
124  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  32 GiB  4.7 TiB  71.20  0.96  195  up
157  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  5.3 MiB  35 GiB  3.7 TiB  77.67  1.05  195  up
  1  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  2.0 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
243  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  31 GiB  4.7 TiB  71.16  0.96  195  up
244  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  31 GiB  4.7 TiB  71.19  0.96  195  up
245  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  32 GiB  4.7 TiB  71.55  0.96  196  up
246  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  31 GiB  4.7 TiB  71.17  0.96  195  up
249  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  30 GiB  4.7 TiB  71.18  0.96  195  up
500  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  30 GiB  4.7 TiB  71.19  0.96  195  up
501  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  31 GiB  4.7 TiB  71.57  0.96  196  up
502  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  31 GiB  4.7 TiB  71.18  0.96  195  up
532  hdd    18.00020   1.00000  16 TiB   12 TiB  12 TiB      0 B  31 GiB  4.7 TiB  71.16  0.96  195  up
549  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  576 KiB  36 GiB  3.7 TiB  77.70  1.05  195  up
550  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  3.8 MiB  36 GiB  3.7 TiB  77.67  1.05  195  up
551  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  2.4 MiB  35 GiB  3.7 TiB  77.68  1.05  195  up
552  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  5.5 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
553  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  5.1 MiB  37 GiB  3.6 TiB  77.71  1.05  195  up
554  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  967 KiB  36 GiB  3.6 TiB  77.71  1.05  195  up
555  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  1.3 MiB  36 GiB  3.6 TiB  78.08  1.05  196  up
556  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  4.7 MiB  36 GiB  3.6 TiB  78.10  1.05  196  up
557  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  2.4 MiB  36 GiB  3.7 TiB  77.69  1.05  195  up
558  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  4.5 MiB  36 GiB  3.6 TiB  77.72  1.05  195  up
559  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  1.5 MiB  35 GiB  3.6 TiB  78.09  1.05  196  up
560  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  5.2 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
561  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  2.8 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
562  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  1.0 MiB  36 GiB  3.7 TiB  77.68  1.05  195  up
563  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  2.6 MiB  36 GiB  3.7 TiB  77.68  1.05  195  up
564  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  5.1 MiB  36 GiB  3.6 TiB  78.09  1.05  196  up
567  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  4.8 MiB  36 GiB  3.6 TiB  78.11  1.05  196  up
568  hdd    18.00020   1.00000  16 TiB   13 TiB  13 TiB  5.2 MiB  35 GiB  3.7 TiB  77.68  1.05  195  up
All of these OSDs are used by the same (EC) pool, and they clearly fall into two groups: roughly 71-72% and 77-78% used, even though every OSD holds 195 or 196 PGs.
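A quick way to get that overview is something like the following one-liner (the jq field names are assumed from the JSON output of `ceph osd df` on Pacific):

# print id, PG count and utilization per hdd OSD, sorted by utilization
ceph osd df -f json | jq -r '.nodes[] | select(.device_class=="hdd") | [.id, .pgs, .utilization] | @tsv' | sort -n -k3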
I have the same issue on another Ceph cluster with the same setup. There I was able to even out OSD utilization by lowering the reweight value from 1.00000 on the OSDs with higher utilization, which gained me a lot of free space.
Before changing reweight (`ceph df`):
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 3.1 PiB 510 TiB 2.6 PiB 2.6 PiB 83.77
ssd 2.6 TiB 2.6 TiB 46 GiB 46 GiB 1.70
TOTAL 3.1 PiB 513 TiB 2.6 PiB 2.6 PiB 83.70
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
cephfs_data 3 8192 2.1 PiB 555.63M 2.6 PiB 91.02 216 TiB
cephfs_metadata 4 128 7.5 GiB 140.22k 22 GiB 0.87 851 GiB
device_health_metrics 5 1 4.1 GiB 1.15k 8.3 GiB 0 130 TiB
After changing reweight:
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 3.1 PiB 522 TiB 2.6 PiB 2.6 PiB 83.38
ssd 2.6 TiB 2.6 TiB 63 GiB 63 GiB 2.36
TOTAL 3.1 PiB 525 TiB 2.6 PiB 2.6 PiB 83.31
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
cephfs_data 3 8192 2.1 PiB 555.63M 2.5 PiB 86.83 330 TiB
cephfs_metadata 4 128 7.4 GiB 140.22k 22 GiB 0.87 846 GiB
device_health_metrics 5 1 4.2 GiB 1.15k 8.4 GiB 0 198 TiB
The free space I gained is almost 5%, which is about 100 TB!
This is just a workaround, though, and I'm not happy about keeping non-default reweight values permanently.
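For the record, the workaround on the other cluster was nothing more than the classic override reweight, along these lines (OSD id, threshold and weight are just examples):

ceph osd test-reweight-by-utilization 110   # dry run: show which OSDs would be adjusted
ceph osd reweight-by-utilization 110        # apply it, or adjust a single OSD:
ceph osd reweight 553 0.95                  # lower the override weight of an over-full OSD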
Do you have any advice on which settings can or should be adjusted to keep OSD utilization even? Because apparently neither the upmap balancer nor crush-compat is working correctly, at least in my case.
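One thing I still plan to try is running the upmap optimizer offline against a copy of the osdmap, to see whether it also concludes that there is nothing left to move - roughly (pool name taken from the second cluster, adjust as needed):

ceph osd getmap -o om
osdmaptool om --upmap upmap.sh --upmap-pool cephfs_data --upmap-deviation 1 --upmap-max 100
cat upmap.sh   # proposed "ceph osd pg-upmap-items ..." commands; empty if nothing to do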
Many thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx