On Tue, Nov 28, 2023 at 3:52 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> Very small and/or non-uniform clusters can be corner cases for many things, especially if they don’t have enough PGs.  What is your failure domain — host or OSD?

Failure domain is host, and the PG counts should be fairly reasonable.

>
> Are your OSDs sized uniformly?  Please send the output of the following commands:

OSDs are definitely not uniform in size. This might be the issue with the automation. You asked for it, but I do apologize for the wall of text that follows...

>
> `ceph osd tree`

ID   CLASS  WEIGHT     TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         131.65762  root default
-25          16.46977      host k8s1
 14    hdd    5.45799          osd.14      up   0.90002  1.00000
 19    hdd   10.91409          osd.19      up   1.00000  1.00000
 22    ssd    0.09769          osd.22      up   1.00000  1.00000
-13          25.56458      host k8s3
  2    hdd   10.91409          osd.2       up   0.84998  1.00000
  3    hdd    1.81940          osd.3       up   0.75002  1.00000
 20    hdd   12.73340          osd.20      up   1.00000  1.00000
 10    ssd    0.09769          osd.10      up   1.00000  1.00000
-14          12.83107      host k8s4
  0    hdd   10.91399          osd.0       up   1.00000  1.00000
  5    hdd    1.81940          osd.5       up   1.00000  1.00000
 11    ssd    0.09769          osd.11      up   1.00000  1.00000
 -2          14.65048      host k8s5
  1    hdd    1.81940          osd.1       up   0.70001  1.00000
 17    hdd   12.73340          osd.17      up   1.00000  1.00000
 12    ssd    0.09769          osd.12      up   1.00000  1.00000
 -6          14.65048      host k8s6
  4    hdd    1.81940          osd.4       up   0.75000  1.00000
 16    hdd   12.73340          osd.16      up   0.95001  1.00000
 13    ssd    0.09769          osd.13      up   1.00000  1.00000
 -3          23.74518      host k8s7
  6    hdd   12.73340          osd.6       up   1.00000  1.00000
 15    hdd   10.91409          osd.15      up   0.95001  1.00000
  8    ssd    0.09769          osd.8       up   1.00000  1.00000
 -9          23.74606      host k8s8
  7    hdd   14.55269          osd.7       up   1.00000  1.00000
 18    hdd    9.09569          osd.18      up   1.00000  1.00000
  9    ssd    0.09769          osd.9       up   1.00000  1.00000

>
> so that we can see the topology.
>
> `ceph -s`

Note this cluster is in the middle of re-creating all the OSDs to modify the OSD allocation size - I have scrubbing disabled since I'm basically rewriting just about everything in the cluster weekly right now, but normally that would be on.

  cluster:
    id:     ba455d73-116e-4f24-8a34-a45e3ba9f44c
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            546 pgs not deep-scrubbed in time
            542 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum e,f,g (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 617 pgs
    objects: 9.36M objects, 33 TiB
    usage:   67 TiB used, 65 TiB / 132 TiB avail
    pgs:     1778936/21708668 objects misplaced (8.195%)
             516 active+clean
             100 active+remapped+backfill_wait
             1   active+remapped+backfilling

  io:
    client:   371 KiB/s rd, 2.8 MiB/s wr, 2 op/s rd, 7 op/s wr
    recovery: 25 MiB/s, 6 objects/s

  progress:
    Global Recovery Event (7d)
      [=======================.....] (remaining: 36h)
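(For what it's worth, once the backfill finishes the plan is simply to clear those flags and let scrubbing catch back up - something along the lines of:

    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

after which the "not scrubbed in time" warnings should age out on their own.)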
> `ceph osd df`

Note that these are not in a steady state right now. OSD 6 in particular was just re-created and is repopulating. A few of the reweights were set to deal with some gross issues in balance - when it all settles down I plan to optimize them.

ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS
14    hdd   5.45799   0.90002  5.5 TiB  3.0 TiB  3.0 TiB  2.0 MiB   11 GiB   2.4 TiB  55.51  1.09   72      up
19    hdd  10.91409   1.00000   11 TiB  6.2 TiB  6.2 TiB  3.1 MiB   16 GiB   4.7 TiB  57.12  1.12  144      up
22    ssd   0.09769   1.00000  100 GiB  2.4 GiB  1.8 GiB  167 MiB  504 MiB    98 GiB   2.43  0.05   32      up
 2    hdd  10.91409   0.84998   11 TiB  4.5 TiB  4.5 TiB  5.0 MiB  9.7 GiB   6.4 TiB  41.11  0.81   99      up
 3    hdd   1.81940   0.75002  1.8 TiB  1.0 TiB  1.0 TiB  2.3 MiB  3.8 GiB   818 GiB  56.11  1.10   21      up
20    hdd  12.73340   1.00000   13 TiB  7.1 TiB  7.1 TiB  3.7 MiB   16 GiB   5.6 TiB  56.01  1.10  165      up
10    ssd   0.09769   1.00000  100 GiB  1.3 GiB  299 MiB  185 MiB  835 MiB    99 GiB   1.29  0.03   38      up
 0    hdd  10.91399   1.00000   11 TiB  6.5 TiB  6.5 TiB  3.7 MiB   15 GiB   4.4 TiB  59.41  1.17  144      up
 5    hdd   1.81940   1.00000  1.8 TiB  845 GiB  842 GiB  1.7 MiB  3.3 GiB  1018 GiB  45.36  0.89   23      up
11    ssd   0.09769   1.00000  100 GiB  3.1 GiB  1.3 GiB  157 MiB  1.6 GiB    97 GiB   3.09  0.06   33      up
 1    hdd   1.81940   0.70001  1.8 TiB  983 GiB  979 GiB  1.3 MiB  3.4 GiB   880 GiB  52.76  1.04   26      up
17    hdd  12.73340   1.00000   13 TiB  7.3 TiB  7.2 TiB  3.6 MiB   15 GiB   5.5 TiB  56.95  1.12  159      up
12    ssd   0.09769   1.00000  100 GiB  1.5 GiB  120 MiB   55 MiB  1.3 GiB    99 GiB   1.49  0.03   21      up
 4    hdd   1.81940   0.75000  1.8 TiB  1.0 TiB  1.0 TiB  2.5 MiB  3.0 GiB   820 GiB  55.98  1.10   24      up
16    hdd  12.73340   0.95001   13 TiB  7.6 TiB  7.5 TiB  7.9 MiB   16 GiB   5.2 TiB  59.32  1.17  171      up
13    ssd   0.09769   1.00000  100 GiB  2.4 GiB  528 MiB  196 MiB  1.7 GiB    98 GiB   2.38  0.05   33      up
 6    hdd  12.73340   1.00000   13 TiB  1.7 TiB  1.7 TiB  1.3 MiB  4.5 GiB    11 TiB  13.66  0.27   48      up
15    hdd  10.91409   0.95001   11 TiB  6.5 TiB  6.5 TiB  5.2 MiB   13 GiB   4.4 TiB  59.42  1.17  155      up
 8    ssd   0.09769   1.00000  100 GiB  1.9 GiB  1.1 GiB  116 MiB  788 MiB    98 GiB   1.95  0.04   26      up
 7    hdd  14.55269   1.00000   15 TiB  7.8 TiB  7.7 TiB  3.9 MiB   16 GiB   6.8 TiB  53.32  1.05  172      up
18    hdd   9.09569   1.00000  9.1 TiB  4.9 TiB  4.9 TiB  3.9 MiB   11 GiB   4.2 TiB  53.96  1.06  109      up
 9    ssd   0.09769   1.00000  100 GiB  2.2 GiB  391 MiB  264 MiB  1.6 GiB    98 GiB   2.25  0.04   40      up
                       TOTAL   132 TiB   67 TiB   67 TiB  1.2 GiB  164 GiB    65 TiB  50.82
MIN/MAX VAR: 0.03/1.17  STDDEV: 29.78
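(To be concrete about "optimizing them" above: once backfill completes, the idea would be to walk the override reweights back toward 1.0 and let the upmap balancer do the fine-grained work - roughly:

    ceph osd reweight 14 1.0
    ceph osd reweight 3 1.0

and so on for each OSD whose REWEIGHT column is below 1.00000 in the table above.)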
> `ceph osd dump | grep pool`

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 12539 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.98
pool 2 'myfs-metadata' replicated size 3 min_size 2 crush_rule 25 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 32432 lfor 0/0/31 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 2.19
pool 3 'myfs-replicated' replicated size 2 min_size 1 crush_rule 26 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 32511 lfor 0/21361/21359 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs read_balance_score 1.99
pool 4 'pvc-generic-pool' replicated size 3 min_size 2 crush_rule 17 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 32586 lfor 0/0/5211 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.26
pool 13 'myfs-eck2m2' erasure profile myfs-eck2m2_ecprofile size 4 min_size 3 crush_rule 8 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 32511 lfor 0/8517/8518 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application cephfs
pool 22 'my-store.rgw.otp' replicated size 3 min_size 2 crush_rule 24 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.75
pool 23 'my-store.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 22 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.63
pool 24 'my-store.rgw.log' replicated size 3 min_size 2 crush_rule 23 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.63
pool 25 'my-store.rgw.control' replicated size 3 min_size 2 crush_rule 19 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32432 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.75
pool 26 '.rgw.root' replicated size 3 min_size 2 crush_rule 18 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32432 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.50
pool 27 'my-store.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 20 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.62
pool 28 'my-store.rgw.meta' replicated size 3 min_size 2 crush_rule 21 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.74
pool 29 'my-store.rgw.buckets.data' erasure profile my-store.rgw.buckets.data_ecprofile size 4 min_size 3 crush_rule 16 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 32433 lfor 0/0/13673 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw

> `ceph balancer status`

This does have normal output when the cluster isn't in the middle of recovery.

{
    "active": true,
    "last_optimize_duration": "0:00:00.000107",
    "last_optimize_started": "Tue Nov 28 22:11:56 2023",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Too many objects (0.081907 > 0.050000) are misplaced; try again later",
    "plans": []
}

> `ceph osd pool autoscale-status`

No output for this, and I'm not sure why - it has given output in the past. It might be due to being in the middle of recovery, or it might be a Reef issue (I don't think I've looked at this since upgrading). In any case, the PG counts are in the osd dump above, and I think I have the hdd storage classes set to warn.

> The balancer module can be confounded by certain complex topologies like multiple device classes and/or CRUSH roots.
>
> Since you’re using Rook, I wonder if you might be hitting something that I’ve seen myself; the above commands will tell the tale.

Yeah, if it is designed for equally-sized OSDs then it isn't going to work quite right for me. I do try to keep hosts reasonably balanced, but not individual OSDs.

--
Rich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx