On Tue, Nov 28, 2023 at 3:52 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> Very small and/or non-uniform clusters can be corner cases for many things, especially if they don’t have enough PGs.  What is your failure domain — host or OSD?

Failure domain is host, and the PG counts should be fairly reasonable.

>
> Are your OSDs sized uniformly?  Please send the output of the following commands:

OSDs are definitely not uniform in size. This might be the issue with the automation. You asked for it, but I do apologize for the wall of text that follows...

>
> `ceph osd tree`

ID   CLASS  WEIGHT     TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         131.65762  root default
-25          16.46977      host k8s1
 14    hdd    5.45799          osd.14      up   0.90002  1.00000
 19    hdd   10.91409          osd.19      up   1.00000  1.00000
 22    ssd    0.09769          osd.22      up   1.00000  1.00000
-13          25.56458      host k8s3
  2    hdd   10.91409          osd.2       up   0.84998  1.00000
  3    hdd    1.81940          osd.3       up   0.75002  1.00000
 20    hdd   12.73340          osd.20      up   1.00000  1.00000
 10    ssd    0.09769          osd.10      up   1.00000  1.00000
-14          12.83107      host k8s4
  0    hdd   10.91399          osd.0       up   1.00000  1.00000
  5    hdd    1.81940          osd.5       up   1.00000  1.00000
 11    ssd    0.09769          osd.11      up   1.00000  1.00000
 -2          14.65048      host k8s5
  1    hdd    1.81940          osd.1       up   0.70001  1.00000
 17    hdd   12.73340          osd.17      up   1.00000  1.00000
 12    ssd    0.09769          osd.12      up   1.00000  1.00000
 -6          14.65048      host k8s6
  4    hdd    1.81940          osd.4       up   0.75000  1.00000
 16    hdd   12.73340          osd.16      up   0.95001  1.00000
 13    ssd    0.09769          osd.13      up   1.00000  1.00000
 -3          23.74518      host k8s7
  6    hdd   12.73340          osd.6       up   1.00000  1.00000
 15    hdd   10.91409          osd.15      up   0.95001  1.00000
  8    ssd    0.09769          osd.8       up   1.00000  1.00000
 -9          23.74606      host k8s8
  7    hdd   14.55269          osd.7       up   1.00000  1.00000
 18    hdd    9.09569          osd.18      up   1.00000  1.00000
  9    ssd    0.09769          osd.9       up   1.00000  1.00000

>
> so that we can see the topology.
>
> `ceph -s`

Note this cluster is in the middle of re-creating all the OSDs to modify the OSD allocation size - I have scrubbing disabled since I'm basically rewriting just about everything in the cluster weekly right now, but normally that would be on.

  cluster:
    id:     ba455d73-116e-4f24-8a34-a45e3ba9f44c
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            546 pgs not deep-scrubbed in time
            542 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum e,f,g (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 617 pgs
    objects: 9.36M objects, 33 TiB
    usage:   67 TiB used, 65 TiB / 132 TiB avail
    pgs:     1778936/21708668 objects misplaced (8.195%)
             516 active+clean
             100 active+remapped+backfill_wait
             1   active+remapped+backfilling

  io:
    client:   371 KiB/s rd, 2.8 MiB/s wr, 2 op/s rd, 7 op/s wr
    recovery: 25 MiB/s, 6 objects/s

  progress:
    Global Recovery Event (7d)
      [=======================.....] (remaining: 36h)
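(For what it's worth, once the backfill finishes the plan is simply to clear those flags and let scrubbing catch back up - something along the lines of:

    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

after which the "not scrubbed in time" warnings should age out on their own.)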
> `ceph osd df`

Note that these are not in a steady state right now. OSD 6 in particular was just re-created and is repopulating. A few of the reweights were set to deal with some gross issues in balance - when it all settles down I plan to optimize them.

ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS
14    hdd   5.45799   0.90002  5.5 TiB  3.0 TiB  3.0 TiB  2.0 MiB   11 GiB   2.4 TiB  55.51  1.09   72      up
19    hdd  10.91409   1.00000   11 TiB  6.2 TiB  6.2 TiB  3.1 MiB   16 GiB   4.7 TiB  57.12  1.12  144      up
22    ssd   0.09769   1.00000  100 GiB  2.4 GiB  1.8 GiB  167 MiB  504 MiB    98 GiB   2.43  0.05   32      up
 2    hdd  10.91409   0.84998   11 TiB  4.5 TiB  4.5 TiB  5.0 MiB  9.7 GiB   6.4 TiB  41.11  0.81   99      up
 3    hdd   1.81940   0.75002  1.8 TiB  1.0 TiB  1.0 TiB  2.3 MiB  3.8 GiB   818 GiB  56.11  1.10   21      up
20    hdd  12.73340   1.00000   13 TiB  7.1 TiB  7.1 TiB  3.7 MiB   16 GiB   5.6 TiB  56.01  1.10  165      up
10    ssd   0.09769   1.00000  100 GiB  1.3 GiB  299 MiB  185 MiB  835 MiB    99 GiB   1.29  0.03   38      up
 0    hdd  10.91399   1.00000   11 TiB  6.5 TiB  6.5 TiB  3.7 MiB   15 GiB   4.4 TiB  59.41  1.17  144      up
 5    hdd   1.81940   1.00000  1.8 TiB  845 GiB  842 GiB  1.7 MiB  3.3 GiB  1018 GiB  45.36  0.89   23      up
11    ssd   0.09769   1.00000  100 GiB  3.1 GiB  1.3 GiB  157 MiB  1.6 GiB    97 GiB   3.09  0.06   33      up
 1    hdd   1.81940   0.70001  1.8 TiB  983 GiB  979 GiB  1.3 MiB  3.4 GiB   880 GiB  52.76  1.04   26      up
17    hdd  12.73340   1.00000   13 TiB  7.3 TiB  7.2 TiB  3.6 MiB   15 GiB   5.5 TiB  56.95  1.12  159      up
12    ssd   0.09769   1.00000  100 GiB  1.5 GiB  120 MiB   55 MiB  1.3 GiB    99 GiB   1.49  0.03   21      up
 4    hdd   1.81940   0.75000  1.8 TiB  1.0 TiB  1.0 TiB  2.5 MiB  3.0 GiB   820 GiB  55.98  1.10   24      up
16    hdd  12.73340   0.95001   13 TiB  7.6 TiB  7.5 TiB  7.9 MiB   16 GiB   5.2 TiB  59.32  1.17  171      up
13    ssd   0.09769   1.00000  100 GiB  2.4 GiB  528 MiB  196 MiB  1.7 GiB    98 GiB   2.38  0.05   33      up
 6    hdd  12.73340   1.00000   13 TiB  1.7 TiB  1.7 TiB  1.3 MiB  4.5 GiB    11 TiB  13.66  0.27   48      up
15    hdd  10.91409   0.95001   11 TiB  6.5 TiB  6.5 TiB  5.2 MiB   13 GiB   4.4 TiB  59.42  1.17  155      up
 8    ssd   0.09769   1.00000  100 GiB  1.9 GiB  1.1 GiB  116 MiB  788 MiB    98 GiB   1.95  0.04   26      up
 7    hdd  14.55269   1.00000   15 TiB  7.8 TiB  7.7 TiB  3.9 MiB   16 GiB   6.8 TiB  53.32  1.05  172      up
18    hdd   9.09569   1.00000  9.1 TiB  4.9 TiB  4.9 TiB  3.9 MiB   11 GiB   4.2 TiB  53.96  1.06  109      up
 9    ssd   0.09769   1.00000  100 GiB  2.2 GiB  391 MiB  264 MiB  1.6 GiB    98 GiB   2.25  0.04   40      up
                       TOTAL   132 TiB   67 TiB   67 TiB  1.2 GiB  164 GiB    65 TiB  50.82
MIN/MAX VAR: 0.03/1.17  STDDEV: 29.78
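(To be concrete about "optimizing them" above: once backfill completes, the idea would be to walk the override reweights back toward 1.0 and let the upmap balancer do the fine-grained work - roughly:

    ceph osd reweight 14 1.0
    ceph osd reweight 3 1.0

and so on for each OSD whose REWEIGHT column is below 1.00000 in the table above.)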
> `ceph osd dump | grep pool`

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 12539 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.98
pool 2 'myfs-metadata' replicated size 3 min_size 2 crush_rule 25 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 32432 lfor 0/0/31 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 2.19
pool 3 'myfs-replicated' replicated size 2 min_size 1 crush_rule 26 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 32511 lfor 0/21361/21359 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs read_balance_score 1.99
pool 4 'pvc-generic-pool' replicated size 3 min_size 2 crush_rule 17 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 32586 lfor 0/0/5211 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.26
pool 13 'myfs-eck2m2' erasure profile myfs-eck2m2_ecprofile size 4 min_size 3 crush_rule 8 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 32511 lfor 0/8517/8518 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application cephfs
pool 22 'my-store.rgw.otp' replicated size 3 min_size 2 crush_rule 24 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.75
pool 23 'my-store.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 22 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.63
pool 24 'my-store.rgw.log' replicated size 3 min_size 2 crush_rule 23 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.63
pool 25 'my-store.rgw.control' replicated size 3 min_size 2 crush_rule 19 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32432 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.75
pool 26 '.rgw.root' replicated size 3 min_size 2 crush_rule 18 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32432 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.50
pool 27 'my-store.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 20 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.62
pool 28 'my-store.rgw.meta' replicated size 3 min_size 2 crush_rule 21 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.74
pool 29 'my-store.rgw.buckets.data' erasure profile my-store.rgw.buckets.data_ecprofile size 4 min_size 3 crush_rule 16 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 32433 lfor 0/0/13673 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw

> `ceph balancer status`

This does have normal output when the cluster isn't in the middle of recovery.

{
    "active": true,
    "last_optimize_duration": "0:00:00.000107",
    "last_optimize_started": "Tue Nov 28 22:11:56 2023",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Too many objects (0.081907 > 0.050000) are misplaced; try again later",
    "plans": []
}

> `ceph osd pool autoscale-status`

No output for this, and I'm not sure why - it has given output in the past. It might be due to being in the middle of recovery, or it might be a Reef issue (I don't think I've looked at this since upgrading). In any case, the PG counts are in the osd dump above, and I think I have the hdd storage classes set to warn.

> The balancer module can be confounded by certain complex topologies like multiple device classes and/or CRUSH roots.
>
> Since you’re using Rook, I wonder if you might be hitting something that I’ve seen myself; the above commands will tell the tale.

Yeah, if it is designed for equally-sized OSDs then it isn't going to work quite right for me. I do try to keep hosts reasonably balanced, but not individual OSDs.

--
Rich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx