>> Very small and/or non-uniform clusters can be corner cases for many things, especially if they don’t have enough PGs. What is your failure domain — host or OSD?
>
> Failure domain is host,

Your host buckets do vary in weight by roughly a factor of two. They will naturally get more or fewer PGs relative to their aggregate CRUSH weight, and so will the OSDs on each host.

> and PG number should be fairly reasonable.

Reasonable is in the eye of the beholder. I make the PG ratio for the cluster as a whole out to be ~90. I would definitely add more; that should help.

>> Are your OSDs sized uniformly? Please send the output of the following commands:
>
> OSDs are definitely not uniform in size. This might be the issue with
> the automation.
>
> You asked for it, but I do apologize for the wall of text that follows...

You described a small cluster, so this is peanuts.

>> `ceph osd tree`
>
> ID   CLASS  WEIGHT     TYPE NAME      STATUS  REWEIGHT  PRI-AFF
> -1          131.65762  root default
> -25          16.46977      host k8s1
> 14    hdd     5.45799          osd.14     up   0.90002  1.00000
> 19    hdd    10.91409          osd.19     up   1.00000  1.00000
> 22    ssd     0.09769          osd.22     up   1.00000  1.00000
> -13          25.56458      host k8s3
>  2    hdd    10.91409          osd.2      up   0.84998  1.00000
>  3    hdd     1.81940          osd.3      up   0.75002  1.00000
> 20    hdd    12.73340          osd.20     up   1.00000  1.00000
> 10    ssd     0.09769          osd.10     up   1.00000  1.00000
> -14          12.83107      host k8s4
>  0    hdd    10.91399          osd.0      up   1.00000  1.00000
>  5    hdd     1.81940          osd.5      up   1.00000  1.00000
> 11    ssd     0.09769          osd.11     up   1.00000  1.00000
> -2           14.65048      host k8s5
>  1    hdd     1.81940          osd.1      up   0.70001  1.00000
> 17    hdd    12.73340          osd.17     up   1.00000  1.00000
> 12    ssd     0.09769          osd.12     up   1.00000  1.00000
> -6           14.65048      host k8s6
>  4    hdd     1.81940          osd.4      up   0.75000  1.00000
> 16    hdd    12.73340          osd.16     up   0.95001  1.00000
> 13    ssd     0.09769          osd.13     up   1.00000  1.00000
> -3           23.74518      host k8s7
>  6    hdd    12.73340          osd.6      up   1.00000  1.00000
> 15    hdd    10.91409          osd.15     up   0.95001  1.00000
>  8    ssd     0.09769          osd.8      up   1.00000  1.00000
> -9           23.74606      host k8s8
>  7    hdd    14.55269          osd.7      up   1.00000  1.00000
> 18    hdd     9.09569          osd.18     up   1.00000  1.00000
>  9    ssd     0.09769          osd.9      up   1.00000  1.00000

Looks like one 100GB SSD OSD per host? This is, AIUI, the screaming minimum size for an OSD. With WAL, DB, cluster maps, and other overhead there doesn’t end up being much space left for payload data. On larger OSDs the overhead is much more into the noise floor.

Given the size of these SSD OSDs, I suspect at least one of the following is true:

1) They’re client aka desktop SSDs, not “enterprise”
2) They’re a partition of a larger SSD shared with other purposes

I suspect that this alone would be enough to frustrate the balancer, which AFAIK doesn’t take overhead into consideration. You might disable the balancer module, reset the override reweights to 1.00, and try the JJ balancer, but I dunno that it would be night vs day. (Rough commands below.)

> Note this cluster is in the middle of re-creating all the OSDs to
> modify the OSD allocation size

min_alloc_size? Were they created on an older Ceph release? The current defaults for [non]rotational media are both 4KB; the HDD default used to be 64KB but was changed, with some churn, around the Pacific / Octopus era IIRC.

If you’re re-creating to minimize space amp, does that mean you’re running RGW with a significant fraction of small objects? With RBD — or CephFS with larger files — that isn’t so much of an issue.

> I have scrubbing disabled since I'm
> basically rewriting just about everything in the cluster weekly right
> now but normally that would be on.
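To be concrete about the balancer / reweight / min_alloc_size points above, here is roughly what I would run. Treat it as a sketch only: the OSD IDs are just the first couple from your tree (repeat for the rest), and whether `ceph osd metadata` reports the allocation size depends on your release.

    # Stop the built-in balancer before making manual changes
    ceph balancer off

    # Reset the legacy override reweights back to 1.0, one or two at a time,
    # letting backfill settle in between
    ceph osd reweight 14 1.0
    ceph osd reweight 2 1.0

    # Check what allocation size an existing OSD was created with;
    # recent releases expose bluestore_min_alloc_size in the OSD metadata
    ceph osd metadata 14 | grep -i alloc

Once the reweights are back at 1.00 you can turn the module back on with `ceph balancer on`, or feed the cluster to the JJ balancer instead and compare the plans it proposes.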
>   cluster:
>     id:     ba455d73-116e-4f24-8a34-a45e3ba9f44c
>     health: HEALTH_WARN
>             noscrub,nodeep-scrub flag(s) set
>             546 pgs not deep-scrubbed in time
>             542 pgs not scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum e,f,g (age 7d)
>     mgr: a(active, since 7d)
>     mds: 1/1 daemons up, 1 hot standby
>     osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
>          flags noscrub,nodeep-scrub
>     rgw: 1 daemon active (1 hosts, 1 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   13 pools, 617 pgs
>     objects: 9.36M objects, 33 TiB
>     usage:   67 TiB used, 65 TiB / 132 TiB avail
>     pgs:     1778936/21708668 objects misplaced (8.195%)
>              516 active+clean
>              100 active+remapped+backfill_wait
>              1   active+remapped+backfilling
>
>   io:
>     client:   371 KiB/s rd, 2.8 MiB/s wr, 2 op/s rd, 7 op/s wr
>     recovery: 25 MiB/s, 6 objects/s
>
>   progress:
>     Global Recovery Event (7d)
>       [=======================.....] (remaining: 36h)
>
>> `ceph osd df`
>
> Note that these are not in a steady state right now. OSD 6 in
> particular was just re-created and is repopulating. A few of the
> reweights were set to deal with some gross issues in balance - when it
> all settles down I plan to optimize them.
>
> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS
> 14  hdd     5.45799   0.90002  5.5 TiB  3.0 TiB  3.0 TiB  2.0 MiB   11 GiB  2.4 TiB   55.51  1.09   72  up
> 19  hdd    10.91409   1.00000   11 TiB  6.2 TiB  6.2 TiB  3.1 MiB   16 GiB  4.7 TiB   57.12  1.12  144  up

Unless you were to carefully segregate larger and smaller HDDs into separate pools, right-sizing the PG count is tricky. 144 is okay; 72 is a bit low, upstream guidance notwithstanding. I would still bump some of your pg_nums a bit.

> 22  ssd     0.09769   1.00000  100 GiB  2.4 GiB  1.8 GiB  167 MiB  504 MiB   98 GiB    2.43  0.05   32  up
>  2  hdd    10.91409   0.84998   11 TiB  4.5 TiB  4.5 TiB  5.0 MiB  9.7 GiB  6.4 TiB   41.11  0.81   99  up
>  3  hdd     1.81940   0.75002  1.8 TiB  1.0 TiB  1.0 TiB  2.3 MiB  3.8 GiB  818 GiB   56.11  1.10   21  up
> 20  hdd    12.73340   1.00000   13 TiB  7.1 TiB  7.1 TiB  3.7 MiB   16 GiB  5.6 TiB   56.01  1.10  165  up
> 10  ssd     0.09769   1.00000  100 GiB  1.3 GiB  299 MiB  185 MiB  835 MiB   99 GiB    1.29  0.03   38  up
>  0  hdd    10.91399   1.00000   11 TiB  6.5 TiB  6.5 TiB  3.7 MiB   15 GiB  4.4 TiB   59.41  1.17  144  up
>  5  hdd     1.81940   1.00000  1.8 TiB  845 GiB  842 GiB  1.7 MiB  3.3 GiB  1018 GiB  45.36  0.89   23  up
> 11  ssd     0.09769   1.00000  100 GiB  3.1 GiB  1.3 GiB  157 MiB  1.6 GiB   97 GiB    3.09  0.06   33  up
>  1  hdd     1.81940   0.70001  1.8 TiB  983 GiB  979 GiB  1.3 MiB  3.4 GiB   880 GiB   52.76  1.04   26  up
> 17  hdd    12.73340   1.00000   13 TiB  7.3 TiB  7.2 TiB  3.6 MiB   15 GiB  5.5 TiB   56.95  1.12  159  up
> 12  ssd     0.09769   1.00000  100 GiB  1.5 GiB  120 MiB   55 MiB  1.3 GiB   99 GiB    1.49  0.03   21  up
>  4  hdd     1.81940   0.75000  1.8 TiB  1.0 TiB  1.0 TiB  2.5 MiB  3.0 GiB  820 GiB   55.98  1.10   24  up
> 16  hdd    12.73340   0.95001   13 TiB  7.6 TiB  7.5 TiB  7.9 MiB   16 GiB  5.2 TiB   59.32  1.17  171  up
> 13  ssd     0.09769   1.00000  100 GiB  2.4 GiB  528 MiB  196 MiB  1.7 GiB   98 GiB    2.38  0.05   33  up
>  6  hdd    12.73340   1.00000   13 TiB  1.7 TiB  1.7 TiB  1.3 MiB  4.5 GiB    11 TiB   13.66  0.27   48  up
> 15  hdd    10.91409   0.95001   11 TiB  6.5 TiB  6.5 TiB  5.2 MiB   13 GiB  4.4 TiB   59.42  1.17  155  up
>  8  ssd     0.09769   1.00000  100 GiB  1.9 GiB  1.1 GiB  116 MiB  788 MiB   98 GiB    1.95  0.04   26  up
>  7  hdd    14.55269   1.00000   15 TiB  7.8 TiB  7.7 TiB  3.9 MiB   16 GiB  6.8 TiB   53.32  1.05  172  up
> 18  hdd     9.09569   1.00000  9.1 TiB  4.9 TiB  4.9 TiB  3.9 MiB   11 GiB  4.2 TiB   53.96  1.06  109  up
>  9  ssd     0.09769   1.00000  100 GiB  2.2 GiB  391 MiB  264 MiB  1.6 GiB   98 GiB    2.25  0.04   40  up
>                       TOTAL    132 TiB   67 TiB   67 TiB  1.2 GiB  164 GiB   65 TiB   50.82
> MIN/MAX VAR: 0.03/1.17  STDDEV: 29.78
>
>> `ceph osd dump | grep pool`
>
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on pg_num_max 32 pg_num_min 1 application mgr

Check the CRUSH rule for this pool. On my clusters Rook creates it without specifying a device class, but the other pools get rules that do specify a device class. By way of the shadow CRUSH topology, this sort of looks like multiple CRUSH roots to the pg_autoscaler, which is why you get no output from the status below. I added a bit to the docs earlier this year to call this out. Perhaps the Rook folks on the list have thoughts about preventing this situation; I don’t recall if I created a GitHub issue for it. (Commands to check and fix are below.)

That said, I’m personally not a fan of the pg_autoscaler and tend to disable it; ymmv. Unless you enable the “bulk” option, it may well be that you have too few PGs for effective bin packing. Think about filling a 55-gallon drum with beach balls vs golf balls.

So many pools for such a small cluster ... are you actively using CephFS, RBD, *and* RGW? If not, I’d suggest removing whatever you aren’t using and adjusting pg_num for the pools you are using.

> pool 2 'myfs-metadata' replicated size 3 min_size 2 crush_rule 25 object_hash rjenkins pg_num 16 pgp_num 16
> pool 3 'myfs-replicated' replicated size 2 min_size 1 crush_rule 26 object_hash rjenkins pg_num 256 pgp_num 256
> pool 4 'pvc-generic-pool' replicated size 3 min_size 2 crush_rule 17 object_hash rjenkins pg_num 128 pgp_num 128
> pool 13 'myfs-eck2m2' erasure profile myfs-eck2m2_ecprofile size 4 min_size 3 crush_rule 8 pg_num 128 pgp_num 128
> pool 22 'my-store.rgw.otp' replicated size 3 min_size 2 crush_rule 24 pg_num 8 pgp_num 8
> pool 23 'my-store.rgw.buckets.index' replicated size 3 min_size 2 pg_num 8 pgp_num 8
> pool 24 'my-store.rgw.log' replicated size 3 min_size 2 crush_rule 23 pg_num 8 pgp_num 8
> pool 25 'my-store.rgw.control' replicated size 3 min_size 2 crush_rule 19 object_hash rjenkins pg_num 8 pgp_num 8
> pool 26 '.rgw.root' replicated size 3 min_size 2 crush_rule 18 pg_num 8 pgp_num 8
> pool 27 'my-store.rgw.buckets.non-ec' replicated size 3 min_size 2 pg_num 8 pgp_num 8
> pool 28 'my-store.rgw.meta' replicated size 3 min_size 2 pg_num 8 pgp_num 8
> pool 29 'my-store.rgw.buckets.data' erasure profile my-store.rgw.buckets.data_ecprofile size 4 min_size 3 pg_num 32 pgp_num 32 autoscale_mode on

Is that a 2,2 or a 3,1 profile?
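Concretely, here is how I’d check the two things I just asked about. A sketch only: the rule name "replicated-mgr-ssd" is made up, and I’m assuming you’d want .mgr on the SSD class like your other Rook-created pools; adjust to taste.

    # Which CRUSH rule is the .mgr pool using, and does it pin a device class?
    ceph osd pool get .mgr crush_rule
    ceph osd crush rule dump
    #   In a class-aware rule the take step reads "item_name": "default~ssd"
    #   (or ~hdd); a bare "default" means no device class was specified.

    # One way to fix it: create a class-aware rule and point .mgr at it
    ceph osd crush rule create-replicated replicated-mgr-ssd default host ssd
    ceph osd pool set .mgr crush_rule replicated-mgr-ssd

    # And to answer my own question about the EC data pool:
    ceph osd erasure-code-profile get my-store.rgw.buckets.data_ecprofile

Once every pool’s rule resolves to a single (shadow) root, the pg_autoscaler should stop seeing what looks like multiple roots, and autoscale-status should return output again.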
>
>> `ceph balancer status`
>
> This does have normal output when the cluster isn't in the middle of recovery.
>
> {
>     "active": true,
>     "last_optimize_duration": "0:00:00.000107",
>     "last_optimize_started": "Tue Nov 28 22:11:56 2023",
>     "mode": "upmap",
>     "no_optimization_needed": true,
>     "optimize_result": "Too many objects (0.081907 > 0.050000) are misplaced; try again later",
>     "plans": []
> }
>
>> `ceph osd pool autoscale-status`
>
> No output for this. I'm not sure why

See above; I suspected this.

> - this has given output in the
> past. Might be due to being in the middle of recovery, or it might be
> a Reef issue (I don't think I've looked at this since upgrading). In
> any case, PG counts are in the osd dump, and I have the hdd storage
> classes set to warn I think.
>
>> The balancer module can be confounded by certain complex topologies like multiple device classes and/or CRUSH roots.
>>
>> Since you’re using Rook, I wonder if you might be hitting something that I’ve seen myself; the above commands will tell the tale.
>
> Yeah, if it is designed for equally-sized OSDs then it isn't going to
> work quite right for me. I do try to keep hosts reasonably balanced,
> but not individual OSDs.

Ceph is fantastic for flexibility, but it’s not above giving us enough rope to hang ourselves with.

>
> --
> Rich

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
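And since I keep saying “bump your pg_nums”, the knobs look like this. Again a sketch: the pool names come from your osd dump, but the pg_num targets are illustrative, so pick power-of-two values that land the per-OSD PG ratio somewhere north of 100.

    # Per pool: either take the autoscaler out of the picture...
    ceph osd pool set my-store.rgw.buckets.data pg_autoscale_mode off

    # ...or keep it on and mark the big data pools as "bulk" so they are
    # given more PGs up front instead of starting tiny
    ceph osd pool set my-store.rgw.buckets.data bulk true
    ceph osd pool set myfs-eck2m2 bulk true

    # Then raise pg_num on the pools that hold most of the data; since
    # Nautilus, pgp_num follows along and the change is applied gradually
    ceph osd pool set my-store.rgw.buckets.data pg_num 128
    ceph osd pool set myfs-eck2m2 pg_num 256

The tiny RGW metadata pools at pg_num 8 can likely stay as they are; it’s the data pools, plus the CephFS and RBD pools, that dominate your overall PG ratio.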