On Tue, Nov 28, 2023 at 6:25 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

> Looks like one 100GB SSD OSD per host? This is AIUI the screaming minimum
> size for an OSD. With WAL, DB, cluster maps, and other overhead there
> doesn't end up being much space left for payload data. On larger OSDs the
> overhead is much more into the noise floor. Given the size of these SSD
> OSDs, I suspect at least one of the following is true:
>
> 1) They're client aka desktop SSDs, not “enterprise”
> 2) They're a partition of a larger OSD shared with other purposes

Yup. They're a mix of SATA SSDs and NVMes, but everything is consumer-grade.
They're only 10% full on average and I'm not super-concerned with
performance. If they did get full I'd allocate more space for them.
Performance is more than adequate for the very light loads they have.

> I suspect that this alone would be enough to frustrate the balancer, which
> AFAIK doesn't take overhead into consideration. You might disable the
> balancer module, reset the reweights to 1.00, and try the JJ balancer but
> I dunno that it would be night vs day.

I'm not really all that concerned with SSD balancing, since if data needs to
be moved around it happens almost instantaneously. They're small and on
10GbE. Also, there are no pools that cross the hdd/ssd device classes, so I
would hope the balancer wouldn't get confused by having both in the cluster.

> min_alloc_size? Were they created on an older Ceph release? Current
> defaults for [non]rotational media are both 4KB; they used to be 64KB but
> were changed with some churn … around the Pacific / Octopus era IIRC. If
> you're re-creating to minimize space amp, does that mean you're running
> RGW with a significant fraction of small objects? With RBD — or CephFS
> with larger files — that isn't so much an issue.

They were created with 4k min_alloc_size. I'm increasing this to 64k for the
hdd OSDs (rough commands sketched further down). I'm hoping that will improve
performance a bit on large files (average file size is multiple MB at least,
I think), and if nothing else it seems to greatly reduce OSD RAM consumption,
so that alone is useful.

> Unless you were to carefully segregate larger and smaller HDDs into
> separate pools, right-sizing the PG count is tricky. 144 is okay, 72 is a
> bit low, upstream guidance notwithstanding. I would still bump some of
> your pg_nums a bit.

The larger OSDs (which make up the bulk of the capacity) have 150+ PGs right
now. The small ones of course have far fewer. I might bump up one of the
CephFS pools as it is starting to accumulate a bit more data.

>> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on pg_num_max 32 pg_num_min 1 application mgr
>
> Check the CRUSH rule for this pool. On my clusters Rook creates it without
> specifying a device-class, but the other pools get rules that do specify a
> device class.

The .mgr pool has 1 PG and is set to use ssd devices only. Its 3 OSDs are all
SSDs right now.

> So many pools for such a small cluster … are you actively using CephFS,
> RBD, *and* RGW? If not, I'd suggest removing whatever you aren't using and
> adjusting pg_num for the pools you are using.

So, I'm using RBD on SSD (128 PGs - maybe a bit overkill for this, but those
OSDs don't have anything else going on), and the bulk of the storage is on
CephFS on HDD with two pools. I've been experimenting a bit with RGW, but
those pools are basically empty and mostly have 8 PGs each.

> Is that a 2,2 or 3,1 profile?

The EC pool? That is k=2, m=2. I am thinking about moving that to a 3+2 pool
once I'm done with all the migration, to be a bit more space-efficient, but I
only have 7 nodes and they aren't completely balanced, so I don't really want
to stripe the data more than that.
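If I do go that route, my rough plan (untested, and the profile/pool names
and pg_num below are just placeholders) is to create a new 3+2 profile and
pool and migrate into it, since an existing EC pool's profile can't be
changed in place:

    # new 3+2 profile on the hdd class, failure domain = host
    ceph osd erasure-code-profile set ec-3-2 k=3 m=2 \
        crush-failure-domain=host crush-device-class=hdd
    # new CephFS data pool using that profile; EC overwrites are needed
    # before CephFS can use it
    ceph osd pool create cephfs-data-ec32 64 64 erasure ec-3-2
    ceph osd pool set cephfs-data-ec32 allow_ec_overwrites true
    # attach it as an extra data pool, then point directories at it via
    # file layouts and copy the data over
    ceph fs add_data_pool <fs-name> cephfs-data-ec32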
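And for anyone following along on the min_alloc_size change I mentioned
above: it's just a config option, but it only takes effect for OSDs created
after it's set, so each HDD OSD has to be destroyed and redeployed to pick
it up (the exact redeploy steps depend on whether Rook or ceph-volume is
managing the OSD). Something like:

    # only applies to OSDs created after this point; existing OSDs keep the
    # min_alloc_size they were built with
    ceph config set osd bluestore_min_alloc_size_hdd 65536
    # sanity check on a redeployed OSD (osd.12 is just an example ID);
    # I believe recent releases report the value in the OSD metadata
    ceph osd metadata 12 | grep alloc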
It is interesting because Quincy had no issues with the autoscaler on the
exact same cluster config. It might be a Rook issue, or it might just be
because so many PGs are remapped. I'll take another look at that once it
reaches more of a steady state. In any case, if the balancer is designed
more for equal-sized OSDs, I can always just play with reweights to balance
things.

--
Rich