>>> On Thu, 10 Oct 2024 08:53:08 +0000, Frank Schilder <frans@xxxxxx> said:

> The guidelines are *not* good enough for EC pools on large
> HDDs that store a high percentage of small objects, in our
> case, files.

Arguably *nothing* is good enough for that, because it is the
worst-case scenario (a Ceph instance I inherited was like that).
Ceph was designed around large swarms of small OSSes with one
(or at most a few) small devices each. Having lots of small OSDs
on many servers is a critical assumption behind many aspects of
Ceph (even if the opposite can be made to "workish").

This, though, is very right:

> In fact, they are really bad in that case and there were a
> number of recent ceph-user threads where a significant
> increase in PG count would probably help a lot.

> Including problems caused by very high object count per PG I'm
> dancing around on our cluster.

The main problem is not so much a high object count per PG; it
is that large PGs (in the legacy case above, around 200GB) have
to be "rebalanced" *in their entirety*, or go "degraded" again
as a whole, impacting enormous numbers of objects. This happens
whether their pool/profile is EC or replicated.

That point is part of my general observation that all systems
have both a user workload and a background admin workload (which
for storage tends to mean whole-instance scans), and many are
sized only to sustain the user workload. Ceph, being
self-healing and self-balancing, has a particularly high admin
workload (resync, rebalance, scrub, "backup", ...).

> For the guessing where the recommendation comes from, I'm
> actually leaning towards the "PGs should be limited in size"
> explanation. The recommendation of 100 PGs per OSD was good
> enough for a very long time

It was good for the intended use case: lots of small (by today's
standards, around 1TB) OSDs on many servers working in parallel.
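To make the PG-size point concrete, here is a back-of-the-envelope
sketch in Python. The cluster figures are made up for illustration
(plug in your own, e.g. from "ceph df" and "ceph osd df"), and the
rounding convention for the 100-PGs-per-OSD rule varies between
guides:

```python
import math

# Hypothetical numbers, chosen so the average PG size comes out
# near the 200GB legacy case mentioned above.
pool_data_tib = 600          # user data stored in the pool, TiB
pg_num = 3072                # PGs in the pool
osds = 96                    # OSDs backing the pool
target_pgs_per_osd = 100     # the classic guideline
replica_count = 3            # or k+m for an EC profile

# Average PG size: everything in a PG rebalances (or degrades)
# together, so this is roughly the unit of recovery.
pg_size_gib = pool_data_tib * 1024 / pg_num
print(f"average PG size: {pg_size_gib:.0f} GiB")

# The 100-PGs-per-OSD rule of thumb, rounded to a power of two:
raw = osds * target_pgs_per_osd / replica_count
suggested_pg_num = 2 ** round(math.log2(raw))
print(f"suggested pg_num: {suggested_pg_num}")
```

Doubling pg_num halves the average PG size, and with it the amount
of data that has to move as one unit when a PG is remapped.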
Note: HDDs larger than 1TB are not really suitable for
significant parallel user workloads plus most admin workloads:
http://www.sabi.co.uk/blog/17-one.html?170610#170610

> PGs were originally invented to chunk up large disks for
> distributed RAID. To keep all-to-all rebuild time constant
> independent of the scale of the cluster.

That, I guess, was a secondary goal: the main goal, I think, was
to reduce the size of the metadata keeping track of redundancy
(whether replicas or EC) from every-object-shard to every-PG,
that is, from a one-level table to a two-level table (a list of
PGs, plus a list of object shards per PG). The choice is then
between a larger number of PGs and a larger number of objects
per PG as the total object count grows, and I too prefer the
larger number of PGs.

> So I question here the anecdotal reports about the PG count
> being to blame alone. There have been a number of bugs
> discovered that were triggered by PG splitting.

I remember that the legacy (Pacific) Ceph instance I had
inherited did have lots of problems because of that.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx