>>> On Thu, 10 Oct 2024 08:53:08 +0000, Frank Schilder <frans@xxxxxx> said:

> The guidelines are *not* good enough for EC pools on large
> HDDs that store a high percentage of small objects, in our
> case, files.

Arguably *nothing* is good enough for that, because it is the
worst-case scenario (a Ceph instance I inherited was like that).
Ceph was designed around large swarms of small OSSes with one
(or at most a few) small devices each. Having lots of small OSDs
on many servers is a critical assumption behind many aspects of
Ceph (even if the opposite can be made to "workish").

This, though, is very right:

> In fact, they are really bad in that case and there were a
> number of recent ceph-user threads where a significant
> increase in PG count would probably help a lot.

> Including problems caused by very high object count per PG I'm
> dancing around on our cluster.

The main problem is not so much a high object count per PG; it
is that large PGs (in the legacy case above, around 200GB) have
to be "rebalanced" *in their entirety*, or go "degraded" again
as a whole, impacting enormous numbers of objects. This happens
whether their pool/profile is EC or replicated.

That point is part of my general observation that all systems
have both a user workload and a background admin workload (which
for storage tends to mean whole-instance scans), and many are
sized only to sustain the user workload. Ceph, being
self-healing and self-balancing, has a particularly high admin
workload (resync, rebalance, scrub, "backup", ...).

> For the guessing where the recommendation comes from, I'm
> actually leaning towards the "PGs should be limited in size"
> explanation. The recommendation of 100 PGs per OSD was good
> enough for a very long time

It was good for the intended use case: lots of small (by today's
standards, around 1TB) OSDs on many servers working in parallel.
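To make the PG-size point concrete, here is a back-of-the-envelope
sketch in Python. The cluster figures are made up for illustration
(plug in your own, e.g. from "ceph df" and "ceph osd df"), and the
rounding convention for the 100-PGs-per-OSD rule varies between
guides:

```python
import math

# Hypothetical numbers, chosen so the average PG size comes out
# near the 200GB legacy case mentioned above.
pool_data_tib = 600          # user data stored in the pool, TiB
pg_num = 3072                # PGs in the pool
osds = 96                    # OSDs backing the pool
target_pgs_per_osd = 100     # the classic guideline
replica_count = 3            # or k+m for an EC profile

# Average PG size: everything in a PG rebalances (or degrades)
# together, so this is roughly the unit of recovery.
pg_size_gib = pool_data_tib * 1024 / pg_num
print(f"average PG size: {pg_size_gib:.0f} GiB")

# The 100-PGs-per-OSD rule of thumb, rounded to a power of two:
raw = osds * target_pgs_per_osd / replica_count
suggested_pg_num = 2 ** round(math.log2(raw))
print(f"suggested pg_num: {suggested_pg_num}")
```

Doubling pg_num halves the average PG size, and with it the amount
of data that has to move as one unit when a PG is remapped.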
Note: HDDs larger than 1TB are not really suitable for
significant parallel user workloads plus most admin workloads:
http://www.sabi.co.uk/blog/17-one.html?170610#170610

> PGs were originally invented to chunk up large disks for
> distributed RAID. To keep all-to-all rebuild time constant
> independent of the scale of the cluster.

That, I guess, was a secondary goal: the main goal, I think, was
to reduce the size of the metadata keeping track of redundancy
(whether replicas or EC) from every-object-shard to every-PG,
that is, from a one-level table to a two-level table (a list of
PGs, plus a list of object shards per PG). The choice is then
between a larger number of PGs and a larger number of objects
per PG as the total object count grows, and I too prefer the
larger number of PGs.

> So I question here the anecdotal reports about the PG count
> being to blame alone. There have been a number of bugs
> discovered that were triggered by PG splitting.

I remember that the legacy (Pacific) Ceph instance I had
inherited did have lots of problems because of that.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx