Re: What is the problem with many PGs per OSD

I know it doesn't answer your question; I just wanted to point out that I'd be interested as well to know what impact such configurations can have. :-) More comments inline.

Quoting Frank Schilder <frans@xxxxxx>:

Hi Eugen,

thanks for looking at this. I followed the thread you refer to and it doesn't answer my question. Unfortunately, the statement

... It seems to work well, no complaints yet, but note that it's an archive cluster, so
the performance requirements aren't very high. ...

is reproducing the rumor that many PGs somehow impact performance negatively. What is this based on? As I wrote, since the number of PGs per OSD times the number of objects per PG equals the number of objects per OSD, which is a constant, I don't see an immediate justification for the assumption that more PGs imply less performance. What do you base that on? I don't see any algorithms at work here for which splitting PGs should noticeably hurt performance.
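As a back-of-the-envelope illustration of that invariance (the object count below is made up), splitting only re-buckets a fixed object population, it does not add objects to the OSD:

    # An OSD holding a fixed number of objects; more PGs just means
    # smaller PGs, the total object count served by the OSD is unchanged.
    objects_per_osd = 1_200_000                      # hypothetical fill level
    for pgs_per_osd in (100, 200, 400, 800):
        objects_per_pg = objects_per_osd / pgs_per_osd
        print(f"{pgs_per_osd:4d} PGs/OSD -> {objects_per_pg:8.0f} objects/PG "
              f"(product = {pgs_per_osd * objects_per_pg:,.0f})")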

I just assume that once the PG count reaches a certain number, the increased amount of parallel requests could overload an OSD. But I have no real proof for that assumption. I tend to be quite hesitant to "play around" on customer clusters and rather stick to the defaults as closely as possible.

On the contrary, my experience with the pools that have the highest PG/OSD count rather says that reducing the number of objects per PG by splitting PGs speeds everything up. Yet the developers set a quite low limit without really explaining why. The docs just restate a rumor without any solid information a sysadmin/user could use to decide whether or not it's worth going high. This is of real interest, because there is probably a critical value beyond which the drawbacks (if they actually exist) outweigh the benefits, and without solid information about which algorithms do the main work and what complexity class they have, it is impossible to make an informed decision or to diagnose whether that point has been reached.

I second that; we have usually benefited from PG splits on every cluster we maintain as well. But at the same time we tried to stay within the recommendations, as already stated. Many default values don't match real-world deployments; I've learned that repeatedly in recent years, both in Ceph and OpenStack. Maybe those recommendations are a bit outdated, but I'd like to learn as well how far one could go and which impacts are to be expected. Unfortunately, I only have a couple of virtual test clusters; I'd love to have a hardware test cluster to play with. :-D

Do you have performance metrics before/after? Did you actually observe any performance degradation? Was there an increased memory consumption? Anything that justifies making a statement alluding to (potential) negative performance impact?

Unfortunately, I don't have access to the cluster or metrics. And the retention time of their Prometheus instance is not very long, so no, I don't have anything to show. I can ask them if they did monitor that by any chance, but I'm not very confident that they did. :-/

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Wednesday, October 9, 2024 9:24 AM
To: ceph-users@xxxxxxx
Subject:  Re: What is the problem with many PGs per OSD

Hi,

half a year ago I asked a related question
(https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/I3TQC42KN2FCKYV774VWJ7AVAWTTXEAA/#GLALD3DSTO6NSM2DY2PH4UCE4UBME3HM), when we needed to split huge PGs on a customer cluster. I wasn't sure either how far we could go with the ratio PGs per OSD. We increased the pg_num to the target value (4096) before new hardware arrived, temporarily the old OSDs (240 * 8 TB) had around 300 PGs/OSD, it wasn't well balanced yet. The new OSDs are larger drives (12 TB), but having the same capacity per node, and after all remapping finished and the balancer did its job, they're now at around 250 PGs/OSD for the smaller drives, 350 PGs/OSD on the larger drives. All OSDs are spinners with rocksDB on SSDs. It seems to work well, no complaints yet, but note that it's an archive cluster, so the performance requirements aren't very high. It's more about resiliency and availibilty in this
case.
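For reference, the ratio can be estimated from the pool layout before splitting; a rough sketch (the pool values below are placeholders, not the customer's actual layout):

    # Replica-weighted PG count spread over all OSDs: each PG lives on
    # 'copies' OSDs (replica size, or k+m for an EC pool).
    def pgs_per_osd(pools, num_osds):
        # pools: list of (pg_num, copies)
        return sum(pg_num * copies for pg_num, copies in pools) / num_osds

    # Placeholder: one EC 8+3 data pool split to 4096 PGs plus a small
    # replicated metadata pool, on 240 OSDs.
    print(round(pgs_per_osd([(4096, 11), (256, 3)], num_osds=240)))   # ~191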

This is all I can contribute to your question.

Zitat von Anthony D'Atri <aad@xxxxxxxxxxxxxx>:

I’ve sprinkled minimizers below.  Free advice and worth every penny.
 ymmv.  Do not taunt Happy Fun Ball.


During a lot of discussions in the past, the comment that having
"many PGs per OSD can lead to issues" came up without ever
explaining what these issues will (not might!) be or how one would
notice. It comes up as kind of a rumor without any factual or even
anecdotal backing.

A handful of years ago Sage IIRC retconned PG ratio guidance from
200 to 100 to help avoid OOMing, the idea being that more PGs = more
RAM usage on each daemon that stores the maps.  With BlueStore’s
osd_memory_target, my sense is that the ballooning seen with
Filestore is much less of an issue.

As far as I can tell from experience, any increase in resource
utilization due to an increase of the PG count per OSD is more than
offset by the performance gain from the reduced size of the PGs.
Everything seems to benefit from smaller PGs: recovery, user IO,
scrubbing.

My understanding is that there is serialization in the PG code, and
thus the PG ratio can be thought of as the degree of parallelism the
OSD device can handle.  SAS/SATA SSDs don’t seek, so they can handle
more than HDDs, and NVMe devices can handle more than SAS/SATA.

Yet, I'm holding back on an increase of PG count due to these rumors.

My personal sense:

HDD OSD:  PG ratio 100-200
SATA/SAS SSD OSD: 200-300
NVMe SSD OSD: 300-400

These are not empirical figures.  ymmv.
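If you want to turn a target ratio into a pg_num for the dominant
pool, the usual back-of-the-envelope goes like this (illustrative
only; with several busy pools you'd split this budget between them):

    import math

    # Rough pg_num suggestion for a single dominant pool:
    # pg_num ~= OSD count * target PG ratio / replica count,
    # rounded to the nearest power of two.
    def suggest_pg_num(num_osds, target_ratio, replicas):
        raw = num_osds * target_ratio / replicas
        return 2 ** round(math.log2(raw))

    print(suggest_pg_num(num_osds=240, target_ratio=100, replicas=3))   # -> 8192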


My situation: I would like to split PGs on large HDDs. Currently,
we have on average 135PGs per OSD and I would like to go for
something like 450.

The good Mr. Nelson may have more precise advice, but my personal
sense is that I wouldn’t go higher than 200 on an HDD.  If you were
at like 20 (I’ve seen it!) that would be a different story; my sense
is that there are diminishing returns over, say, 150.  Seek thrashing
fu, elevator scheduling fu, op re-ordering fu, etc.  Assuming you’re
on Nautilus or later, it doesn’t hurt to experiment with your actual
workload, since you can scale pg_num back down.  Without Filestore
colocated journals, the seek thrashing may be less of an issue than
it used to be.
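For the experiment itself, a minimal scripting sketch (the pool name
is hypothetical; the underlying 'ceph osd pool set' calls are the
standard way to change pg_num and to keep the autoscaler from
fighting you):

    import subprocess

    def set_pg_num(pool, pg_num):
        # Pin the autoscaler first so it doesn't revert the change,
        # then set the new pg_num on the pool.
        subprocess.run(["ceph", "osd", "pool", "set", pool,
                        "pg_autoscale_mode", "off"], check=True)
        subprocess.run(["ceph", "osd", "pool", "set", pool,
                        "pg_num", str(pg_num)], check=True)

    set_pg_num("rbd-test", 512)     # split up for the experiment...
    # ...and on Nautilus or later the same call with a smaller value
    # merges back down:
    # set_pg_num("rbd-test", 256)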

I heard in related rumors that some users have 1000+ PGs per OSD
without problems.

On spinners?  Or NVMe?  On a 60-120 TB NVMe OSD I’d be sorely
tempted to try 500-1000.

I would be very much interested in a non-rumor answer, that is, not
an answer of the form "it might use more RAM", "it might stress
xyz". I don't care what a rumor says it might do. I would like to
know what it will do.

It WILL use more RAM.

I'm looking for answers of the form "a PG per OSD requires X amount
of RAM fixed plus Y amount per object”

Derive the size of your map and multiply by the number of OSDs per
system.  My sense is that it’s on the order of MBs per OSD.  After a
certain point you might get more benefit from spending that extra
RAM on a higher osd_memory_target instead.
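A crude model of that arithmetic, with every number a guess to be
replaced by measurements from your own cluster (e.g. the size of the
file produced by 'ceph osd getmap -o osdmap.bin' and your actual
map-cache window):

    # Back-of-the-envelope map-cache RAM per host; all inputs are guesses.
    osdmap_bytes  = 1 * 1024 * 1024    # assumed size of one cached map epoch
    cached_epochs = 50                 # assumed number of epochs each OSD keeps
    osds_per_host = 12

    per_host = osdmap_bytes * cached_epochs * osds_per_host
    print(f"rough map-cache footprint per host: {per_host / 2**20:.0f} MiB")  # 600 MiB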

or "searching/indexing stuff of kind A in N PGs per OSD requires N
log N/N²/... operations", "peering of N PGs per OSD requires N/N
log N/N²/N*#peers/... operations". In other words, what are the
*actual* resources required to host N PGs with M objects on an OSD
(note that N*M is a constant per OSD). With that info one could
make an informed decision, informed by facts not rumors.

An additional question of interest is: Has anyone ever observed any
detrimental effects of increasing the PG count per OSD to large
values (>500)?

Consider this scenario:

An unmanaged lab setup used for successive OpenStack deployments,
each of which created two RBD pools and the panoply of RGW pools.
Which nobody cleaned up before redeploys, so they accreted like
plaque in the arteries of an omnivore.  Such that the PG ratio hits
9000.  Yes, 9000. Then the building loses power.  The systems don’t
have nearly enough RAM to boot, peer, and activate, so the entire
cluster has to be wiped and redeployed from scratch.  An extreme
example, but remember that I don’t make stuff up.


Thanks a lot for any clarifications in this matter!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



