Re: What is the problem with many PGs per OSD

Frank Schilder <frans@xxxxxx> · Wed, 9 Oct 2024 18:39:10 +0000

Hi Anthony,

replying here to points that were somewhat outside the scope of my original question:

> > That's why deploying multiple OSDs per SSD is such a great way to
> > improve performance on devices where 4K random IO throughput scales with iodepth.
>
> Mark’s testing has shown this to not be so much the case with recent releases

I use Octopus, and there this is very prominent. The kv_sync_thread is a bottleneck at least until Pacific. I'm not sure if this is really resolved. As far as I understood, the devs were looking into splitting this thread up and decided not to, because it is easier to deploy multiple OSDs per disk. With recent RocksDB format changes (new sharding) this thread might use fewer resources, speeding things up. It's still a synchronisation point for concurrent operations, though.

> > 9000 PGs/OSD was too much for what kind of system? What CPU? How much
> > RAM? How many OSDs per host?
>
> Those were Cisco UCS… C240m3.  Dual 16c Sandy Bridge IIRC, 10x SATA HDD
> OSDs @ 3TB, 64GB I think.

And you say the OSDs were going OOM on restart? It might be possible that the PG count played a role. More likely it was something like the pglog-dup bug though, which had exactly this as its hallmark: insane memory ballooning on OSD startup.

The question is whether it was really the resource requirements due to PG count or something else. That is indeed a question I would like the devs to answer: Is there a bug in the code/rocksdb that is more likely to be triggered at high PG counts, and is that why the weird recommendation is made?

I consider this recommendation of PG count per OSD weird, because it's made independent of anything else that's available: network, CPU, RAM, disk size, disk performance, etc. For literally any other parameter there are tuning guides that explain how to ramp things up, when requirements will peak and how much to have spare to survive peak loads. The PG count per OSD is a striking exception. It's just a number (well, a range with 100 recommended and 200 as a max: https://docs.ceph.com/en/latest/rados/operations/pgcalc/#keyDL). It just is. And this doesn't make any sense unless there is something really evil lurking in the dark.

For comparison, guidance that does make sense would be something like 100 PGs per TB. That I would vaguely understand: it keeps the average PG size constant at a maximum of about 10G.
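
To make the arithmetic behind that rule of thumb explicit, here is a minimal Python sketch. The function names, the decimal TB-to-GB conversion and the fill_ratio parameter are my own assumptions for illustration, not anything Ceph provides. "PGs per OSD" here counts every PG replica or shard hosted on the OSD.

```python
def pgs_per_osd(osd_size_tb, pgs_per_tb=100):
    """PG count per OSD implied by a fixed PGs-per-TB rule (hypothetical helper)."""
    return osd_size_tb * pgs_per_tb


def avg_pg_size_gb(osd_size_tb, pg_count, fill_ratio=1.0):
    """Average data per PG on one OSD, in GB.

    Assumes 1 TB = 1000 GB and that data is spread evenly over the PGs.
    fill_ratio is the fraction of the OSD that is actually used (1.0 = full).
    """
    return osd_size_tb * 1000 * fill_ratio / pg_count


if __name__ == "__main__":
    size_tb = 16
    for pgs in (200, 500, 1000, pgs_per_osd(size_tb)):
        print(f"{size_tb} TB OSD, {pgs} PGs: "
              f"{avg_pg_size_gb(size_tb, pgs):.0f} GB per PG when full")
```

For a 16T spinner the 100 PGs per TB rule gives 1600 PGs per OSD at 10G each when the disk is full, whereas the same disk at 200 PGs per OSD would average 80G per PG.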

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: Wednesday, October 9, 2024 3:52 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  What is the problem with many PGs per OSD

> Unfortunately, it doesn't really help answering my questions either.

 Sometimes the best we can do is grunt and shrug :-/. Before Nautilus we couldn’t merge PGs, so we could raise pg_num for a pool but not decrease it, so a certain fear of overshooting was established.  Mark is the go-to here.

> That's why deploying multiple OSDs per SSD is such a great way to improve performance on devices where 4K random IO throughput scales with iodepth.

Mark’s testing has shown this to not be so much the case with recent releases. Do you still see this?  Until recently I was expecting 30TB TLC SSDs for RBD, and in the next year perhaps as large as 122T for object, so I was thinking of splitting just because of the size - and the systems in question were overequipped with CPU.

> Memory: I have never used file store, so can't relate to that.

XFS - I experienced a lot of ballooning, to the point of OOMkilling.  In mixed clusters under duress the BlueStore OSDs consistently behaved better.

> 9000 PGs/OSD was too much for what kind of system? What CPU? How much RAM? How many OSDs per host?

Those were Cisco UCS… C240m3.  Dual 16c Sandy Bridge IIRC, 10x SATA HDD OSDs @ 3TB, 64GB I think.

> Did it even work with 200PGs with the same data (recovery after power loss)?

I didn’t have remote power control, and being a shared lab it was difficult to take a cluster down for such testing.  We did have a larger integration cluster (450 OSDs) with a PG ratio of ~200 where we tested a rack power drop.  Ceph was fine (this was …. Firefly I think) but the LSI RoC HBAs lost data like crazy due to hardware, firmware, and utility bugs.

> Was it maybe the death spiral (https://ceph-users.ceph.narkive.com/KAzvjjPc/explanation-for-ceph-osd-set-nodown-and-ceph-osd-cluster-snap) that prevented the cluster from coming up and not so much the PG count?

Not in this case, though I’ve seen a similar cascading issue in another context.

> Rumors: Yes, 1000 PGs/OSD on spinners without issues. I guess we are not talking about barely working home systems with lack of all sorts of resources here.

I’d be curious how such systems behave under duress.  I’ve seen a cluster that had grown - the mons ended up with enough RAM to run but not to boot, so I did urgent RAM upgrades on the mons.  That was the mixed Filestore / BlueStore cluster (Luminous 12.2.2) where the Filestore OSDs were much more affected by a cascading event than the [mostly larger] BlueStore OSDs.  I suspect that had the whole cluster been BlueStore it might not have cascaded.

>
> The goal: Let's say I want to go 500-1000PGs/OSD on 16T spinners to trim PGs to about 10-20G each. What are the resources that count will require compared with, say, 200 PGs/OSD? That's the interesting question and if I can make the resources available I would consider doing that.

The proof is in the proverbial pudding.  Bump up pg_num on pools and see how the average / P90 ceph-osd process size changes?  Grafana FTW.  osd_map_cache_size I think defaults to 50 now; I want to say it used to be much higher.
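
If Grafana isn't handy, something like the following rough Python sketch could take the same measurement directly on an OSD host, once before and once after the pg_num change. It assumes Linux /proc, that the daemons show up with the process name ceph-osd, and a nearest-rank definition of P90. The function names are made up for illustration.

```python
import math
import os
import statistics


def osd_rss_bytes(proc_root="/proc"):
    """Resident set size in bytes of every process named 'ceph-osd'.

    Reads <proc_root>/<pid>/comm and <proc_root>/<pid>/statm; the second
    field of statm is the resident size in pages.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    sizes = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        try:
            with open(f"{proc_root}/{pid}/comm") as f:
                if f.read().strip() != "ceph-osd":
                    continue
            with open(f"{proc_root}/{pid}/statm") as f:
                sizes.append(int(f.read().split()[1]) * page_size)
        except OSError:
            continue  # process exited or is not readable; skip it
    return sizes


def summarize(sizes):
    """Return (mean, P90) of a list of sizes, or None if the list is empty."""
    if not sizes:
        return None
    ordered = sorted(sizes)
    rank = math.ceil(0.9 * len(ordered))  # nearest-rank percentile
    return statistics.mean(ordered), ordered[rank - 1]
```

Calling summarize(osd_rss_bytes()) on each host before and after the split gives the two numbers to compare. RSS will also move with osd_memory_target and cache behaviour, so treat a single sample with suspicion.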

>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <aad@xxxxxxxxxxxxxx>
> Sent: Wednesday, October 9, 2024 2:40 AM
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  What is the problem with many PGs per OSD
>
> I’ve sprinkled minimizers below.  Free advice and worth every penny.  ymmv.  Do not taunt Happy Fun Ball.
>
>
>> during a lot of discussions in the past the comment that having "many PGs per OSD can lead to issues" came up without ever explaining what these issues will (not might!) be or how one would notice. It comes up as kind of a rumor without any factual or even anecdotal backing.
>
> A handful of years ago Sage IIRC retconned PG ratio guidance from 200 to 100 to help avoid OOMing, the idea being that more PGs = more RAM usage on each daemon that stores the maps.  With BlueStore’s osd_memory_target, my sense is that the ballooning seen with Filestore is much less of an issue.
>
>> As far as I can tell from experience, any increase of resource utilization due to an increase of the PG count per OSD is more than offset by the performance impact of the reduced size of the PGs. Everything seems to benefit from smaller PGs, recovery, user IO, scrubbing.
>
> My understanding is that there is serialization in the PG code, and thus the PG ratio can be thought of as the degree of parallelism the OSD device can handle.  SAS/SATA SSDs don’t seek so they can handle more than HDDs, and NVMe devices can handle more than SAS/SATA.
>
>> Yet, I'm holding back on an increase of PG count due to these rumors.
>
> My personal sense:
>
> HDD OSD:  PG ratio 100-200
> SATA/SAS SSD OSD: 200-300
> NVMe SSD OSD: 300-400
>
> These are not empirical figures.  ymmv.
>
>
>> My situation: I would like to split PGs on large HDDs. Currently, we have on average 135PGs per OSD and I would like to go for something like 450.
>
> The good Mr. Nelson may have more precise advice, but my personal sense is that I wouldn’t go higher than 200 on an HDD.  If you were at like 20 (I’ve seen it!) that would be a different story, but my sense is that there are diminishing returns over say 150.  Seek thrashing fu, elevator scheduling fu, op re-ordering fu, etc.  Assuming you’re on Nautilus or later, it doesn’t hurt to experiment with your actual workload since you can scale pg_num back down.  Without Filestore colocated journals, the seek thrashing may be less of an issue than it used to be.
>
>> I heard in related rumors that some users have 1000+ PGs per OSD without problems.
>
> On spinners?  Or NVMe?  On a 60-120 TB NVMe OSD I’d be sorely tempted to try 500-1000.
>
>> I would be very much interested in a non-rumor answer, that is, not an answer of the form "it might use more RAM", "it might stress xyz". I don't care what a rumor says it might do. I would like to know what it will do.
>
> It WILL use more RAM.
>
>> I'm looking for answers of the form "a PG per OSD requires X amount of RAM fixed plus Y amount per object”
>
> Derive the size of your map and multiply by the number of OSDs per system.  My sense is that it’s on the order of MBs per OSD.  After a certain point the RAM delta might have more impact by raising osd_memory_target instead.
>
>> or "searching/indexing stuff of kind A in N PGs per OSD requires N log N/N²/... operations", "peering of N PGs per OSD requires N/N log N/N²/N*#peers/... operations". In other words, what are the *actual* resources required to host N PGs with M objects on an OSD (note that N*M is a constant per OSD). With that info one could make an informed decision, informed by facts not rumors.
>>
>> An additional question of interest is: Has anyone ever observed any detrimental effects of increasing the PG count per OSD to large values > 500?
>
> Consider this scenario:
>
> An unmanaged lab setup used for successive OpenStack deployments, each of which created two RBD pools and the panoply of RGW pools.  Which nobody cleaned up before redeploys, so they accreted like plaque in the arteries of an omnivore.  Such that the PG ratio hits 9000.  Yes, 9000. Then the building loses power.  The systems don’t have nearly enough RAM to boot, peer, and activate, so the entire cluster has to be wiped and redeployed from scratch.  An extreme example, but remember that I don’t make stuff up.
>
>>
>> Thanks a lot for any clarifications in this matter!
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>