On 10/10/24 08:01, Anthony D'Atri wrote:
The main problem was the increase in RAM use scaling with PGs, which in
normal operation is often fine but, as we all know, balloons in failure
conditions.
Less so with BlueStore in my experience. I think in part this surfaces a bit of Filestore legacy that we might re-examine now that Filestore is deprecated.
We have somewhat better observability around this now, but at Clyso
we've still encountered situations with EC where RAM usage can balloon
even with incredibly short PG log lengths, due to xattr rollback when
the cumulative xattrs are gigantic. We've gotten around this by
literally dropping the PG log length to 1 (10 wasn't good enough!), but
with more PGs this could potentially be trickier. Having said that, I'm
a huge fan of increasing PG counts (at least for small clusters) while
decreasing PG log lengths. What I would like to see us do is make PG
log length a per-pool attribute and favor automatically adjusting it
via the prioritycache system instead of defaulting to PG autoscaling.
I'd like to see us have the flexibility to support more pools at higher
default PG counts, enough that autoscaling becomes more of a last
resort than the first tool we reach for.
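To make "dropping the PG log length to 1" concrete, here's a sketch of the kind of settings involved, using today's global OSD options rather than the per-pool attribute I'm describing. The option names are the existing knobs; the values are illustrative, not a recommendation:

    # Sketch only: pin PG log lengths very low cluster-wide via the central
    # config database, using the existing global OSD options.  Values are
    # illustrative, not a recommendation.
    import subprocess

    PG_LOG_OPTIONS = {
        "osd_min_pg_log_entries": "1",
        "osd_max_pg_log_entries": "1",
        "osd_pg_log_dups_tracked": "1",
    }

    for name, value in PG_LOG_OPTIONS.items():
        # Equivalent to running: ceph config set osd <name> <value>
        subprocess.run(["ceph", "config", "set", "osd", name, value], check=True)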
We'll also probably need to migrate to a stochastic sampling method for
things like PG stat updates in the mgr (or at least lower the update
frequency), but that's a different topic.
There are many developments that may have made things behave better
Very, very much so.
Agreed.
but early on some clusters just couldn’t be recovered until they received
double their starting RAM and were babysat through careful
manually-orchestrated startup. (Or maybe worse — I forget.)
I helped a colleague through just such a 40-hour outage; trust me, the only way it coulda been worse was if it were unrecoverable, as was the lab setup I described with a 9000 ratio. Ask Michael Kidd about the SVL disaster; he probably remembers ;)
That outage was in part precipitated (or at least exacerbated) by a user issuing a rather large number of snap trims at once. I subsequently jacked up the snap trim cost and delay values.
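Concretely, that sort of throttling looks something like the following with today's option names. This is a sketch only: the values are illustrative rather than what was actually set, and osd_snap_trim_cost only comes into play on mClock-era releases:

    # Sketch only: slow down snap trimming via the config database.  Option
    # names are today's knobs; the values are illustrative, not the ones used
    # during that incident.
    import subprocess

    def set_osd_option(name, value):
        # Equivalent to running: ceph config set osd <name> <value>
        subprocess.run(["ceph", "config", "set", "osd", name, str(value)], check=True)

    set_osd_option("osd_snap_trim_sleep", 2.0)             # seconds to pause between trim ops
    set_osd_option("osd_snap_trim_cost", 4 * 1024 * 1024)  # raise per-trim cost (mClock scheduler)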
We did emergency RAM upgrades followed by babysitting. My colleague wrote a Python script that watched MemAvailable and gracefully restarted the OSDs on a given system when it hit a low water mark, so that recovery could at least make incremental progress. While that ran I increased the markdown count, and I think adjusted the reporters value. That was the one and only time I’ve ever run “ceph osd pause”.

This was Luminous with mixed Filestore and BlueStore, and OSDs ranging from 1.6T to 3.84T. That cluster had initially been deployed in only two racks, so the CRUSH rules weren’t ideal. I subsequently refactored it and its siblings to improve the failure domain situation and spread capacity. In the end one larger and one smaller cluster became four clusters, each with nearly uniform OSD sizes, and all the Filestore OSDs got redeployed in the process.
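For illustration, here is a minimal sketch of that kind of MemAvailable watchdog. It is not the original script; it assumes systemd-managed ceph-osd@ units, and the low-water mark, poll interval, and round-robin restart order are illustrative:

    # Not the original script -- a minimal sketch of the same idea.  Assumes
    # systemd-managed ceph-osd@<id>.service units; the low-water mark and
    # poll interval are illustrative.
    import re
    import subprocess
    import time

    LOW_WATER_KIB = 8 * 1024 * 1024   # 8 GiB MemAvailable low-water mark (illustrative)
    POLL_SECONDS = 30

    def mem_available_kib():
        # /proc/meminfo reports "MemAvailable:  <n> kB"
        with open("/proc/meminfo") as f:
            for line in f:
                m = re.match(r"MemAvailable:\s+(\d+)\s+kB", line)
                if m:
                    return int(m.group(1))
        raise RuntimeError("MemAvailable not found in /proc/meminfo")

    def local_osd_units():
        # List the ceph-osd@<id>.service units present on this host.
        out = subprocess.run(
            ["systemctl", "list-units", "--plain", "--no-legend", "ceph-osd@*.service"],
            capture_output=True, text=True, check=True).stdout
        return [line.split()[0] for line in out.splitlines() if line.strip()]

    def main():
        turn = 0
        while True:
            if mem_available_kib() < LOW_WATER_KIB:
                units = local_osd_units()
                if units:
                    # Gracefully restart one OSD per pass, round-robin, so
                    # recovery keeps making incremental progress.
                    subprocess.run(["systemctl", "restart", units[turn % len(units)]], check=True)
                    turn += 1
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()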
Two weeks previously I’d found that the mons in this very cluster had enough RAM to run but not enough to boot — a function of growth and dedicated mon nodes. I’d arranged a Z0MG RAM upgrade on them. If I hadn’t, that outage indeed would have been much, much worse.
Nobody’s run experiments, presumably because the current sizing guidelines
are generally good enough to be getting on with, for anybody who has the
resources to try and engage in the measurement work it would take to
re-validate them. I will be surprised if anybody has information of the
sort you seem to be searching for.
The inestimable Mr. Farnum here describes an opportunity for community contribution (nudge nudge wink wink ;)
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx