Re: What is the problem with many PGs per OSD

On 10/10/24 08:01, Anthony D'Atri wrote:

>> The main problem was the increase in RAM use scaling with PGs, which in
>> normal operation is often fine but, as we all know, balloons in failure
>> conditions.
>
> Less so with BlueStore in my experience.  I think this in part surfaces a bit
> of Filestore legacy that we might re-examine now that Filestore is
> deprecated.


We have somewhat better observability around this now, but at Clyso we've
still encountered situations where even incredibly short PG log lengths can
balloon into huge amounts of RAM usage with EC, due to xattr rollback when
there are gigantic cumulative xattrs.  We've gotten around this by literally
dropping the PG log length to 1 (10 wasn't good enough!), but with more PGs
this could potentially be trickier.

Having said that, I'm a huge fan of increasing PG counts (at least for small
clusters) while decreasing PG log lengths.  What I would like to see us do is
make pglog length a per-pool attribute and favor automatically adjusting it
via the prioritycache system rather than defaulting to PG autoscaling.  I'd
like to see us have the flexibility to support more pools at higher default
PG counts.  Enough so that autoscaling becomes more of a last resort than the
first tool we reach for.
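
For anyone who wants to experiment with this today, the closest existing
knobs are the cluster-wide OSD PG log bounds (osd_min_pg_log_entries /
osd_max_pg_log_entries); the per-pool pglog attribute above is a wish, not
something that exists yet.  A minimal sketch, with purely illustrative
values:

    #!/usr/bin/env python3
    # Sketch: shrink the cluster-wide PG log bounds via the ceph CLI.
    # The values below are illustrative, not a recommendation; a shorter
    # log trades RAM for more backfill (instead of log-based recovery).
    import subprocess

    def ceph(*args: str) -> str:
        """Run a ceph CLI command and return its stdout."""
        return subprocess.run(["ceph", *args], check=True,
                              capture_output=True, text=True).stdout.strip()

    # Record the current values before changing anything.
    for opt in ("osd_min_pg_log_entries", "osd_max_pg_log_entries"):
        print(opt, "=", ceph("config", "get", "osd", opt))

    ceph("config", "set", "osd", "osd_min_pg_log_entries", "10")
    ceph("config", "set", "osd", "osd_max_pg_log_entries", "500")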

We'll also probably need to migrate to a stochastic sampling method for
things like pg stat updates in the mgr (or at least lower the update
frequency), but that's a different topic.



>> There are many developments that may have made things behave better
>
> Very, very much so.

Agreed.



>> but early on some clusters just couldn’t be recovered until they received
>> double their starting RAM and were babysat through careful
>> manually-orchestrated startup. (Or maybe worse — I forget.)
>
> I helped a colleague through just such a 40-hour outage; trust me, the only
> way it coulda been worse was if it were unrecoverable, as was the lab setup
> I described with a 9000 PG-per-OSD ratio.  Ask Michael Kidd about the SVL
> disaster, he probably remembers ;)

> That outage was in part precipitated (or at least exacerbated) by a user
> issuing a rather large number of snap trims at once.  I subsequently jacked
> up the snap trim cost and delay values.
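
The exact option names and values from that incident aren't given; on current
releases the usual snap-trim throttles look roughly like this (illustrative
values only, and the "cost" side depends on which op scheduler is in use):

    # Sketch: slow down snap trimming so it can't starve client I/O or
    # recovery.  Values are illustrative, not what was used in that outage.
    import subprocess

    for opt, val in [
        ("osd_snap_trim_sleep", "2.0"),    # seconds between trim ops (the "delay")
        ("osd_snap_trim_priority", "1"),   # deprioritize trim work vs. client ops
    ]:
        subprocess.run(["ceph", "config", "set", "osd", opt, val], check=True)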

> We did emergency RAM upgrades followed by babysitting.  My colleague wrote a
> Python script that watched MemAvailable and gracefully restarted the OSDs on
> a given system as it reached a low water mark, so that recovery could at
> least make incremental progress.  While that ran, I increased the markdown
> count and, I think, adjusted the reporters value.  The one and only time
> I’ve ever run “ceph osd pause”.  Luminous with mixed Filestore and
> BlueStore, and OSDs ranging from 1.6T to 3.84T.  That cluster had initially
> been deployed in only two racks, so the CRUSH rules weren’t ideal.  I
> subsequently refactored it and its siblings to improve the failure-domain
> situation and spread capacity.  In the end one larger and one smaller
> cluster became four clusters, each with nearly uniform OSD sizes, and all
> the Filestore OSDs got redeployed in the process.
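
The original script wasn't posted; a minimal sketch of the idea, assuming
systemd-managed OSDs and with made-up OSD ids and threshold, might look like
this:

    # Watchdog sketch: when MemAvailable drops below a low-water mark,
    # restart the local OSDs one at a time so recovery keeps making
    # incremental progress instead of the host OOM-killing everything.
    import subprocess
    import time

    LOW_WATER_KIB = 8 * 1024 * 1024    # hypothetical threshold: 8 GiB
    LOCAL_OSDS = [12, 13, 14, 15]      # hypothetical OSD ids on this host

    def mem_available_kib() -> int:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])    # reported in kB
        raise RuntimeError("MemAvailable not found in /proc/meminfo")

    while True:
        if mem_available_kib() < LOW_WATER_KIB:
            for osd in LOCAL_OSDS:
                subprocess.run(["systemctl", "restart", f"ceph-osd@{osd}"],
                               check=True)
                time.sleep(60)    # let each OSD rejoin before restarting the next
        time.sleep(30)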

> Two weeks previously I’d found that the mons in this very cluster had enough
> RAM to run but not enough to boot — a function of growth and dedicated mon
> nodes.  I’d arranged a Z0MG RAM upgrade on them.  If I hadn’t, that outage
> indeed would have been much, much worse.

>> Nobody’s run experiments, presumably because the current sizing guidelines
>> are generally good enough to be getting on with, for anybody who has the
>> resources to try and engage in the measurement work it would take to
>> re-validate them.  I will be surprised if anybody has information of the
>> sort you seem to be searching for.
>
> The inestimable Mr. Farnum here describes an opportunity for community
> contribution (nudge nudge wink wink ;)


--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



