On 10/10/24 08:01, Anthony D'Atri wrote:
The main problem was the increase in RAM use scaling with PGs, which in
normal operation is often fine but, as we all know, balloons in failure
conditions.
Less so with BlueStore in my experience. I think in part this surfaces a bit of Filestore legacy that we might re-examine now that Filestore is deprecated.
We have somewhat better observability around this now, but at Clyso
we've still encountered situations with EC where RAM usage can balloon
even with incredibly short PG log lengths, due to xattr rollback when
the cumulative xattrs are gigantic. We've gotten around this by
literally dropping the PG log length to 1 (10 wasn't good enough!), but
with more PGs this could potentially be trickier. Having said that, I'm
a huge fan of increasing PG counts (at least for small clusters) while
decreasing PG log lengths. What I would like to see us do is make PG
log length a per-pool attribute and favor automatically adjusting it
via the prioritycache system instead of defaulting to PG autoscaling.
I'd like to see us have the flexibility to support more pools at higher
default PG counts, enough that autoscaling becomes more of a last
resort than the first tool we reach for.
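To make "dropping the PG log length to 1" concrete, here's a sketch of the kind of settings involved, using today's global OSD options rather than the per-pool attribute I'm describing. The option names are the existing knobs; the values are illustrative, not a recommendation:

    # Sketch only: pin PG log lengths very low cluster-wide via the central
    # config database, using the existing global OSD options.  Values are
    # illustrative, not a recommendation.
    import subprocess

    PG_LOG_OPTIONS = {
        "osd_min_pg_log_entries": "1",
        "osd_max_pg_log_entries": "1",
        "osd_pg_log_dups_tracked": "1",
    }

    for name, value in PG_LOG_OPTIONS.items():
        # Equivalent to running: ceph config set osd <name> <value>
        subprocess.run(["ceph", "config", "set", "osd", name, value], check=True)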
We'll also probably need to migrate to a stochastic sampling method for
things like PG stat updates in the mgr (or at least lower the update
frequency), but that's a different topic.
There are many developments that may have made things behave better
Very, very much so.
Agreed.
but early on some clusters just couldn’t be recovered until they received
double their starting RAM and were babysat through careful
manually-orchestrated startup. (Or maybe worse — I forget.)
I helped a colleague through just such a 40-hour outage; trust me, the only way it coulda been worse was if it were unrecoverable, as was the lab setup I described with a 9000 ratio. Ask Michael Kidd about the SVL disaster; he probably remembers ;)
That outage was in part precipitated (or at least exacerbated) by a user issuing a rather large number of snap trims at once. I subsequently jacked up the snap trim cost and delay values.
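Concretely, that sort of throttling looks something like the following with today's option names. This is a sketch only: the values are illustrative rather than what was actually set, and osd_snap_trim_cost only comes into play on mClock-era releases:

    # Sketch only: slow down snap trimming via the config database.  Option
    # names are today's knobs; the values are illustrative, not the ones used
    # during that incident.
    import subprocess

    def set_osd_option(name, value):
        # Equivalent to running: ceph config set osd <name> <value>
        subprocess.run(["ceph", "config", "set", "osd", name, str(value)], check=True)

    set_osd_option("osd_snap_trim_sleep", 2.0)             # seconds to pause between trim ops
    set_osd_option("osd_snap_trim_cost", 4 * 1024 * 1024)  # raise per-trim cost (mClock scheduler)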
We did emergency RAM upgrades followed by babysitting. My colleague wrote a Python script that watched MemAvailable and gracefully restarted the OSDs on a given system when it hit a low water mark, so that recovery could at least make incremental progress. While that ran I increased the markdown count, and I think adjusted the reporters value. That was the one and only time I’ve ever run “ceph osd pause”.

This was Luminous with mixed Filestore and BlueStore, and OSDs ranging from 1.6T to 3.84T. That cluster had initially been deployed in only two racks, so the CRUSH rules weren’t ideal. I subsequently refactored it and its siblings to improve the failure domain situation and spread capacity. In the end one larger and one smaller cluster became four clusters, each with nearly uniform OSD sizes, and all the Filestore OSDs got redeployed in the process.
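For illustration, here is a minimal sketch of that kind of MemAvailable watchdog. It is not the original script; it assumes systemd-managed ceph-osd@ units, and the low-water mark, poll interval, and round-robin restart order are illustrative:

    # Not the original script -- a minimal sketch of the same idea.  Assumes
    # systemd-managed ceph-osd@<id>.service units; the low-water mark and
    # poll interval are illustrative.
    import re
    import subprocess
    import time

    LOW_WATER_KIB = 8 * 1024 * 1024   # 8 GiB MemAvailable low-water mark (illustrative)
    POLL_SECONDS = 30

    def mem_available_kib():
        # /proc/meminfo reports "MemAvailable:  <n> kB"
        with open("/proc/meminfo") as f:
            for line in f:
                m = re.match(r"MemAvailable:\s+(\d+)\s+kB", line)
                if m:
                    return int(m.group(1))
        raise RuntimeError("MemAvailable not found in /proc/meminfo")

    def local_osd_units():
        # List the ceph-osd@<id>.service units present on this host.
        out = subprocess.run(
            ["systemctl", "list-units", "--plain", "--no-legend", "ceph-osd@*.service"],
            capture_output=True, text=True, check=True).stdout
        return [line.split()[0] for line in out.splitlines() if line.strip()]

    def main():
        turn = 0
        while True:
            if mem_available_kib() < LOW_WATER_KIB:
                units = local_osd_units()
                if units:
                    # Gracefully restart one OSD per pass, round-robin, so
                    # recovery keeps making incremental progress.
                    subprocess.run(["systemctl", "restart", units[turn % len(units)]], check=True)
                    turn += 1
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()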
Two weeks previously I’d found that the mons in this very cluster had enough RAM to run but not enough to boot — a function of growth and dedicated mon nodes. I’d arranged a Z0MG RAM upgrade on them. If I hadn’t, that outage indeed would have been much, much worse.
Nobody’s run experiments, presumably because the current sizing guidelines
are generally good enough to be getting on with, for anybody who has the
resources to try and engage in the measurement work it would take to
re-validate them. I will be surprised if anybody has information of the
sort you seem to be searching for.
The inestimable Mr. Farnum here describes an opportunity for community contribution (nudge nudge wink wink ;)
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx