Re: What is the problem with many PGs per OSD

> The main problem was the increase in ram use scaling with PGs, which in
> normal operation is often fine but as we all know balloons in failure
> conditions.

Less so with BlueStore, in my experience. I think this is in part Filestore legacy that we might re-examine now that Filestore is deprecated.

> There are many developments that may have made things behave better

Very, very much so.

> but early on some clusters just couldn’t be recovered until they received
> double their starting ram and were babysat through careful
> manually-orchestrated startup. (Or maybe worse — I forget.)

I helped a colleague through just such a 40-hour outage; trust me, the only way it could have been worse is if it had been unrecoverable, as the lab setup I described with a 9000 ratio was.  Ask Michael Kidd about the SVL disaster, he probably remembers ;)

That outage was in part precipitated (or at least exacerbated) by a user issuing a rather large number of snap trims at once.  I subsequently jacked up the snap trim cost and delay values.
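
If anyone wants to do likewise, here's a rough sketch of that sort of tuning, assuming a release with the centralized config store ("ceph config set").  The option names and values below (osd_snap_trim_sleep, osd_snap_trim_cost) are from memory and vary by release and scheduler, so verify them with "ceph config help" before trusting them; on older releases you'd use injectargs or ceph.conf instead.

---- 8< ----
#!/usr/bin/env python3
"""Rough sketch: raise snap-trim throttling cluster-wide via the ceph CLI.

Assumes a release with the centralized config store ("ceph config set",
Mimic and later).  Option names and values are illustrative assumptions,
not recommendations -- check "ceph config help <name>" on your release.
"""
import subprocess

# Option -> value; adjust to taste and to what your release actually supports.
SNAP_TRIM_THROTTLES = {
    "osd_snap_trim_sleep": "2.0",      # seconds to sleep between trims (the "delay")
    "osd_snap_trim_cost": "4194304",   # per-op cost hint for the scheduler (the "cost")
}

def set_osd_option(name: str, value: str) -> None:
    """Apply one OSD option cluster-wide through the mon config store."""
    subprocess.run(["ceph", "config", "set", "osd", name, value], check=True)

if __name__ == "__main__":
    for name, value in SNAP_TRIM_THROTTLES.items():
        set_osd_option(name, value)
        print(f"set osd/{name} = {value}")
---- 8< ----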

We did emergency RAM upgrades followed by babysitting.  My colleague wrote a Python script that watched MemAvailable and gracefully restarted the OSDs on a given system when it reached a low-water mark, so recovery could at least make incremental progress.  While that ran I increased the markdown count, and I think adjusted the reporters value; it was the one and only time I’ve ever run “ceph osd pause”.

The cluster was Luminous with mixed Filestore and BlueStore, and OSDs ranging from 1.6T to 3.84T.  It had initially been deployed in only two racks, so the CRUSH rules weren’t ideal.  I subsequently refactored it and its siblings to improve the failure-domain situation and spread capacity: in the end one larger and one smaller cluster became four clusters, each with nearly uniform OSD sizes, and all the Filestore OSDs got redeployed in the process.
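
For the curious, here is a minimal sketch of that kind of watchdog (not my colleague's actual script, which I no longer have): poll /proc/meminfo and restart the local ceph-osd systemd units one at a time when MemAvailable drops below a low-water mark.  The threshold, unit naming, and stagger interval are assumptions you'd tune for your own nodes.

---- 8< ----
#!/usr/bin/env python3
"""Sketch of a MemAvailable watchdog, not the original script.

Polls /proc/meminfo and, when MemAvailable drops below a low-water mark,
restarts the local ceph-osd systemd units one at a time so recovery can
keep making incremental progress instead of OOMing.
Assumes systemd-managed OSDs (ceph-osd@<id>.service) and root privileges.
"""
import subprocess
import time

LOW_WATER_KB = 8 * 1024 * 1024   # 8 GiB; pick something sane for your nodes
POLL_SECONDS = 30

def mem_available_kb() -> int:
    """Return MemAvailable from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def local_osd_units() -> list:
    """List ceph-osd@*.service units known to systemd on this host."""
    out = subprocess.run(
        ["systemctl", "list-units", "--plain", "--no-legend", "ceph-osd@*.service"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[0] for line in out.splitlines() if line.strip()]

def main() -> None:
    while True:
        if mem_available_kb() < LOW_WATER_KB:
            for unit in local_osd_units():
                subprocess.run(["systemctl", "restart", unit], check=False)
                time.sleep(60)  # stagger restarts so peering load stays bounded
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
---- 8< ----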

Two weeks previously I’d found that the mons in this very cluster had enough RAM to run but not enough to boot — a function of growth and dedicated mon nodes.  I’d arranged a Z0MG RAM upgrade on them.  If I hadn’t, that outage indeed would have been much, much worse.

> 
> Nobody’s run experiments, presumably because the current sizing guidelines
> are generally good enough to be getting on with, for anybody who has the
> resources to try and engage in the measurement work it would take to
> re-validate them. I will be surprised if anybody has information of the
> sort you seem to be searching for.

The inestimable Mr. Farnum here describes an opportunity for community contribution (nudge nudge wink wink ;)

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



