On 16/1/24 11:39, Anthony D'Atri wrote:
by “RBD for cloud”, do you mean VM / container general-purposes volumes
on which a filesystem is usually built? Or large archive / backup
volumes that are read and written sequentially without much concern for
latency or throughput?
General purpose volumes for cloud instance filesystems. Performance is
not high, but requirements are a moving target, and it performs better
than it used to, so decision makers and users are satisfied. If more
targeted requirements start to arise, of course architecture and costs
will change.
How many of those ultra-dense chassis in a cluster? Are all 60 off a
single HBA?
When we deploy prod RGW there, it may be 10-20 in a cluster. Yes, there
is a single 4-port miniSAS HBA per head node, and one of those for each
chassis.
I’ve experienced RGW clusters built from 4x 90 slot ultra-dense chassis,
each of which had 2x server trays, so effectively 2x 45 slot chassis
bound together. The bucket pool was EC 3,2 or 4,2. The motherboard was
…. odd, the kind a certain chassis vendor had a thing for at a certain
point in time. With only 12 DIMM slots each, they were chronically short
on RAM, and the single HBA was a bottleneck. Performance was acceptable for
the use-case …. at first. As the cluster filled up and got busier, that
was no longer the case. And these were 8TB capped drives. Not all
slots were filled, at least initially.
The index pool was on separate 1U servers with SATA SSDs.
This sounds similar to our plans, albeit with denser nodes and an NVMe
index pool. Also in our favour is that the users of the cluster we
currently intend for this have an established practice of storing large
objects.
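For reference, roughly how we'd expect to express that layout, assuming
the default RGW zone pool names (the ec42 profile name and the PG counts
are just placeholders):

   # EC profile for the bucket data pool, failure domain at the host level
   ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
   ceph osd pool create default.rgw.buckets.data 1024 1024 erasure ec42

   # keep the index pool on NVMe via a device-class-restricted CRUSH rule
   ceph osd crush rule create-replicated nvme-only default host nvme
   ceph osd pool create default.rgw.buckets.index 128 128 replicated nvme-only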
There were hotspots, usually relatively small objects that clients
hammered on. A single OSD restarting and recovering would tank the API;
we found it better to destroy and redeploy it. Expanding faster than
data was coming in was a challenge, as we had to throttle the heck out
of the backfill to avoid rampant slow requests and API impact.
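For context, the throttling was the usual backfill/recovery knobs turned
way down; something along these lines (values illustrative, not a
recommendation):

   ceph config set osd osd_max_backfills 1
   ceph config set osd osd_recovery_max_active 1
   ceph config set osd osd_recovery_sleep_hdd 0.2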
QLC with a larger number of OSD node failure domains was a net win in
that RAS was dramatically increased, and expensive engineer-hours
weren’t soaked up fighting performance and availability issues.
Thank you, this is helpful information. We haven't had that kind of
performance concern with our RGW on 24x 14TB nodes, but it remains to be
seen how 60x 22TB behaves in practice. Rebalancing is a big
consideration, particularly if we have a whole-node failure. We are
currently contemplating a PG split, and even more IO, since the growing
data volume and subsequent node additions have left us with a low PG/OSD
ratio that makes it hard to rebalance.
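If we go ahead, the likely shape of it is to raise pg_num and let the
mgr pace the split via the misplaced-ratio throttle; roughly (pool name
and target count are placeholders):

   ceph osd pool autoscale-status
   # bound how much data may be misplaced at once while pgp_num catches up
   ceph config set mgr target_max_misplaced_ratio 0.05
   ceph osd pool set default.rgw.buckets.data pg_num 2048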
What is QLC?
Fascinating to hear about destroy-redeploy being safer than a simple
restart-recover!
ymmv
Agreed. I guess I wanted to add the data point that these kinds of
clusters can and do make full sense in certain contexts, and to push back
a little against the "friends don't let friends use HDDs" dogma.
If spinners work for your
purposes and you don’t need IOPS or the ability to provision SSDs down
the road, more power to you.
I expect our road to be long, and SSD usage will grow as capital
dollars, performance, and TCO metrics change over time. For now, we limit
individual cloud volumes to 300 IOPS, doubled for those who need it.
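The cap itself is the kind of thing librbd QoS handles; one way to
express it, with placeholder pool/image names (some deployments enforce
this at the hypervisor layer instead):

   # pool-wide default for cloud volumes
   rbd config pool set cloud-volumes rbd_qos_iops_limit 300
   # per-image override for tenants on the doubled tier
   rbd config image set cloud-volumes/vol-abc123 rbd_qos_iops_limit 600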