Re: recommendation for barebones server with 8-12 direct attach NVMe?



On 16/1/24 11:39, Anthony D'Atri wrote:
by “RBD for cloud”, do you mean VM / container general-purpose volumes on which a filesystem is usually built?  Or large archive / backup volumes that are read and written sequentially without much concern for latency or throughput?

General purpose volumes for cloud instance filesystems. Performance is not high, but requirements are a moving target, and it performs better than it used to, so decision makers and users are satisfied. If more targeted requirements start to arise, of course architecture and costs will change.

How many of those ultra-dense chassis in a cluster?  Are all 60 off a single HBA?

When we deploy prod RGW there it may be 10-20 in a cluster. Yes, there is a single HBA with 4 miniSAS ports per head node, one HBA for each chassis.

I’ve experienced RGW clusters built from 4x 90-slot ultra-dense chassis, each of which had 2x server trays, so effectively 2x 45-slot chassis bound together.  The bucket pool was EC 3,2 or 4,2.  The motherboard was … odd, of a sort a certain chassis vendor had a thing for at a certain point in time.  With only 12 DIMM slots each, they were chronically short on RAM, and the single HBA was a bottleneck.  Performance was acceptable for the use case … at first.  As the cluster filled up and got busier, that was no longer the case.  And these were 8TB capped drives; not all slots were filled, at least initially.

The index pool was on separate 1U servers with SATA SSDs.

This sounds similar to our plans, albeit with denser nodes and an NVMe index pool. Also in our favour is that the users of the cluster we are currently intending for this have established a practice of storing large objects.

There were hotspots, usually relatively small objects that clients hammered on.  A single OSD restarting and recovering would tank the API; we found it better to destroy and redeploy it.   Expanding faster than data was coming in was a challenge, as we had to throttle the heck out of the backfill to avoid rampant slow requests and API impact.
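For anyone wanting to replicate that throttling, Ceph's recovery/backfill knobs are the usual levers. A minimal sketch with `ceph config set` — the values here are illustrative starting points, not recommendations:

```shell
# Throttle backfill/recovery across all OSDs (illustrative values)
ceph config set osd osd_max_backfills 1          # concurrent backfills per OSD
ceph config set osd osd_recovery_max_active 1    # concurrent recovery ops per OSD
ceph config set osd osd_recovery_sleep_hdd 0.2   # seconds to sleep between recovery ops on HDD OSDs
```

The sleep option in particular trades longer recovery time for lower impact on client I/O, which matters most on spinners.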

QLC with a larger number of OSD node failure domains was a net win in that RAS was dramatically increased, and expensive engineer-hours weren’t soaked up fighting performance and availability issues.

Thank you, this is helpful information. We haven't had that kind of performance concern with our RGW on 24x 14TB nodes, but it remains to be seen how 60x 22TB behaves in practice. Rebalancing is a big consideration, particularly if we have a whole-node failure. We are currently contemplating a PG split, and even more IO, since the growing data volume and subsequent node additions have left us with a low PG/OSD ratio that makes it hard to rebalance.
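For reference, a PG split of the kind contemplated above comes down to inspecting the current ratio and raising `pg_num`; a sketch, where the pool name and target value are placeholders for your own:

```shell
# Check current PG counts and the per-OSD spread (PGS column)
ceph osd pool get default.rgw.buckets.data pg_num
ceph osd df
# Raise pg_num (power of two); since Nautilus, pgp_num is ramped up
# gradually behind it so data movement is spread out
ceph osd pool set default.rgw.buckets.data pg_num 4096
```

On a busy cluster it may be worth combining this with the backfill throttles, since the split itself generates a wave of data movement.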

What is QLC?

Fascinating to hear about destroy-redeploy being safer than a simple restart-recover!
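For the curious, a destroy-and-redeploy cycle like the one described above roughly looks like this — the OSD id and device here are placeholders. `ceph osd destroy` wipes the OSD's data and keys but keeps its id and CRUSH position, so the replacement backfills fresh rather than replaying a long recovery backlog:

```shell
# Destroy OSD 42 in place, preserving its id and CRUSH location (placeholders)
ceph osd destroy 42 --yes-i-really-mean-it
# Wipe the device and recreate the OSD with the same id
ceph-volume lvm zap /dev/sdx --destroy
ceph-volume lvm create --osd-id 42 --data /dev/sdx
```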


Agreed. I guess I wanted to add the data point that these kinds of clusters can and do make full sense in certain contexts, and push back a little against the "friends don't let friends use HDDs" dogma.

If spinners work for your purposes and you don’t need IOPS or the ability to provision SSDs down the road, more power to you.

I expect our road to be long, and SSD usage will grow as the capital dollars, performance and TCO metrics change over time. For now, we limit individual cloud volumes to 300 IOPS, doubled for those who need it.
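A per-volume cap like that can be applied with librbd's QoS settings; a sketch, where the pool and image names are placeholders:

```shell
# Cap all volumes in a pool at 300 IOPS via librbd QoS (names are placeholders)
rbd config pool set volumes rbd_qos_iops_limit 300
# Override per image for volumes on the doubled tier
rbd config image set volumes/vm-1234-disk-0 rbd_qos_iops_limit 600
```

Note these limits are enforced client-side in librbd, so they cap each attachment rather than shaping traffic at the OSDs.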
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
