Re: recommendation for barebones server with 8-12 direct attach NVMe?



On 16/1/24 11:39, Anthony D'Atri wrote:
by “RBD for cloud”, do you mean VM / container general-purpose volumes on which a filesystem is usually built?  Or large archive / backup volumes that are read and written sequentially without much concern for latency or throughput?

General purpose volumes for cloud instance filesystems. Performance is not high, but requirements are a moving target, and it performs better than it used to, so decision makers and users are satisfied. If more targeted requirements start to arise, of course architecture and costs will change.

How many of those ultra-dense chassis in a cluster?  Are all 60 off a single HBA?

When we deploy prod RGW there it may be 10-20 in a cluster. Yes, there is a single HBA with 4 miniSAS ports per head node, one HBA for each chassis.

I’ve experienced RGW clusters built from 4x 90-slot ultra-dense chassis, each of which had 2x server trays, so effectively 2x 45-slot chassis bound together.  The bucket pool was EC 3,2 or 4,2.  The motherboard was … odd, of a sort a certain chassis vendor had a thing for at a certain point in time.  With only 12 DIMM slots each, they were chronically short on RAM, and the single HBA was a bottleneck.  Performance was acceptable for the use case … at first.  As the cluster filled up and got busier, that was no longer the case.  And these were 8TB capped drives; not all slots were filled, at least initially.

The index pool was on separate 1U servers with SATA SSDs.

This sounds similar to our plans, albeit with denser nodes and an NVMe index pool. Also in our favour is that the users of the cluster we are currently intending for this have established a practice of storing large objects.

There were hotspots, usually relatively small objects that clients hammered on.  A single OSD restarting and recovering would tank the API; we found it better to destroy and redeploy it.   Expanding faster than data was coming in was a challenge, as we had to throttle the heck out of the backfill to avoid rampant slow requests and API impact.
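For anyone wanting to replicate that throttling, Ceph's recovery/backfill knobs are the usual levers. A minimal sketch with `ceph config set` — the values here are illustrative starting points, not recommendations:

```shell
# Throttle backfill/recovery across all OSDs (illustrative values)
ceph config set osd osd_max_backfills 1          # concurrent backfills per OSD
ceph config set osd osd_recovery_max_active 1    # concurrent recovery ops per OSD
ceph config set osd osd_recovery_sleep_hdd 0.2   # seconds to sleep between recovery ops on HDD OSDs
```

The sleep option in particular trades longer recovery time for lower impact on client I/O, which matters most on spinners.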

QLC with a larger number of OSD node failure domains was a net win in that RAS was dramatically increased, and expensive engineer-hours weren’t soaked up fighting performance and availability issues.

Thank you, this is helpful information. We haven't had that kind of performance concern with our RGW on 24x 14TB nodes, but it remains to be seen how 60x 22TB behaves in practice. Rebalancing is a big consideration, particularly if we have a whole-node failure. We are currently contemplating a PG split, and even more IO, since the growing data volume and subsequent node additions have left us with a low PG/OSD ratio that makes it hard to rebalance.
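For reference, a PG split of the kind contemplated above comes down to inspecting the current ratio and raising `pg_num`; a sketch, where the pool name and target value are placeholders for your own:

```shell
# Check current PG counts and the per-OSD spread (PGS column)
ceph osd pool get default.rgw.buckets.data pg_num
ceph osd df
# Raise pg_num (power of two); since Nautilus, pgp_num is ramped up
# gradually behind it so data movement is spread out
ceph osd pool set default.rgw.buckets.data pg_num 4096
```

On a busy cluster it may be worth combining this with the backfill throttles, since the split itself generates a wave of data movement.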

What is QLC?

Fascinating to hear about destroy-redeploy being safer than a simple restart-recover!
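For the curious, a destroy-and-redeploy cycle like the one described above roughly looks like this — the OSD id and device here are placeholders. `ceph osd destroy` wipes the OSD's data and keys but keeps its id and CRUSH position, so the replacement backfills fresh rather than replaying a long recovery backlog:

```shell
# Destroy OSD 42 in place, preserving its id and CRUSH location (placeholders)
ceph osd destroy 42 --yes-i-really-mean-it
# Wipe the device and recreate the OSD with the same id
ceph-volume lvm zap /dev/sdx --destroy
ceph-volume lvm create --osd-id 42 --data /dev/sdx
```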


Agreed. I guess I wanted to add the data point that these kinds of clusters can and do make full sense in certain contexts, and push back a little against the "friends don't let friends use HDDs" dogma.

If spinners work for your purposes and you don’t need IOPS or the ability to provision SSDs down the road, more power to you.

I expect our road to be long, and SSD usage will grow as the capital dollars, performance and TCO metrics change over time. For now, we limit individual cloud volumes to 300 IOPS, doubled for those who need it.
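A per-volume cap like that can be applied with librbd's QoS settings; a sketch, where the pool and image names are placeholders:

```shell
# Cap all volumes in a pool at 300 IOPS via librbd QoS (names are placeholders)
rbd config pool set volumes rbd_qos_iops_limit 300
# Override per image for volumes on the doubled tier
rbd config image set volumes/vm-1234-disk-0 rbd_qos_iops_limit 600
```

Note these limits are enforced client-side in librbd, so they cap each attachment rather than shaping traffic at the OSDs.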
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
