Re: recommendation for barebones server with 8-12 direct attach NVMe?

"Robin H. Johnson" <robbat2@xxxxxxxxxx> · Mon, 15 Jan 2024 18:26:26 +0000

On Mon, Jan 15, 2024 at 03:21:11PM +0000, Drew Weaver wrote:
> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s because we don't really care about performance as this is just used as a copy point for backups/archival but the current Ceph cluster we have [Which is based on HDDs attached to Dell RAID controllers with each disk in RAID-0 and works just fine for us] is on EL7 and that is going to be EOL soon. So I thought it might be better on the new cluster to use HBAs instead of having the OSDs just be single disk RAID-0 volumes because I am pretty sure that's the least good scenario whether or not it has been working for us for like 8 years now.
> 
> So I asked on the list for recommendations and also read on the website and it really sounds like the only "right way" to run Ceph is by directly attaching disks to a motherboard. I had thought that HBAs were okay before but I am probably confusing that with ZFS/BSD or some other equally hyperspecific requirement. The other note was about how using NVMe seems to be the only right way now too.
> 
> I would've rather just stuck to SATA but I figured if I was going to have to buy all new servers that direct attach the SATA ports right off the motherboards to a backplane I may as well do it with NVMe (even though the price of the media will be a lot higher).
> 
> It would be cool if someone made NVMe drives that were cost competitive and had similar performance to hard drives (meaning, not super expensive but not lightning fast either) because the $/GB on datacenter NVMe drives like Kioxia, etc is still pretty far away from what it is for HDDs (obviously).

I think as a collective, the mailing list didn't do enough to ask about
your use case for the Ceph cluster earlier in the thread.

Now that you say it's just backups/archival, QLC might be excessive for
you (or a great fit if the backups are churned often).

USD70/TB is the best public pricing for large NVMe drives that I'm aware
of at present, and that is for 30TB QLC drives. Smaller-capacity drives do
get down to USD50/TB.
2.5" SATA spinning disk is USD20-30/TB.
All of those are much higher than the USD15-20/TB for 3.5" spinning disk
made for 24/7 operation.
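
To make those per-TB figures concrete, here is a rough Python sketch that
scales them to a cluster size. The 1000TB raw capacity is purely a
hypothetical number of mine, and this counts media only (no chassis, power,
or replication/EC overhead).

```python
# USD per TB as (low, high), taken from the figures quoted above.
PRICE_PER_TB = {
    "NVMe QLC, 30TB drives": (70, 70),
    "NVMe, smaller-capacity drives": (50, 50),
    '2.5" SATA spinning disk': (20, 30),
    '3.5" spinning disk, 24/7 rated': (15, 20),
}

def media_cost(raw_tb):
    """Return {media: (low_usd, high_usd)} for a given raw capacity in TB."""
    return {
        name: (low * raw_tb, high * raw_tb)
        for name, (low, high) in PRICE_PER_TB.items()
    }

RAW_TB = 1000  # hypothetical cluster size, not from the thread
costs = media_cost(RAW_TB)
for name, (low, high) in costs.items():
    print(f"{name}: USD {low:,} to {high:,}")
```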

Maybe it would also help, as a community, to explain the "why" behind the
perceptions of the "right way".

It's a tradeoff that depends on what you're doing: you don't want to
bottleneck/saturate critical parts of the system.

PCIe bandwidth: this goes for NVMe as well as SATA/SAS.
I won't name the vendor, but I saw a weird NVMe server with 50+ drive
slots.  Each drive slot was x4 lane width but had a number of PCIe
expanders in the path from the motherboard, so if you were trying to max
it out, simultaneously using all the drives, each drive only got
~1.7 usable PCIe 4.0 lanes.
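
A back-of-envelope sketch of that oversubscription, in Python. The drive
count of 50 is just the low end of "50+", the function names are mine, and
the uplink width is inferred backwards from the ~1.7 figure (counted in
PCIe 4.0 lanes), so treat it as an estimate rather than a spec.

```python
def implied_uplink_lanes(lanes_per_drive, drives):
    """Work backwards from the per-drive share to the shared uplink width."""
    return lanes_per_drive * drives

def effective_lanes(uplink_lanes, drives, slot_lanes=4):
    """Lanes' worth of bandwidth per drive when every drive is busy at once."""
    return min(slot_lanes, uplink_lanes / drives)

drives = 50                                   # assumption: low end of "50+"
uplink = implied_uplink_lanes(1.7, drives)    # inferred, not a vendor figure
per_drive = effective_lanes(uplink, drives)
ratio = drives * 4 / uplink                   # slot lanes vs. uplink lanes

print(f"uplink ~{uplink:.0f} lanes, {per_drive:.2f} lanes/drive, "
      f"{ratio:.2f}:1 oversubscribed")
```

The drives are only starved when they are all busy at once; a single drive
still sees its full x4 link.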

Compare that to the Supermicro servers I suggested: The AMD variants use
a H13SSF motherboard, which provides 64 PCIe 5.0 lanes (the same bandwidth
as 128 PCIe 4.0 lanes), split into 32 E3.S drive slots, and each drive
slot has PCIe 4.0 x4, no over-subscription.

On that same Supermicro system, how do you get the data out? There are
two PCIe 5.0 x16 slots for your network cards, so you only need to
saturate at most HALF the drives to saturate the network.
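
The lane arithmetic for that Supermicro example can be sketched as below
(Python). The per-lane rates are approximate one-direction figures after
encoding overhead, and the helper name is mine.

```python
# Approximate usable bandwidth per PCIe lane, one direction, in GB/sec.
GB_PER_LANE = {4: 1.969, 5: 3.938}

def link_bw(gen, lanes):
    """Approximate one-direction bandwidth of a PCIe link in GB/sec."""
    return GB_PER_LANE[gen] * lanes

host_bw = link_bw(5, 64)             # 64 PCIe 5.0 lanes for the drive bays
slot_bw = link_bw(4, 4)              # one E3.S slot at PCIe 4.0 x4
drive_bw = 32 * slot_bw              # all 32 slots busy at once
nic_bw = 2 * link_bw(5, 16)          # two PCIe 5.0 x16 network slots

oversub = drive_bw / host_bw         # 1.0 or less means no oversubscription
drives_to_fill_nics = nic_bw / slot_bw

print(f"drive side: {drive_bw:.0f} GB/sec vs host {host_bw:.0f} GB/sec")
print(f"oversubscription: {oversub:.2f}")
print(f"drives needed to fill both NIC slots: {drives_to_fill_nics:.0f}")
```

This compares slot bandwidth only. The NICs you actually fit may have a
lower line rate than the slot can carry, which is why it is "at most" half.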

Taking this back to the SATA/SAS servers: say you had a 16-port HBA with
only a PCIe 2.0 x8 link, for a theoretical max of 4GB/sec. Fill it with
Samsung QVO drives and run each one efficiently at 560MB/sec, and the
drives can collectively deliver almost 9GB/sec, more than double what the
HBA's link can carry.
=> probably worthwhile to buy a better HBA.
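
The same check for the HBA case, as a small Python sketch. The function
name is mine; PCIe 2.0 is taken as roughly 500MB/sec per lane after 8b/10b
encoding.

```python
def hba_bottleneck(ports, drive_mb_s, pcie_lanes, lane_mb_s):
    """Return (drive aggregate, host link limit, ratio), rates in MB/sec."""
    drives_total = ports * drive_mb_s
    link_limit = pcie_lanes * lane_mb_s
    return drives_total, link_limit, drives_total / link_limit

drives_total, link_limit, ratio = hba_bottleneck(
    ports=16, drive_mb_s=560, pcie_lanes=8, lane_mb_s=500)

print(f"drives: {drives_total} MB/sec, HBA link: {link_limit} MB/sec, "
      f"ratio {ratio:.2f}")
```

A ratio above 1 means the HBA's host link, not the drives, sets the
ceiling when every port is busy.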

On the HBA side, some of the controllers, in any RAID mode (including
single-disk RAID0), cannot handle saturating every port at the same
time: the controller's little onboard CPU is just doing too much work.
Those same controllers in a passthrough/IT mode are fine because that CPU
no longer sits in the data path doing RAID work.

This turned out more rambling than I intended, but how can we capture
the 'why' of the recommendations in something usable by the community
that everybody can read (esp. those who don't want to engage on a
mailing list)?

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx