Re: recommendation for barebones server with 8-12 direct attach NVMe?

> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s because we don't really care about performance

That is important context.

> as this is just used as a copy point for backups/archival but the current Ceph cluster we have [Which is based on HDDs attached to Dell RAID controllers with each disk in RAID-0 and works just fine for us]

The H330?  You can set passthrough / JBOD / HBA personality and avoid the RAID0 dance.
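
A quick way to sanity-check what the controller is presenting: a PERC in RAID personality exposes virtual disks with a DELL/PERC model string, while in passthrough / HBA / JBOD mode the kernel sees each drive’s real vendor and model. A rough sysfs-only sketch (Linux-only; the match strings are illustrative, adjust them to whatever your controller actually reports):

#!/usr/bin/env python3
# Rough sanity check: are block devices raw passthrough drives or RAID virtual disks?
# Reads sysfs only.  The "DELL"/"PERC" match strings are illustrative placeholders.
import glob
import os

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

for dev in sorted(glob.glob("/sys/block/sd*") + glob.glob("/sys/block/nvme*n*")):
    name = os.path.basename(dev)
    vendor = read(os.path.join(dev, "device", "vendor"))
    model = read(os.path.join(dev, "device", "model"))
    rotational = read(os.path.join(dev, "queue", "rotational"))
    # A PERC in RAID personality typically exposes virtual disks as vendor "DELL",
    # model "PERC ..."; in passthrough/HBA/JBOD mode you see the drive's own
    # vendor/model (e.g. "ATA" plus the drive family) instead.
    kind = "RAID virtual disk?" if "PERC" in model.upper() or "DELL" in vendor.upper() else "raw drive"
    print(f"{name}: vendor={vendor!r} model={model!r} rotational={rotational} -> {kind}")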

> is on EL7 and that is going to be EOL soon. So I thought it might be better on the new cluster to use HBAs instead of having the OSDs just be single disk RAID-0 volumes because I am pretty sure that's the least good scenario whether or not it has been working for us for like 8 years now.

See above.

> So I asked on the list for recommendations and also read on the website and it really sounds like the only "right way" to run Ceph is by directly attaching disks to a motherboard

That isn’t quite what I meant.

If one is speccing out *new* hardware:

* HDDs are a false economy
* SATA / SAS SSDs hobble performance for little or no cost savings over NVMe
* RAID HBAs are fussy and a waste of money in 2023


>  I had thought that HBAs were okay before

By HBA I suspect you mean a non-RAID HBA?

> but I am probably confusing that with ZFS/BSD or some other equally hyperspecific requirement.

ZFS indeed prefers as little as possible between itself and the drives.  Ceph’s reasons aren’t identical, but they’re largely congruent.

> The other note was about how using NVMe seems to be the only right way now too.

If we accept the premise that HDDs are a dead end, that leaves us with SAS/SATA SSDs vs NVMe SSDs.

SAS is all but dead, and carries a price penalty.
SATA SSDs are steadily declining in the market.  5-10 years from now I suspect that no more than one manufacturer of enterprise-class SATA SSDs will remain.  The future is PCIe.  SATA SSDs don’t save any money over NVMe SSDs, and additionally require some sort of HBA, be it an add-in card or on the motherboard.  SATA and NVMe SSDs use the same NAND, just with a different interface.
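
If you want to see the difference on a live box: an NVMe namespace resolves to a PCIe function bound to the nvme driver, while a SATA/SAS drive sits behind an AHCI controller or HBA driver (ahci, mpt3sas, and so on). A rough sysfs walk, purely illustrative:

#!/usr/bin/env python3
# Rough sketch: which driver(s) does each block device sit behind?
# NVMe namespaces should resolve to the `nvme` driver on a PCIe function directly;
# SATA/SAS drives show `sd` plus an AHCI/HBA/RAID driver (ahci, mpt3sas, megaraid_sas, ...).
# Linux-only, sysfs-only, no vendor tools assumed.
import glob
import os

for blk in sorted(glob.glob("/sys/block/*")):
    name = os.path.basename(blk)
    if name.startswith(("loop", "dm-", "md", "ram", "zram")):
        continue
    drivers = []
    # Resolve the device path and collect every bound driver on the way up.
    path = os.path.realpath(os.path.join(blk, "device"))
    while path.startswith("/sys/devices/"):
        link = os.path.join(path, "driver")
        if os.path.islink(link):
            drivers.append(os.path.basename(os.readlink(link)))
        path = os.path.dirname(path)
    print(f"{name}: {' <- '.join(drivers) or 'unknown'}")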


> I would've rather just stuck to SATA but I figured if I was going to have to buy all new servers that direct attach the SATA ports right off the motherboards to a backplane

On-board SATA chips may be relatively weak but I don’t know much about current implementations.

> I may as well do it with NVMe (even though the price of the media will be a lot higher).

NVMe SSDs shouldn’t cost significantly more than SATA SSDs.  Hint:  certain tier-one chassis manufacturers mark both the fsck up.  You can get a better warranty and pricing by buying drives from a VAR.

> It would be cool if someone made NVMe drives that were cost competitive and had similar performance to hard drives (meaning, not super expensive but not lightning fast either) because the $/GB on datacenter NVMe drives like Kioxia, etc is still pretty far away from what it is for HDDs (obviously).

It’s a trap!  Which is to say that the $/GB really isn’t that far away, and once you step back from the unit economics of the drive in isolation to TCO, HDDs often turn out to be *more* expensive.

Pore through this:  https://www.snia.org/forums/cmsi/programs/TCOcalc

* $/IOPS is higher for any HDD than for NAND
* HDDs are available up to what, 22TB these days?  With the same tired SATA interface as when they were 2TB.  That’s rather a bottleneck.  We see HDD clusters limiting themselves to 8-10TB HDDs all the time; in fact AIUI RHCS stipulates no larger than 10TB.  Feed that into the equation and the TCO changes a bunch.
* HDDs don’t just hobble steady-state performance; under duress (expansion, component failure, etc.) the impact on client operations is higher and recovery to desired redundancy takes much longer.  I’ve seen a cluster (especially when using EC) take *4 weeks* to weight an 8TB HDD OSD up or down.  Consider the operational cost and risk of that.  The SNIA calc has a performance multiplier that accounts for this.
* A SATA chassis is stuck with SATA; 5-10 years from now that will be increasingly limiting, especially if you go with LFF drives
* RUs cost money.  A 1U LFF server can hold what, at most 88TB raw when using HDDs?  With 60TB SSDs (*) one can fit 600TB of raw space into the same RU (rough arithmetic sketched below).

* If they meet your needs
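
To make the arithmetic concrete, here is a toy version of the comparison the SNIA calculator formalizes. Every number below (drive prices, capacities, drives per RU, RU cost, the performance multiplier, the replication overhead) is a placeholder rather than a quote; plug in your own figures, since the point is only the shape of the calculation:

#!/usr/bin/env python3
# Toy 5-year cost-per-usable-TB comparison in the spirit of the SNIA TCO calculator.
# Every input below is a placeholder, not a real quote: substitute your own drive
# pricing, rack/power costs, replication or EC overhead, and performance multiplier.

def cost_per_usable_tb(drive_tb, drive_price, drives_per_ru,
                       cost_per_ru_5yr, perf_multiplier, usable_fraction):
    # perf_multiplier > 1 models having to overbuy HDD capacity to hit the same
    # performance/recovery targets; usable_fraction models replication/EC overhead.
    raw_tb = perf_multiplier / usable_fraction          # raw TB bought per usable TB
    drives = raw_tb / drive_tb
    rack_units = drives / drives_per_ru
    return drives * drive_price + rack_units * cost_per_ru_5yr

# Placeholder numbers only; 3x replication assumed for both.
hdd = cost_per_usable_tb(drive_tb=10, drive_price=250, drives_per_ru=4,
                         cost_per_ru_5yr=1000, perf_multiplier=3.0,
                         usable_fraction=1 / 3)
nvme = cost_per_usable_tb(drive_tb=15.36, drive_price=1400, drives_per_ru=10,
                          cost_per_ru_5yr=1000, perf_multiplier=1.0,
                          usable_fraction=1 / 3)

print(f"HDD : ~${hdd:,.0f} per usable TB over 5 years")
print(f"NVMe: ~${nvme:,.0f} per usable TB over 5 years")
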
> 
> Anyway thanks.
> -Drew
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Robin H. Johnson <robbat2@xxxxxxxxxx> 
> Sent: Sunday, January 14, 2024 5:00 PM
> To: ceph-users@xxxxxxx
> Subject:  Re: recommendation for barebones server with 8-12 direct attach NVMe?
> 
> On Fri, Jan 12, 2024 at 02:32:12PM +0000, Drew Weaver wrote:
>> Hello,
>> 
>> So we were going to replace a Ceph cluster with some hardware we had 
>> laying around using SATA HBAs but I was told that the only right way 
>> to build Ceph in 2023 is with direct attach NVMe.
>> 
>> Does anyone have any recommendation for a 1U barebones server (we just 
>> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct 
>> attached to the motherboard without a bridge or HBA for Ceph 
>> specifically?
> If you're buying new, Supermicro would be my first choice for vendor based on experience.
> https://www.supermicro.com/en/products/nvme
> 
> You said 2.5" bays, which makes me think you have existing drives.
> There are models to fit that, but if you're also considering new drives, you can get further density in E1/E3
> 
> The only caveat is that you will absolutely want to put a better NIC in these systems, because 2x10G is easy to saturate with a pile of NVME.
> 
> --
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
> E-Mail   : robbat2@xxxxxxxxxx
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx