Re: Building new cluster had a couple of questions

Okay, so NVMe is the only path forward?

 

I was simply going to replace the PERC H750s with some HBA350s, but if that won't work I'll just wait until I have a pile of NVMe servers that we aren't using in a few years, I guess.

 

Thanks,

-Drew

 

 

 

From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: Friday, December 22, 2023 12:33 PM
To: Drew Weaver <drew.weaver@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: Building new cluster had a couple of questions

 

 



Sorry, I thought of one more thing.

I was actually re-reading the hardware recommendations for Ceph, and they seem to imply that both RAID controllers and HBAs are bad ideas.

 

Most likely advice I added ;)   "RAID controllers" *are* a subset of HBAs, BTW.  The nomenclature can be confusing, and there's this guy on Reddit ....



I remember reading that RAID controllers are suboptimal, but I guess I don't understand how you would actually build a cluster with many disks (12-14 per server) without any HBAs in the servers.

 

NVMe

 

Are there certain HBAs that are worse than others? Sorry, I am just confused.

 

For liability and professionalism I won't name names; especially in serverland, there's one manufacturer who dominates.

 

There are three main points, informed by years of wrangling the things.  I posted a litany of my experiences to this very list a few years back, including data-losing firmware / utility bugs and operationally expensive ECOs.

 

* RoC HBAs aka "RAID controllers" are IMHO a throwback to the days when x86 / x64 servers didn't have good software RAID.  In the land of the Sun we had VxVM (at $$$) that worked well, and Sun's SDS/ODS that ... got better over time.  I dunno if the Microsoft world has bootable software RAID now or not.  They are in my experience flaky and a pain to monitor.  Granted they offer the potential for a system to boot without intervention if the first boot drive is horqued, but IMHO that doesn't happen nearly often enough to justify the hassle.

 

* These things can have significant cost, especially if one shells out for cache RAM, BBU/FBWC, etc.  Today I have a handful of legacy systems that were purchased with a tri-mode HBA that in 2018 had a list price of USD 2000.  The *only* thing it's doing is mirroring two SATA boot drives.  That money would be better spent on SSDs, either with a non-RAID (aka JBOD) HBA or, better, NVMe.

 

* RAID HBAs confound observability.  Many models today have a JBOD / passthrough mode -- in which case, why pay for all the RAID-fu?  Some, bizarrely, still don't, and one must set up a single-drive RAID0 volume around every drive for the system to see it.  This makes iostat even less useful than it already is, and one has to jump through hoops to get SMART info -- hoops that, for example, the very promising upstream smartctl_exporter doesn't know how to jump through.
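
To make the hoop-jumping concrete, here's a minimal Python sketch of the difference; the device paths and the megaraid drive index are hypothetical, and it assumes smartmontools 7+ for JSON output:

# Minimal sketch: pulling SMART data with and without a RAID HBA in the way.
# Device paths and the megaraid index are hypothetical examples.
import subprocess

def smart_info(args):
    """Run smartctl with JSON output and return whatever it prints."""
    cmd = ["smartctl", "--json", "--all"] + args
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout or result.stderr

# Plain NVMe: the OS talks to the device directly, nothing to translate.
print(smart_info(["/dev/nvme0"]))

# Same query for a drive hidden behind a single-drive RAID0 volume on a
# MegaRAID-style HBA: you must name the controller's block device *and*
# the physical drive index behind it.
print(smart_info(["-d", "megaraid,0", "/dev/sda"]))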

 

There's a certain current M.2 boot drive module like this: the OS cannot see the drives AT ALL unless they're wrapped in a virtual drive.  Like Chong said, there's too much recession.

 

When using SATA or SAS, you can get a non-RAID HBA for much less money than a RAID HBA.  But the nuance here is that unless you have pre-existing gear, SATA and especially SAS *do not save money*.  This is heterodox to conventional wisdom.

 

An NVMe-only chassis does not need an HBA of any kind.  NVMe *is* PCIe.  It especially doesn't need an astronomically expensive NVMe-capable RAID controller, at least not for uses like Ceph.  If one has an unusual use-case that absolutely requires a single volume and LVM doesn't cut it for some reason -- maybe.  And there are things like Phison and ScaleFlux that are out of scope; we're talking about Ceph here.
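
If you genuinely do need one big volume across a few NVMe devices, plain LVM handles it with no RAID silicon at all.  A minimal sketch, with hypothetical device paths and made-up names (vg_fast / lv_big) -- not something you'd normally do for Ceph OSDs:

# Minimal sketch: one striped logical volume across two NVMe namespaces via LVM.
import subprocess

devices = ["/dev/nvme0n1", "/dev/nvme1n1"]   # hypothetical namespaces

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for dev in devices:
    run(["pvcreate", dev])                    # mark each namespace as an LVM PV
run(["vgcreate", "vg_fast"] + devices)        # pool them into one volume group
run(["lvcreate", "--type", "striped",         # one striped LV spanning both
     "--stripes", str(len(devices)),
     "--extents", "100%FREE",
     "--name", "lv_big", "vg_fast"])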

 

Some chassis vendors try hard to stuff an RoC HBA down your throat, with rather high markups.  Others may offer a basic SATA HBA built into the motherboard if you need it for some reason.

 

So when you don't have to spend USD hundreds to a thousand on an RoC HBA + BBU/cache/FBWC and jump through hoops to have one more thing to monitor, and an NVMe SSD doesn't cost significantly more than a SATA SSD, an all-NVMe system can easily be *less* expensive than SATA or especially SAS.  SAS is very, very much in its last days in the marketplace; SATA is right behind it.  In 5-10 years you'll be hard-pressed to find enterprise SAS/SATA SSDs, and if you can, they might only be available from a single manufacturer -- which is an Akbarian trap.
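
Back-of-the-envelope, with loudly hypothetical prices just to show the shape of the math -- plug in your own quotes:

# Per-server media cost, very roughly.  Every price below is a made-up
# placeholder purely to illustrate the comparison; substitute real quotes.
drives_per_server = 12

sata_ssd_price = 900     # hypothetical enterprise SATA SSD
nvme_ssd_price = 950     # hypothetical comparable enterprise NVMe SSD
roc_hba_price  = 800     # hypothetical RoC HBA + BBU/FBWC bundle
nvme_hba_price = 0       # NVMe is PCIe; there is no HBA to buy

sata_total = drives_per_server * sata_ssd_price + roc_hba_price
nvme_total = drives_per_server * nvme_ssd_price + nvme_hba_price

print(f"SATA SSDs + RAID HBA per server: ${sata_total}")   # $11600
print(f"NVMe SSDs, no HBA, per server:   ${nvme_total}")   # $11400
# Even before counting the monitoring hassle and one more failure domain,
# the "cheap" SATA build isn't cheaper with numbers in this ballpark.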

 

This calculator can help show the false economy of SATA / SAS, and especially of HDDs.  Yes, in the enterprise, HDDs are *not* less expensive unless you're a slave to $chassisvendor.

 

 

 

You read that right.

 

Don't plug in the cost of a 22TB SMR SATA drive; it likely won't be usable in real life.  It's not uncommon to limit spinners to, say, 8TB just because of the interface and seek bottlenecks.  The above tool has a multiplier for how many additional spindles one has to provision to get semi-acceptable IOPS, along with RUs, power, AFR, cost to repair, etc.
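
To put a rough number on that multiplier -- the ~150 random IOPS per 7.2K spinner is a common rule of thumb, and the capacity and IOPS targets below are made up:

# Rough spindle math: how many HDDs you buy for IOPS vs. for capacity.
import math

usable_capacity_tb   = 500      # hypothetical capacity the pool must provide
hdd_size_tb          = 8        # capped at 8TB per the point above
hdd_random_iops      = 150      # rule-of-thumb random IOPS for a 7.2K spinner
workload_iops_target = 20000    # hypothetical IOPS the pool must sustain

spindles_for_capacity = math.ceil(usable_capacity_tb / hdd_size_tb)
spindles_for_iops     = math.ceil(workload_iops_target / hdd_random_iops)

print(f"Spindles for capacity alone:  {spindles_for_capacity}")
print(f"Spindles to hit the IOPS target: {spindles_for_iops}")
print(f"Effective multiplier: {spindles_for_iops / spindles_for_capacity:.1f}x")
# 63 spindles buy the space, but ~134 are needed for the IOPS -- roughly a 2x
# multiplier before even counting replication, RUs, power, and AFR.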

 

At scale, consider how many chassis you need when you can stuff 32x 60TB SSDs into one, vs 12x 8TB HDDs.  Consider also the risks when it takes a couple of weeks to migrate data onto a replacement spinner, or if you can't do maintenance because your cluster becomes unusable to users.
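
And the chassis-count version of the same arithmetic, for a made-up 10 PB raw target:

# Chassis needed for a hypothetical 10 PB raw target at the densities above.
import math

raw_target_tb    = 10_000        # hypothetical raw capacity goal
nvme_per_chassis = 32 * 60       # 32x 60TB SSDs -> 1920 TB per chassis
hdd_per_chassis  = 12 * 8        # 12x 8TB HDDs  ->   96 TB per chassis

print("NVMe chassis needed:", math.ceil(raw_target_tb / nvme_per_chassis))  # 6
print("HDD chassis needed: ", math.ceil(raw_target_tb / hdd_per_chassis))   # 105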

 

 

-- aad

 




Thanks,
-Drew

-----Original Message-----
From: Drew Weaver <drew.weaver@xxxxxxxxxx>
Sent: Thursday, December 21, 2023 8:51 AM
To: 'ceph-users@xxxxxxx' <ceph-users@xxxxxxx>
Subject: Building new cluster had a couple of questions

Howdy,

I am going to be replacing an old cluster pretty soon and I am looking for a few suggestions.

#1 cephadm or ceph-ansible for management?
#2 Since the whole... CentOS thing... what distro appears to be the most straightforward to use with Ceph?  I was going to try and deploy it on Rocky 9.

That is all I have.

Thanks,
-Drew

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
