Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore


 



> Do you have any data on the reliability of QLC NVMe drives?

They were my job for a year, so yes, I do.  The published specs are accurate.  A QLC drive built from the same NAND as a TLC drive will have more capacity but less endurance.  Depending on the model, you may wish to enable `bluestore_use_optimal_io_size_for_min_alloc_size` when creating your OSDs.  The Intel / Solidigm P5316, for example, has a 64KB IU size, so performance and endurance both benefit from aligning the OSD `min_alloc_size` to that value.  Note that this is baked in at OSD creation; you cannot change it on an existing OSD after the fact, but you can redeploy the OSD and let it recover.
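
Roughly how I'd wire that up -- a sketch only; the option in the first line is the one mentioned above, but the explicit min_alloc_size override, the host mask, and the metadata check are from memory, so verify them against your release before leaning on them:

    # Before creating the OSDs on the QLC drives -- let BlueStore pick up the
    # drive's reported optimal I/O size as min_alloc_size at mkfs time:
    ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true
    # Or pin it explicitly for a 64KB-IU drive; a host (or device-class) mask
    # keeps the 64KB value off your other SSD OSDs.  "qlc-node-01" is a placeholder:
    ceph config set osd/host:qlc-node-01 bluestore_min_alloc_size_ssd 65536
    # After deploying, confirm what the OSD was actually created with --
    # recent releases report it in the OSD metadata (osd id 12 is a placeholder):
    ceph osd metadata 12 | grep min_alloc

Whichever way you set it, the value only takes effect for OSDs created after the setting is in place, which is why the redeploy-and-recover dance is needed for existing ones.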

Other SKUs have 8KB or 16KB IU sizes, and some have 4KB, which needs no special min_alloc_size.  Note that QLC is a good fit for workloads where writes are infrequent, mostly sequential, and reasonably large on average.  I know of successful QLC RGW clusters that see 0.01 DWPD -- yes, that decimal point is in the correct place (on a 61.44 TB drive that is on the order of 600 GB of writes per day).  Millions of 1KB files overwritten once an hour aren't a good workload for QLC.

Backups, archives, even something like an OpenStack Glance pool are good fits.  I'm about to trial QLC as Prometheus long-term storage as well.  Read-mostly workloads are good fits, as the read performance is in the ballpark of TLC.  Write performance is still going to be way better than any HDD, and you aren't stuck with legacy SATA slots.  You also don't have to buy or manage a fussy HBA.

> How old is your deep archive cluster, how many NVMes it has, and how many did you
> have to replace?

I don't personally have one at the moment.

Even with TLC, endurance is, dare I say, overrated.  99% of enterprise SSDs never burn more than 15% of their rated endurance.  SSDs from at least some manufacturers have a timed workload feature in firmware that will estimate drive lifetime when presented with a real-world workload -- this is based on observed PE cycles.

Pretty much any SSD will report lifetime used or remaining, so whether TLC, QLC, or even MLC or SLC, you should collect those metrics in your time-series DB and watch both for drives nearing EOL and for their burn rates.
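
If you aren't already pulling those numbers, a rough sketch (attribute names vary by vendor and firmware, so treat the grep patterns as a starting point rather than gospel):

    # NVMe: percentage_used is the drive's own estimate of endurance consumed
    nvme smart-log /dev/nvme0 | grep percentage_used
    # SATA/SAS SSDs: the wear/life attribute goes by different names per vendor
    smartctl -A /dev/sda | grep -i -E 'wear|percent.*(used|remain)'

Something like smartctl_exporter (or whatever your metrics stack prefers) will land these in Prometheus, so you can alert on absolute wear as well as on the slope.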

> 
> On Sun, Apr 21, 2024 at 11:06 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>> 
>> A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB in size; 32 of those in one RU makes for a cluster that doesn’t take up the whole DC.
>> 
>>> On Apr 21, 2024, at 5:42 AM, Darren Soothill <darren.soothill@xxxxxxxx> wrote:
>>> 
>>> Hi Niklaus,
>>> 
>>> Lots of questions here, but let me try to get through some of them.
>>> 
>>> Personally, unless a cluster is for deep archive, I would never suggest configuring or deploying a cluster without RocksDB and the WAL on NVMe.
>>> There are a number of benefits to this in terms of performance and recovery. Small writes go to the NVMe first before being written to the HDD, and it makes many recovery operations far more efficient.
>>> 
>>> As to how much faster it makes things, that very much depends on the type of workload you have on the system. Lots of small writes will see a significant difference; very large writes, not as much.
>>> Things like compactions of the RocksDB database are a lot faster, as they are now running from NVMe and not from the HDD.
>>> 
>>> We normally work with up to a 1:12 ratio, so one NVMe for every 12 HDDs. This assumes the NVMe drives being used are good mixed-use enterprise NVMe drives with power-loss protection.
>>> 
>>> As to failures: yes, a failure of the NVMe would mean the loss of 12 OSDs, but this is no worse than the failure of an entire node, which is something Ceph is designed to handle.
>>> 
>>> I certainly wouldn’t be thinking about putting the NVMe drives into RAID sets, as that will degrade their performance when the whole point is to get better performance.
>>> 
>>> 
>>> 
>>> Darren Soothill
>>> 
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>>> 
>>> 
>>> 
>>> 
> 
> 
> 
> -- 
> Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



