Re: Question if WAL/block.db partition will benefit us

>> it in the documentation.
>> This sounds like a horrible SPoF. How can you recover from it? Purge the
>> OSD, wipe the disk and readd it?
>> All flash cluster is sadly not an option for our s3, as it is just too
>> large and we just bought around 60x 8TB Disks (in the last couple of
>> months).
> 
> 30 TB QLC is available. You might want to watch presentation from Anthony about it [1]. He is biased of course ;-), but you might want to do the math yourself.
> 
> Gr. Stefan
> 
> [1]: https://www.youtube.com/watch?v=wo5Ts30Gz_o

Thanks, Stefan!

In my defense, I’d been saying the same things long before my current employment ;)  I enjoyed my first all-flash cluster in early 2017: it was *transformative*.

>> This sounds like a horrible SPoF. How can you recover from it? Purge the
>> OSD, wipe the disk and readd it?

`ceph osd destroy` can help reduce data movement during replacement operations.  SPoF is relative; this is why we replicate data.  Dan v.d. Ster noted years ago that any sufficiently large cluster must be prepared to find itself in backfill/recovery most of the time.  Various CRUSH and code improvements since then have reduced the data movement caused by failures and topology changes, but his point is still well-taken.
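For the archives, the in-place replacement flow can look roughly like this; a minimal sketch, wrapped in Python only to keep one snippet style for this mail, with a hypothetical OSD id and device (adjust for your deployment and read the docs before running anything):

```python
# Minimal sketch: replace a failed OSD in place, reusing its id.
# osd_id and device are hypothetical -- substitute your own.
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

osd_id, device = "17", "/dev/sdq"

# Mark the OSD destroyed (keeps its id and CRUSH entry reserved):
run("ceph", "osd", "destroy", osd_id, "--yes-i-really-mean-it")
# Wipe the replacement (or surviving) device:
run("ceph-volume", "lvm", "zap", device, "--destroy")
# Recreate the OSD under the same id:
run("ceph-volume", "lvm", "create", "--data", device, "--osd-id", osd_id)
```

Because the id and CRUSH weight are reused, only the PGs that lived on the failed device are backfilled, instead of the broader reshuffle a purge-and-readd would trigger.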

Boris writes of using 8TB drives.  HDD TCO is often calculated in terms of the largest HDD available at any given time, which today is 18-20TB.  I suspect that physical latencies, and especially the SATA IOPS/TB and throughput/TB bottlenecks, are why he’s provisioning drives that are less than half the largest available today (though there may be other reasons, including fleet uniformity).  This is one of the subtle factors that must be included in an expansive TCO calculation.
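To put the per-TB bottleneck in numbers, a quick back-of-the-envelope; the per-spindle figures below are assumed round numbers for a 7.2K SATA drive, not anyone’s actual datasheet:

```python
# Per-TB performance of small vs. large HDDs, using assumed round numbers:
# random IOPS and sequential throughput per spindle barely change with capacity.
SPINDLE_IOPS = 200     # ~random 4K IOPS per 7.2K RPM spindle (assumed)
SPINDLE_MBPS = 250     # ~outer-track sequential MB/s per spindle (assumed)

for capacity_tb in (8, 20):
    print(f"{capacity_tb:>2} TB HDD: "
          f"{SPINDLE_IOPS / capacity_tb:5.1f} IOPS/TB, "
          f"{SPINDLE_MBPS / capacity_tb:5.1f} MB/s per TB")
# ->  8 TB HDD:  25.0 IOPS/TB,  31.2 MB/s per TB
# -> 20 TB HDD:  10.0 IOPS/TB,  12.5 MB/s per TB
```

Same spindle, 2.5x less performance per stored TB: exactly the kind of thing a pure $/TB comparison hides.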

When we consider SSDs for Ceph, the first criterion we look for is often durability.  A common rule of thumb demands a minimum of 1 DWPD for OSD duty; I believed that myself until a couple of years ago.  Browsing through data captured into Prometheus, I found that SSDs that had served a multitenant block storage service for 3 years had mostly consumed less than 10% of their rated lifetime writes (often measured in terms of PE cycles or TBW).  Fudging a bit for ramp-up, and acknowledging that individual drives will vary somewhat, I would confidently expect those nominal 1.x DWPD drives to last easily for 10 years, which is double common depreciation schedules and server refresh lifetimes.
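The arithmetic behind that is nothing fancy; a sketch that simply extrapolates the ~10%-in-3-years observation above in a straight line:

```python
# Straight-line extrapolation of observed media wear
# (inputs taken from the fleet observation above).
observed_years = 3
fraction_of_endurance_used = 0.10   # of rated lifetime writes (PE cycles / TBW)

projected_life = observed_years / fraction_of_endurance_used
print(f"Projected media life at the observed write rate: {projected_life:.0f} years")
# -> 30 years; even after a hefty haircut for ramp-up and per-drive
#    variance, a 10-year service life looks comfortable.
```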

I’ve spoken with someone who believes that OSDs need as much as 3 DWPD because of the balancer (a noise-floor of writes) and scrubs (which are all reads).  And when you look at HDD durability specs, they can actually be *less* forgiving than modern SSDs (even QLC) once you account for the AFR growth vendors project beyond the rated annual workload.
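It helps to express HDD workload ratings in the same DWPD-like units; a sketch assuming the 550 TB/year rating that is typical for enterprise nearline drives (check your own datasheets):

```python
# Convert an assumed nearline HDD workload rating into DWPD-like terms.
RATED_TB_PER_YEAR = 550   # typical enterprise nearline rating (assumed)

for capacity_tb in (8, 20):
    dwpd_equiv = RATED_TB_PER_YEAR / 365 / capacity_tb
    print(f"{capacity_tb:>2} TB HDD: ~{dwpd_equiv:.2f} 'DWPD' of rated workload")
# ->  8 TB HDD: ~0.19 'DWPD'
# -> 20 TB HDD: ~0.08 'DWPD'
# Note: that budget covers reads *and* writes, while an SSD's DWPD
# rating counts writes only.
```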

S3 / object storage can be a terrific application for QLC today.  These workloads are typically read-intensive; we might model them as 70/30 or even 90/10 read/write.  I’ve recently seen a commercial RGW / S3 deployment that empirically reports 0.01 DWPD.
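In absolute terms, 0.01 DWPD on a dense QLC drive is a trickle; a quick sketch using the 30 TB class mentioned earlier (compare the result against your own vendor’s TBW/DWPD rating):

```python
# What 0.01 DWPD means in absolute write volume for a dense QLC OSD.
capacity_tb = 30
dwpd = 0.01

tb_per_day = capacity_tb * dwpd
pb_over_5yr = tb_per_day * 365 * 5 / 1000
print(f"{tb_per_day * 1000:.0f} GB/day, ~{pb_over_5yr:.2f} PB over 5 years")
# -> 300 GB/day, ~0.55 PB over 5 years -- compare that with the rated
#    endurance on your QLC vendor's datasheet.
```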

As Stefan writes, I invite everyone to do the math.  Factor in not only drive unit cost per TB, but also constraints like the HDD/SATA capacity ceiling, power and cooling, what RUs and racks cost in your DC, the engineer/tech time spent dealing with failures and prolonged maintenance, and the user experience.  And, subtly, when using dense chassis, e.g. a 45-90 bay 4U toploader, you may find that you can only fill each rack halfway due to PDU capacity and even weight limits.  Don’t get me started on the TCO impact of RAID HBAs ;)
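If it helps anyone get started on that math, here’s the skeleton of such a comparison; every number in it is a placeholder to be replaced with your own quotes, chassis, EC/replication overhead, power figures, and DC rates:

```python
# Skeleton TCO-style comparison: drive count, rack units and drive power
# for a target usable capacity. Every figure below is a placeholder.
TARGET_USABLE_PB = 10
OVERHEAD = 1.5     # raw = usable * overhead, e.g. EC 4+2

options = {
    # name: (capacity_tb, watts_per_drive, drives_per_RU) -- all assumed
    "8 TB SATA HDD":  (8,  9.0, 22.5),   # e.g. 90-bay 4U toploader
    "20 TB SATA HDD": (20, 9.0, 22.5),
    "30 TB QLC NVMe": (30, 20.0, 12.0),  # e.g. 24 x U.2 in 2U
}

raw_tb = TARGET_USABLE_PB * 1000 * OVERHEAD
for name, (cap_tb, watts, per_ru) in options.items():
    drives = raw_tb / cap_tb
    print(f"{name:>15}: {drives:6.0f} drives, {drives / per_ru:5.0f} RU, "
          f"{drives * watts / 1000:5.1f} kW (drives only)")
```

Then layer on the rack-fill, PDU, weight, and HBA considerations above, plus the failure-handling labor; the per-drive sticker price is only the first line of that spreadsheet.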

— aad




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



