Re: Question if WAL/block.db partition will benefit us

On 11/11/21 1:09 PM, Anthony D'Atri wrote:
it in the documentation.
This sounds like a horrible SPoF. How can you recover from it? Purge the
OSD, wipe the disk and re-add it?
An all-flash cluster is sadly not an option for our S3, as it is just too
large and we just bought around 60x 8TB disks (in the last couple of
months).
30 TB QLC is available. You might want to watch the presentation from Anthony about it [1]. He is biased of course ;-), but you can do the math yourself.

Gr. Stefan

[1]: https://www.youtube.com/watch?v=wo5Ts30Gz_o
_______________________________________________
Bedankt, Stefan!

In my defense I’d been saying the same things long before my current employment ;)  I enjoyed my first all-flash cluster in early 2017: it was *transformative*.

This sounds like a horrible SPoF. How can you recover from it? Purge the
OSD, wipe the disk and re-add it?
`ceph osd destroy` can help reduce data movement during replacement operations.  SPoF is relative - this is why we replicate data.  Dan v.d. Ster noted years ago that any sufficiently large cluster must be prepared to find itself in backfill/recovery most of the time.  Various CRUSH and code improvements since then have optimized data movement as a result of failures and topology changes, but his point is still well-taken.
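
For anyone finding this in the archives, here is a minimal sketch of that destroy-and-recreate flow; osd.42 and /dev/sdX are placeholders, and the exact ceph-volume invocation varies a bit by release:

    # Mark the dead OSD destroyed: its id and CRUSH position are kept,
    # so backfill only has to refill the replacement drive rather than
    # reshuffle data across the whole tree.
    ceph osd destroy 42 --yes-i-really-mean-it

    # Wipe the replacement disk and recreate the OSD under the same id.
    ceph-volume lvm zap /dev/sdX --destroy
    ceph-volume lvm create --osd-id 42 --data /dev/sdX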

Boris writes of using 8TB drives.  HDD TCO is often calculated in terms of the largest HDD available at any given time, which today is 18-20TB.  I suspect that physical latencies, and especially the SATA IOPS/TB and throughput/TB bottlenecks, are why he’s provisioning drives less than half the largest available today (though there may be other reasons, including uniformity).  This is one of the subtle factors that must be included in an expansive TCO calculation.
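
To put rough numbers on the IOPS/TB point (assuming a nominal ~80 random IOPS per 7.2K spindle, a rule of thumb rather than any particular spec):

    # The spindle doesn't get faster as the platters get denser,
    # so IOPS per TB shrinks as HDD capacity grows.
    for tb in 8 18 20; do
        printf "%2s TB: %s IOPS/TB\n" "$tb" "$(echo "scale=1; 80 / $tb" | bc)"
    done
    #  8 TB: 10.0 IOPS/TB
    # 18 TB: 4.4 IOPS/TB
    # 20 TB: 4.0 IOPS/TB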

When we consider SSDs for Ceph, the first criterion we look for is often durability.  One often stipulates a minimum of 1 DWPD for OSD duty; I believed that myself until a couple of years ago.  Browsing through data captured into Prometheus, I found that SSDs that had been in multitenant block storage service for 3 years had mostly consumed less than 10% of their lifetime writes, which are often measured in terms of PE cycles.  Fudging a bit for ramp-up, and acknowledging that individual drives will vary somewhat, I would confidently believe that those nominal 1.x DWPD drives could be expected to last easily for 10 years, which is double common depreciation schedules and server refresh lifetimes.
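
The extrapolation is simple enough to check on a napkin, using the round numbers above (3 years of service, 10% of rated endurance consumed):

    # Linear extrapolation of observed wear to end of rated life.
    echo "scale=0; 3 / 0.10" | bc
    # => 30 years at the observed write rate; even fudged heavily
    #    downward, a 10-year service life is comfortable.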

I’ve spoken with someone who believes that OSDs need as much as 3 DWPD because of the balancer (a noise floor in terms of writes) and scrubs (all reads).  And when you look at HDD durability specs, they can actually be *lower* than those of modern SSDs (even QLC) when one considers AFR growth beyond the rated TBW.

S3 / object storage can be a terrific application for QLC today.  Typically these workloads are read-intensive; we might model them as 70/30 or even 90/10 read/write.  I’ve recently seen a commercial RGW / S3 deployment that empirically reports 0.01 DWPD.
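
For scale, 0.01 DWPD against the 30 TB QLC capacity point Stefan mentions works out as below; the 0.3 DWPD rating in the comment is a hypothetical comparison figure, not a quoted spec:

    # Daily writes implied by 0.01 DWPD on a 30 TB drive:
    echo "30 * 1000 * 0.01" | bc    # ~300 GB written per day
    # Even against a 0.3 DWPD QLC rating, that's 30x endurance headroom.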

It's absolutely important to think about the use case.  For most RGW cases I generally agree with you.  For something like HPC scratch storage you might have the opposite situation, where even 3 DWPD might be at the edge of what's tolerable.  Many years ago I worked for a supercomputing institute where a vendor had chosen less expensive, low-endurance SSDs for filesystem journals (and maybe cache?  It's been a while).  We started seeing them fail after about 6 months of constant use.  I don't remember whether we had actually blown past the endurance specs, but we were pretty convinced at the time that the heavy write workload was the most likely cause of the quick deaths.  Unfortunately the SSDs in those machines weren't in hot-swap bays, so replacement was painful.



As Stefan writes, I invite everyone to do the math.  Factor in not only drive unit cost/TB, but also constraints like the HDD/SATA size cap, power/cooling, what RUs and racks cost in your DC, engineer/tech time spent dealing with failures and prolonged maintenance, and user experience.  And, subtly, when using dense chassis, e.g. a 45-90 HDD 4U toploader, you may find that you can only fill each rack halfway due to PDU capacity and even weight limits.  Don’t get me started on the TCO impact of RAID HBAs ;)
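
As a skeleton for that math, something like the following; every figure is a placeholder to be replaced with your own quotes and DC rates, and the point is the shape of the calculation rather than its output:

    awk 'BEGIN {
        # --- placeholders: substitute your own numbers ---
        hdd_tb = 18; hdd_usd_tb = 15; hdd_per_rack = 180   # PDU/weight-capped
        qlc_tb = 30; qlc_usd_tb = 40; qlc_per_rack = 720
        rack_usd_5y = 60000            # space + power + cooling, 5 years
        hdd = hdd_usd_tb + rack_usd_5y / (hdd_per_rack * hdd_tb)
        qlc = qlc_usd_tb + rack_usd_5y / (qlc_per_rack * qlc_tb)
        printf "HDD: $%.2f/raw TB (5y)   QLC: $%.2f/raw TB (5y)\n", hdd, qlc
    }'

Add rows for replication/EC overhead, failure-handling labor, and performance per TB, and the comparison shifts quickly.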


Yeah, weight, power, PDU capacity, cooling, and sysadmin overhead are all things that tend to be overlooked unless you've got your sysadmins involved in the decision making.  The first time one of our large clusters spun up, all of the racks immediately shut off because the vendor didn't spec PDUs that could handle the power spike of multiple nodes spinning up at the same time.  We ended up having to stagger them; I don't remember whether they upgraded or added PDUs in the end.  These are the kinds of things you don't think about unless you are in the trenches.



— aad




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
