Re: Question if WAL/block.db partition will benefit us

> 
> It's absolutely important to think about the use case.  For most RGW cases I generally agree with you.  For something like HPC scratch storage you might have the opposite case where 3DWPD might be at the edge of what's tolerable.  Many years ago I worked for a supercomputing institute where a vendor had chosen less expensive low endurance SSDs for filesystem journals (And maybe cache?  It's been a while).  We started seeing them fail after about 6 months of constant usage.  I don't remember if we actually had blown past the endurance specs but we were pretty convinced at the time that the heavy write workload was the most likely candidate for the quick deaths.  Unfortunately in those machines the SSDs weren't in hotswap bays so replacement was painful.

I’ve been told of Ceph mon/boot drives that similarly burned out, with multiple overlapping failures after a short period of use.  I don’t know exactly what they were, as they predated me, but they may have been consumer / client / desktop units.

We don’t see a lot of discussion about overprovisioning and endurance management, but they can be valuable tools.  For example, in many scenarios drives may take quite a while to fill, even a year or two, as the service they provide grows users and workloads.  During that ramp, the drives see less write load than a steady-state estimate that assumes a full drive being continually overwritten.  If OSD drives are kept < 85% full, that 15% buffer can itself act as a sort of overprovisioning.  At least some SSDs also let users adjust overprovisioning directly; a slight bump can significantly increase endurance and even random write performance.
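
To make that concrete, here’s a rough back-of-the-envelope sketch of the endurance math.  Every number and write-amplification factor below is a hypothetical example, not a vendor figure; plug in your own drive specs and measured write rates.

# Back-of-the-envelope SSD endurance sketch.  All inputs are hypothetical;
# substitute your own drive specs and measured write rates.

def rated_tbw(capacity_tb, dwpd, warranty_years=5.0):
    """Terabytes written over the warranty period implied by the DWPD rating."""
    return capacity_tb * dwpd * 365 * warranty_years

def years_to_wearout(tbw, host_writes_tb_per_day, write_amplification):
    """How long the rated endurance lasts at a given host write rate and WAF."""
    nand_writes_per_day = host_writes_tb_per_day * write_amplification
    return tbw / nand_writes_per_day / 365

drive_tb = 7.68                      # hypothetical 1 DWPD drive
tbw = rated_tbw(drive_tb, dwpd=1.0)  # ~14,016 TB written over 5 years

# Same host write load, two assumed write-amplification factors.  Keeping an
# OSD below ~85% full leaves spare area that tends to lower WAF; the values
# here are illustrative assumptions, not measurements.
print(years_to_wearout(tbw, host_writes_tb_per_day=2.0, write_amplification=3.0))  # ~6.4 years
print(years_to_wearout(tbw, host_writes_tb_per_day=2.0, write_amplification=1.8))  # ~10.7 years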

Rated lifetimes are often on the conservative side too, for multiple good reasons.

> 
>> As Stefan writes, I invite everyone to do the math.  Factor in not only drive unit cost/TB, but also constraints like HDD/SATA size capping, power/cooling, how much RUs and racks cost in your DC, engineer/tech time spent dealing with failures and prolonged maintenance, and user experience.  And, subtly, when using dense chassis, eg. a 45-90 HDD 4U toploader, you may find that you can only fill each rack halfway due to PDU capacity and even weight limits.  Don’t get me started on the TCO impact of RAID HBAs ;)
> 
> 
> Yeah, weight/power/pdu cap/cooling/sysadmin overhead are all things that tend to be overlooked unless you've got your sysadmins involved in the decision making.

Everything I wrote is born of personal experience (read my HBA rant from a couple of years ago), including cases where a cluster was filling up and I couldn’t expand because the racks were out of RUs, amps, or even PDU outlets.  I’ve seen HDD OSD upweights that needed to be spread over 4 weeks to avoid significant user experience degradation.  When users are paying customers, goodwill can directly correspond to revenue.  This all factors into TCO, but it can be difficult to quantify.
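
For anyone who wants to attempt the quantification anyway, a crude sketch of folding those factors into a cost per usable TB might look like the following.  Every figure is a made-up placeholder; the point is the shape of the calculation, not the values.

# Crude TCO-per-usable-TB sketch over the life of the hardware.  All figures
# are placeholders; use your own quotes, DC rates, and staffing costs.

def tco_per_usable_tb(drive_cost, drives_per_node, drive_tb, nodes,
                      node_cost, ru_per_node, ru_cost_per_month,
                      watts_per_node, power_cost_per_kwh,
                      admin_hours_per_month, admin_hourly_rate,
                      replication_factor, years=5):
    months = years * 12
    capex = nodes * (node_cost + drives_per_node * drive_cost)
    space = nodes * ru_per_node * ru_cost_per_month * months
    power = nodes * watts_per_node / 1000 * 24 * 30 * months * power_cost_per_kwh
    people = admin_hours_per_month * admin_hourly_rate * months
    usable_tb = nodes * drives_per_node * drive_tb / replication_factor
    return (capex + space + power + people) / usable_tb

# Hypothetical 3x-replicated cluster of ten dense HDD toploader nodes.
print(tco_per_usable_tb(drive_cost=350, drives_per_node=24, drive_tb=18,
                        nodes=10, node_cost=9000, ru_per_node=4,
                        ru_cost_per_month=75, watts_per_node=800,
                        power_cost_per_kwh=0.12, admin_hours_per_month=20,
                        admin_hourly_rate=90, replication_factor=3))

Running it twice, once with dense HDD inputs and once with flash inputs, is one way to put the RU / power / labor arguments above on the same axis as drive unit cost.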

>  The first time one of our large clusters spun up all of the racks immediately shut off because the vendor didn't spec out PDUs that could handle the power spike from having multiple nodes spinning up at the same time.  We ended up having to stagger them.  I don't remember if they ended up upgrading/adding PDUs in the end.  These are the kinds of things though that you don't think about unless you are in the trenches.

Indeed.  I’ve heard of an operation that actually spins down entire nodes or racks dynamically, and I’ve witnessed an entire DC row lose power because its PDUs weren’t sufficient for the contracted current when someone spun up a large number of VMs at once.
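
For what it’s worth, staggering power-on is easy to script once you have BMC access.  A minimal sketch, assuming ipmitool and lanplus-reachable BMCs; the host list, credentials, and delay are placeholders.

# Stagger chassis power-on to avoid a simultaneous spin-up / inrush spike.
# Assumes each node's BMC is reachable via ipmitool over lanplus; the host
# list, credentials, and delay below are placeholders.
import subprocess
import time

BMC_HOSTS = ["node01-bmc", "node02-bmc", "node03-bmc"]   # hypothetical BMCs
DELAY_SECONDS = 30                                       # let inrush settle

for host in BMC_HOSTS:
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", host,
         "-U", "admin", "-P", "changeme", "chassis", "power", "on"],
        check=True,
    )
    time.sleep(DELAY_SECONDS)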

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



