Re: rgw: matching small objects to pools with small min_alloc_size

A bit of context, expanding on Casey’s spot-on summary:

Large (“Coarse”) IU QLC drives offer a new class of storage between popular TLC SSDs and HDDs.
They offer performance, density, and operational benefits over HDDs at a lower cost and TCO than TLC (up to 1.5PB raw per RU, and upweighting new OSDs in less than 4 weeks ;) ).
These are already being used successfully today in RGW deployments; our work here is to enhance their fit and appeal for RGW use cases.

> The problem as I understand it is having the data pool on larger Indirection Unit (e.g. 64k) QLC drives and when a user PUTs a lot of small objects (<< 64k) into this pool, it leads to lots of wasted space.

That’s part of it: the space amplification that Mark Nelson describes.  BlueStore will allocate no less than bluestore_min_alloc_size; for objects smaller than this value, storage is stranded / wasted.  This also holds true in a modulus / remainder fashion for objects sized within (but not exactly aligned to) small multiples of min_alloc_size.  Glossing over the head/tail factor for simplicity, an object sized (2 * min_alloc_size) + 1KB wastes space too.  This is exacerbated when EC is used, since Ceph AIUI only ever writes full EC stripes.
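A back-of-the-envelope sketch of that padding math (illustrative Python, not Ceph code; it also ignores EC, which makes things worse since each shard gets padded to min_alloc_size separately):

```python
def allocated_bytes(object_size, min_alloc_size=64 * 1024):
    """Bytes BlueStore actually allocates for one RADOS object's data:
    the object size rounded up to the next min_alloc_size boundary."""
    units = -(-object_size // min_alloc_size)   # ceiling division
    return units * min_alloc_size

def wasted_pct(object_size, min_alloc_size=64 * 1024):
    alloc = allocated_bytes(object_size, min_alloc_size)
    return 100.0 * (alloc - object_size) / alloc

print(wasted_pct(4 * 1024))              # 4KB object on 64KB IU: ~93.8% stranded
print(wasted_pct(2 * 64 * 1024 + 1024))  # (2 * min_alloc_size) + 1KB: ~32.8% stranded
print(wasted_pct(4 * 1024, 4 * 1024))    # same 4KB object on 4KB min_alloc_size: 0.0
```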

This is why Igor’s work was so valuable:  prior min_alloc_size defaulted as high as 64KB; decreasing the default can save a lot of space.  If you have predominantly or only large objects, then QLC can rock your rotational world today and we should talk ;)

Coarse IU QLC, though, performs and wears best when writes are sized and aligned to the IU size, which with current models is 16KB or 64KB.  SSD endurance is shrouded by no small measure of FUD: it turns out that few installations ever burn through a large fraction of rated endurance; moreover, drives are rated conservatively.  Optimizing writes is nonetheless desirable, so we want to set min_alloc_size to match the drive’s IU, which is thus at odds with the efficiency to be had from a smaller min_alloc_size.

bluestore_min_alloc_size is baked into OSDs at creation, and cannot be changed later.  There was discussion a few years back that led to this implementation decision; mine the archives and GitHub if interested.

Installations that only store “large” objects (or which can enforce bucket / storageclass segregation) can benefit from such media today.  Many installations, though, see a mix of small and large objects and it may not be feasible (or realistic) to rely on users consistently directing uploads to a default (small object) or non-default (large object) storageclass hosted on appropriate media.

That’s what we want to contribute:  an opt-in way for RGW to automagically sort objects into appropriate Ceph bucket pools based on size, e.g.:

“Small” objects ->> default pool on TLC (or even HDD, though .. well you know)
“Large” objects ->> non-default pool on economical coarse-IU QLC
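(For contrast, the manual segregation we don’t want to have to rely on looks roughly like this from the client side; a rough boto3 sketch, assuming the zone already defines a “QLC_LARGE” storageclass, which is a made-up name:)

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",   # hypothetical RGW endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# The user has to know to tag large uploads with the storage class that the
# zone placement maps to the QLC-backed data pool ("QLC_LARGE" is made up).
with open("big-backup.tar", "rb") as f:
    s3.put_object(Bucket="mybucket", Key="big-backup.tar", Body=f,
                  StorageClass="QLC_LARGE")

# Small objects just go to the default storage class / default data pool.
s3.put_object(Bucket="mybucket", Key="tiny.json", Body=b'{"ok": true}')
```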

An ingest cache on something like TLC or 3D crosspoint would be sweet, but is IMHO not the best solution:
* Added complexity means more places for incomplete operations and orphaned data
* Added complexity is at odds with a solution that aims to make the user experience actually simpler.
* We don’t want to add additional constraints to the viable cluster scale or especially to require additional specific hardware; that would limit who can benefit, and possibly incur additional CapEx — which is what we’re trying to improve in the first place.

There’s an RGW layout page in TFM; one aspect is that RGW objects all have a *head* RADOS object that contains metadata.  “Small” RGW objects can also store up to 4MB of payload data in the head RADOS object; large objects add additional *tail* RADOS objects.  That holds for the default storageclass only: non-default storageclasses never store payload data in the head object, a subtlety that is easily missed.

So one approach would be to let small RGW objects — and head RADOS objects for large RGW objects — go to the default storageclass, likely assigned to a Ceph pool built on 4KB min_alloc_size OSDs, as today.  Large / Tail objects would be directed to a non-default pool (storageclass?) built on QLC OSDs.  Notably head objects today always reside in the default pool AIUI so that RGW can find them.

Today up to rgw_max_chunk_size of payload data is stored in the head object for the default storage class; this is arguably an overload of the intent of that option/value.  We might factor in a new option, call it rgw_max_head_data or what-have-you, that could be set lower than today’s 4MB to cause RGW objects in, say, the 512KB+ size range to be stored in a secondary pool for extra credit.
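Very roughly, the decision logic I have in mind looks like this (a sketch only: rgw_max_head_data is the hypothetical option from above, not an existing setting, and the pool names are illustrative):

```python
# Hypothetical sketch only: rgw_max_head_data does not exist today, and the
# pool / storageclass names are illustrative, not taken from any real zone config.
RGW_MAX_HEAD_DATA = 512 * 1024              # proposed threshold, lower than today's 4MB

DEFAULT_POOL = "default.rgw.buckets.data"   # small min_alloc_size media (e.g. TLC, 4KB)
LARGE_POOL = "qlc.rgw.buckets.data"         # coarse-IU QLC, large min_alloc_size

def place(object_size: int) -> dict:
    """Where an RGW object's head and tail RADOS objects would land under the proposal."""
    if object_size <= RGW_MAX_HEAD_DATA:
        # Small object: all payload fits in the head, which stays in the default pool.
        return {"head": DEFAULT_POOL, "tail": None}
    # Large object: the head (metadata, little or no payload) stays in the default
    # pool so RGW can always find it; tail chunks carrying the bulk go to QLC.
    return {"head": DEFAULT_POOL, "tail": LARGE_POOL}

print(place(64 * 1024))          # small: head only, default pool
print(place(50 * 1024 * 1024))   # large: tail lands on the QLC-backed pool
```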

Storageclasses seem to offer a reasonably accessible way to segregate objects, though it’s been hinted that it might be possible to back a given storageclass with two (or more?!) pools transparently.

Another approach, as Curt alludes to, is using Lua scripting to conditionally rewrite RGW objects’ storageclass based on size.  Open questions:
- Assume that chunked / multipart uploads are all “large”?
- How to attach a Lua script to *all* object writes, vs. per-tenant, per-bucket, per-object, or whatever today’s granularity is
- Making sure that user-specified and unrelated Lua scripts are not displaced.

Mark Nelson wrote:

> Ah!  That makes sense. So to play the devil's advocate: If you have some combination of bulk QLC and a smaller amount of fast high endurance storage for WAL/DB, could something like dm-cache or opencas (or if necessarily modifications to bluefs) potentially serve the same purpose without doubling the number of pools required?


A keen idea; here are a few thoughts:

WAL/DB today don’t write optimally to coarse IU QLC, but they are such a small fraction of the overall write workload that the endurance factor is in the noise floor, or at least, dare I say, negligible.  I’ve done some initial research that suggests setting the RocksDB block_size to match the IU and enabling universal compaction may help both performance and write amplification, at the cost of some space amplification.  You sir are of course deeply versed in these subtleties so I welcome your thoughts, but I digress.

I’m told by those who have done in-depth comparisons that OpenCAS indeed works well to accelerate certain workloads (better than dm-cache or WAL+DB).  I don’t want to predicate this work on that, though — I want the barrier to entry to be as low as possible.  As noted, wasted / padded space for small RGW objects is the larger driving factor, which I think these approaches wouldn’t address.  BlueStore AIUI will coalesce sub-min_alloc_size writes to a certain extent, which may improve QLC WAF somewhat, but I don’t think that solves the larger problem.



I’m standing on the shoulders of giants here, and I value all feedback that helps steer our approach.  Perhaps the quickest and by far the dirtiest approach would be to rewrite the storageclass header on an LB/proxy before requests hit RGW, which we may experiment with, but we’d like to work toward a better solution.

If any of the above rambling is misguided or I’m just plain out of my Vulcan mind here, please do tell.

— Anthony



>  Ideally we would know an object's size on upload so we could steer objects >= IU to the QLC pool and then smaller objects to a TLC pool.  But, since we don't know the object size early enough in time, the suggestion is to put HEAD objects into a TLC-backed pool, which could handle the smaller objects, and then put tails, which implies a larger object, into the QLC-backed pool.  This is ideally without having the end S3 user having to configure anything.
> 
> Your devil's advocate idea is along those lines, I believe, of placing the small objects into cache and then larger objects into the backing storage device, I believe?  I don't know the exact implementation of Open-CAS, but caching to me implies temporal storage and if it's the small objects getting cached, they'll eventually get evicted/flushed out to the QLC pool which then causes our small object problem of wasting QLC space?
> 
> There was also some talk in the meeting about using the LUA framework to steer objects w/o any user intervention - that's something interesting too that I'll look at.
> 
> Any other suggestions or pointers are appreciated!
> 
> Thanks,
> - Curt
> 
> On Wed, Aug 18, 2021 at 2:10 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> 
> On 8/18/21 3:52 PM, Casey Bodley wrote:
> > On Wed, Aug 18, 2021 at 4:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >> Hi Casey,
> >>
> >>
> >> A while back Igor refactored the code in bluestore to allow us to have
> >> small min_alloc sizes on HDDs without a significant performance penalty
> >> (this was really great work btw Igor!).  The default now is a 4k
> >> min_alloc_size on both NVMe and HDD:
> >>
> >> https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
> >>
> >>
> >> There was a bug causing part of this change to increase write
> >> amplification dramatically on the DB/WAL device, but this has been
> >> (mostly) fixed as of last week.  It will still likely be somewhat higher
> >> than in Nautilus (not clear yet how much this is due to more metadata vs
> >> unnecessary deferred write flushing/compaction), but the space
> >> amplification benefit is very much worth it.
> >>
> >>
> >> Mark
> > thanks Mark! sorry i didn't capture much of the background here. we've
> > been working on this with Anthony from Intel (cc'ed), who summarized
> > it this way:
> >
> > * Coarse IU QLC SSDs are an appealing alternative to HDDs for Ceph,
> > notably for RGW bucket data
> > * BlueStore’s min_alloc_size is best aligned to the IU for performance
> > and endurance; today that means 16KB or 64KB depending on the drive
> > model
> > * That means that small RGW objects can waste a significant amount of
> > space, especially when EC is used
> > * Multiple bucket data pools with appropriate media can house small vs
> > large objects via StorageClasses, but today this requires consistent
> > user action, which is often infeasible.
> >
> > so the goal isn't to reduce the alloc size for small objects, but to
> > increase it for the large objects
> 
> 
> Ah!  That makes sense. So to play the devil's advocate: If you have some 
> combination of bulk QLC and a smaller amount of fast high endurance 
> storage for WAL/DB, could something like dm-cache or opencas (or if 
> necessarily modifications to bluefs) potentially serve the same purpose 
> without doubling the number of pools required?
> 

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



