Re: rgw: matching small objects to pools with small min_alloc_size

Coalescing replies for efficiency.

Re Mark’s thoughts:

> Yep, with larger min_alloc sizes you'll be increasing space-amp for small objects.  Definitely a cause for concern!  So regarding the idea:  We've been getting requests lately to increase the overall pool limits in ceph. 

Limits as in the total number of pools one can provision?

> Given that these devices are designed to store vast quantities of data, I don't think we should take the 2x pool multiplier requirement lightly.  It seems like the people that want to use these huge QLC drives may be the same people that want lots of pools. 

I don’t follow.  I might guess that people asking for lots of pools want to use them for an additional layer of tenant isolation, but I think that’s unrelated.  To be clear, the proposal here isn’t to double up all pools; it’s to add just one more pool to deployments that wish to benefit from SSD performance without the expense of substantial space amplification.  Deployments that don’t wish to exploit the ability would not add any pools at all.

> Generally speaking I'm hoping we can actually reduce the number of pools RGW needs rather than increasing it. 

For sure RGW historically has had a bunch of pools, with unclear uses.  I submit that adding a single pool only for deployments that opt in is not a big deal.

> I'm also a little concerned that splitting data across pools at the RGW layer is introducing a lot of complexity higher up that could be handled down closer to the disk. Sometimes it's inevitable that details like device indirection unit get exposed higher up the stack, but in this case I'm not sure we really need to.

Doing it at the RGW layer, though, frees us from having to reinvent the wheel in every OSD backend.  It also allows us to measure and manage the capacity of small and large objects independently, by adding appropriate OSDs to grow capacity in the desired pool - which is standard Ceph operational procedure.


>> Your devil's advocate idea is along those lines, I believe, of placing the small objects into cache and then larger objects into the backing storage device, I believe?  I don't know the exact implementation of Open-CAS, but caching to me implies temporal storage and if it's the small objects getting cached, they'll eventually get evicted/flushed out to the QLC pool which then causes our small object problem of wasting QLC space?
> 
> 
> One of the questions I had for the opencas guys a while back was whether or not you could provide hints to pin specific data to the fast device (say rocksdb L0 if we just wanted to have opencas handle the entire device instead of having bluefs use a separate partions for DB/WAL).  I believe Intel said this is possible, potentially by tagging a specific region of the block device as a pinned area.  Maybe one of the opencas folks from Intel can chime in here.  So potentially the way this could work would be the OSD looks at the size of the object or gets a hint from RGW or something, and then bluestore sees that and via bluefs writes the object to specific regions of the opencas backed storage

I don’t want to make OpenCAS a requirement here; that would substantially complicate and discourage deployments.  Plus the OSD has no way of knowing how large the RGW / S3 object is, and any given OSD only ever sees a fraction of a larger object.  This would also confound capacity management.  Ceph has pools for reasons; I don’t think splitting up capacity into potentially thousands of independent, relatively small partitions is practical.

Maybe we’re thinking of different things here.

> so that it stays pinned on optane or pdimm

We don’t want to require either of those technologies.  That would negate the goal, and likely cost as much as using expensive TLC instead with much less complexity.

OpenCAS, Optane, and Pmem are terrific technologies, but they’re orthogonal to what we’re trying to accomplish here.

> just like the associated rocksdb metadata.  Alternately we could teach bluefs itself to write this kind of data directly to the fast device and dm-cache/opencas would not strictly be necessary. 

s/fast/cost-effective?  Remember this is driven by space amplification, and reads from QLC are plenty fast.  Doing it in bluefs would mean having to duplicate the logic in Crimson and any other / future OSD backends.  And again we’d be massively fragmenting the capacity pool.


> One of the things I like about this approach is that bluefs already has to think about how to manage space between fast and slow devices (ie what to do when there's not enough space on the fast device to store SST files).  It seems to me that the problem of small objects on QLC is very similar.  There may be cases where you want to prioritize small objects being pinned on the fast device rather than seldomly accessed SST files (say for L2+).  We can better prioritize that by handling at the bluefs layer rather than having bluefs and rgw using different approaches to fight over who gets to put data on the fast devices.

I don’t think it’s similar, for the above reasons, plus an OSD’s RocksDB doesn’t (Firefly notwithstanding) grow indefinitely.  Again, this isn’t about device speed or caching.  Small objects can, e.g., be cached using Varnish on load balancers in front of RGW, but that’s not what we’re working on here.


Re Matt’s thoughts:

> - not form code perspective but from usability:
> 
> Distributing objects based on their (perceived) size across different pools in an opaque way may break data placement decisions or at least make those more complicated to configure those properly in medium sized clusters and may further complicate also load distribution (and assessment and design) because of unknown distribution scheme
> . Where I could think that somebody could easily live with placement of head objects into a special pool which can be specifically designed, the distribution based on object size might also complicate the configuration for tiering of objects and/or placement of objects of different purpose at designated costs. 

Placement of objects at designated costs is *exactly* what we’re working to accomplish here; this work will accomplish that, not confound it.  And again, this would be entirely opt-in, so it would not affect anyone else.  Head / small objects don’t need a specially designed pool, they can go into a normal pool like they do today, on 4KB-congruent IU media (HDD, TLC, Optane, etc).
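For concreteness, the plumbing an opt-in deployment would set up looks roughly like this (pool and storage class names are made up, and PG counts / EC profiles are deployment-specific):

    # Small/head-object pool on 4KB-IU media (replicated here; EC works too)
    ceph osd pool create default.rgw.buckets.data.small 64
    ceph osd pool application enable default.rgw.buckets.data.small rgw

    # Expose it as an additional storage class under the existing placement target
    radosgw-admin zonegroup placement add --rgw-zonegroup default \
        --placement-id default-placement --storage-class SMALL_OBJECTS
    radosgw-admin zone placement add --rgw-zone default \
        --placement-id default-placement --storage-class SMALL_OBJECTS \
        --data-pool default.rgw.buckets.data.small
    radosgw-admin period update --commit

The automatic size-based steering under discussion would sit on top of plumbing like this, so users wouldn’t have to pick a storage class themselves.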

> While respecting the different capabilities of devices and leveraging knowledge about those seems to be viable, I would rather see the duty of proper handling of those in the OSD layer.

I don’t think doing this at the OSD layer is viable, cf. my notes above.  Plus note how many people on ceph-users@ report difficulties with managing OSD deployment and lifecycle when OSDs span and share multiple devices.  There’s a lot of benefit to not requiring such a strategy.


> - for the code perspective and handling:
> 
> Further more, we then need to start to distinguish not only between HDD and flash in general but also between different kinds of flash.

We don’t; an operator wishing to make use of segregated pools would opt in by making the appropriate configuration.  But as a tangent, it wouldn’t be a bad idea to distinguish among different kinds of flash; today SATA and NVMe devices are treated (mostly?) identically, e.g. NVMe OSDs get assigned the `ssd` device class.  That can be retrofitted, but it’d be terrific to expand the utility of device classes for CRUSH and osd_spec rules, vs. having to manage them manually and/or maintain multi-root CRUSH trees.  But again that’s a tangent.
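To illustrate, device classes already let an operator steer pools today without multi-root gymnastics (the class and rule names below are arbitrary and the OSD IDs are placeholders):

    # Tag the coarse-IU QLC OSDs with their own device class
    ceph osd crush rm-device-class osd.10 osd.11
    ceph osd crush set-device-class qlc osd.10 osd.11

    # CRUSH rules targeting each class
    ceph osd crush rule create-replicated small-objects default host ssd
    ceph osd crush rule create-replicated bulk-objects default host qlc

    # Pin each pool to its rule
    ceph osd pool set default.rgw.buckets.data.small crush_rule small-objects
    ceph osd pool set default.rgw.buckets.data crush_rule bulk-objects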


> In this regard, how HDD backed pools would be used in such a way of determining the data placement inside the RGW ? Is then the HDD based pool with nowadays 4 K min_alloc_size  more suited for small objects, too ?

That’s up to the operator.  Ceph is all about allowing operators the freedom to architect clusters based on their needs.  In a deployment where the operator wishes to have Ceph assort S3 objects to appropriate media by S3 object size, they decide whether they want that head/small object pool to be HDD or something like TLC.  I would imagine that the latter would be more common, but that’s up to them, not me.


> I see there a way of complicating things for future development with little outcome over time when there is no good prediction possible for the use cases hitting a then given h/w configuration. 

The outcome over time is substantial and twofold:

* Deployments gain substantial operational, performance, and RU-density improvements over HDD at 25+% less cost/TB than TLC - and, depending on device size, potentially double the density of TLC, which factors into TCO.
* They do so without sacrificing a large fraction of that cost savings to space amplification and reduced durability.


> Also, I would see there a need for additional reporting capabilities to properly reflect the utilization of such specialized pools (and sets of h/w based on CRUSH map) to allow alerting for needed space upgrades. While we had no good differentiation for the existing device classes and CRUSH map structure, by introducing those specialized ones we'll getting more complex here.

That’s one of many reasons why putting it at the OSD layer *isn’t* the way to go.  With separate pools, the operator runs `ceph df detail` just like today and sees utilization the way we always have.  Capacity management is a matter of expanding one pool or the other, without having to redeploy OSDs, just like we’ve always done.
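i.e. nothing new; roughly what that looks like (pool names as in the sketch above):

    ceph df detail      # per-pool STORED vs USED shows the space-amp picture
    ceph osd df tree    # per-OSD / per-class utilization
    # Need more small-object capacity?  Add OSDs on 4KB-IU media under that pool's CRUSH rule.
    # Need more bulk capacity?  Add QLC OSDs.  No OSD redeployment either way.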


> - for the handling of the devices from an OSD perspective:
> 
> Introducing special handling for special devices might be tricky for future device types and would require some kind of automation, at least to detect those. It not only introduces special devices in terms of performance but also in specialized handling; the question would be how to handle future h/w specialized on something else - do we want to rely on the availability of those in the future or wouldn't it be better to have a generic way of adapting to it ?
> Manual tuning of OSDs towards such special kinds of h/w features might be an acceptable way to capitalize on such things - it would acknowledge that those different behaviors exist and could be respected but wouldn't necessarily require special handling in code tied to this special h/w. 

No specialized handling of devices is needed or planned here. 


> Introducing a manual tuning of the min_alloc_size per OSD during the configuration could help to avoid such kind of specialized h/w but would also put the burden of setting this appropriately to the admin introducing the special device for a reason.

Sage asked me to separately have min_alloc_size set automatically at OSD creation; I’m working on that independently.  The kernel provides an optimal_io_size attribute that we can leverage.  If it’s 4k or not present, or the behavior is disabled, nothing changes.
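Roughly the shape of it, using knobs that already exist (the automatic path is still being worked out; the device name below is a placeholder):

    # What the kernel reports as the drive's preferred I/O granularity
    cat /sys/block/nvme0n1/queue/optimal_io_size   # e.g. 16384 on a 16KB-IU drive, 0 if not reported

    # Today an operator can match it manually before the OSD is created
    # (min_alloc_size is baked in at mkfs time and can't be changed afterward)
    ceph config set osd bluestore_min_alloc_size_ssd 16384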

> Then, setting the min_alloc_size to something special would change also the performance needs for RocksDB/WAL depending on the config  chosen

When RocksDB is colocated with the OSD, sure, but this is a second-order concern.  Maybe the operator specifies RocksDB block_size and universal compaction, or maybe they don’t worry about it.  If we do nothing, we’re still way better off than HDD.  This is being done successfully today in production.


> - this could be a manual decision here but in contrast automation would need to add special code to detect it which needs additional maintenance. Also, I see there a lot of special tuning and investigation needed that is only geared to support those devices if tied to the device type. Although manual tuning opens a different set of degrees of performance tuning challenges, it also allows to base this on best practices (to be documented) but also a way to tweak this when not appropriately matching without immediate code changes.
> We changed the way of allocation for HDD in favor of having better capacity utilization on the devices - however, this changes in mixed environments the usability of HDD for fully concurrent mid-sized IOs which is not always a good match, especially for upgrades to existing clusters. A manual tuning could mitigate this as well for any device with special capabilities in data handling, like the new devices discussed.

I don’t think any automation code needs to be added.  Operators who opt in create the supplemental pool and adjust RGW configuration to match.

Tuning has always been a thing and always will be; this doesn’t change that.  And again, it’s optional.





> 
> 
>> 
>> There was also some talk in the meeting about using the LUA framework to steer objects w/o any user intervention - that's something interesting too that I'll look at.
>> 
>> Any other suggestions or pointers are appreciated!
>> 
>> Thanks,
>> - Curt
>> 
>> On Wed, Aug 18, 2021 at 2:10 PM Mark Nelson <mnelson@xxxxxxxxxx <mailto:mnelson@xxxxxxxxxx>> wrote:
>> 
>> 
>>   On 8/18/21 3:52 PM, Casey Bodley wrote:
>>> On Wed, Aug 18, 2021 at 4:20 PM Mark Nelson <mnelson@xxxxxxxxxx
>>   <mailto:mnelson@xxxxxxxxxx>> wrote:
>>>> Hi Casey,
>>>> 
>>>> 
>>>> A while back Igor refactored the code in bluestore to allow us
>>   to have
>>>> small min_alloc sizes on HDDs without a significant performance
>>   penalty
>>>> (this was really great work btw Igor!).  The default now is a 4k
>>>> min_alloc_size on both NVMe and HDD:
>>>> 
>>>> 
>>   https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
>>   <https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284>
>>>> 
>>>> 
>>>> There was a bug causing part of this change to increase write
>>>> amplification dramatically on the DB/WAL device, but this has been
>>>> (mostly) fixed as of last week.  It will still likely be
>>   somewhat higher
>>>> than in Nautilus (not clear yet how much this is due to more
>>   metadata vs
>>>> unnecessary deferred write flushing/compaction), but the space
>>>> amplification benefit is very much worth it.
>>>> 
>>>> 
>>>> Mark
>>> thanks Mark! sorry i didn't capture much of the background here.
>>   we've
>>> been working on this with Anthony from Intel (cc'ed), who summarized
>>> it this way:
>>> 
>>> * Coarse IU QLC SSDs are an appealing alternative to HDDs for Ceph,
>>> notably for RGW bucket data
>>> * BlueStore’s min_alloc_size is best aligned to the IU for
>>   performance
>>> and endurance; today that means 16KB or 64KB depending on the drive
>>> model
>>> * That means that small RGW objects can waste a significant
>>   amount of
>>> space, especially when EC is used
>>> * Multiple bucket data pools with appropriate media can house
>>   small vs
>>> large objects via StorageClasses, but today this requires consistent
>>> user action, which is often infeasible.
>>> 
>>> so the goal isn't to reduce the alloc size for small objects, but to
>>> increase it for the large objects
>> 
>> 
>>   Ah!  That makes sense. So to play the devil's advocate: If you
>>   have some
>>   combination of bulk QLC and a smaller amount of fast high endurance
>>   storage for WAL/DB, could something like dm-cache or opencas (or if
>>   necessarily modifications to bluefs) potentially serve the same
>>   purpose
>>   without doubling the number of pools required?
>> 
>>   _______________________________________________
>>   Dev mailing list -- dev@xxxxxxx <mailto:dev@xxxxxxx>
>>   To unsubscribe send an email to dev-leave@xxxxxxx
>>   <mailto:dev-leave@xxxxxxx>
>> 
> 

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



