On 8/18/21 6:50 PM, Curt Bruns wrote:
Hi Mark / all,
I'm working with Anthony on this QLC optimization idea at Intel and
plan on doing the testing/dev work, if needed. Thanks for playing
devil's advocate - we're brainstorming all ideas right now.
The problem as I understand it: the data pool sits on QLC drives with a
larger Indirection Unit (e.g. 64k), and when a user PUTs a lot of small
objects (<< 64k) into that pool, it leads to a lot of wasted space.
Ideally we would know an object's size at upload time so we could steer
objects >= the IU to the QLC pool and smaller objects to a TLC pool.
But since we don't know the object size early enough, the suggestion is
to put HEAD objects into a TLC-backed pool, which can handle the
smaller objects, and put tails, which imply a larger object, into the
QLC-backed pool. Ideally this would happen without the end S3 user
having to configure anything.
Yep, with larger min_alloc sizes you'll be increasing space-amp for
small objects. Definitely a cause for concern! So regarding the idea:
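Just to put rough numbers on the space-amp concern, here's a quick
back-of-the-envelope sketch. It's pure allocation arithmetic, nothing
Ceph-specific, and the 4+2 EC profile is only an illustration:

```python
import math

def space_amp(object_size: int, min_alloc: int) -> float:
    """Space amplification when each allocation is rounded up to min_alloc."""
    allocated = math.ceil(object_size / min_alloc) * min_alloc
    return allocated / object_size

def ec_space_amp(object_size: int, k: int, m: int, min_alloc: int) -> float:
    """Same idea under k+m erasure coding: the object is split into k data
    chunks (plus m coding chunks) and every chunk is rounded up separately."""
    chunk = math.ceil(object_size / k)
    allocated = math.ceil(chunk / min_alloc) * min_alloc * (k + m)
    return allocated / object_size

print(space_amp(4096, 64 * 1024))           # 4 KiB object, 64 KiB IU -> 16.0x
print(space_amp(4096, 4096))                # 4 KiB object, 4 KiB IU  -> 1.0x
print(ec_space_amp(4096, 4, 2, 64 * 1024))  # 4+2 EC on a 64 KiB IU   -> 96.0x
```

So a replicated 4 KiB object already burns 16x its size per copy on a
64 KiB-IU drive, and EC makes it dramatically worse because every chunk
gets rounded up independently.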
We've been getting requests lately to increase the overall pool limits
in ceph. Given that these devices are designed to store vast quantities
of data, I don't think we should take the 2x pool multiplier requirement
lightly. It seems like the people that want to use these huge QLC
drives may be the same people that want lots of pools. Generally
speaking I'm hoping we can actually reduce the number of pools RGW needs
rather than increasing it. I'm also a little concerned that splitting
data across pools at the RGW layer is introducing a lot of complexity
higher up that could be handled down closer to the disk. Sometimes it's
inevitable that details like device indirection unit get exposed higher
up the stack, but in this case I'm not sure we really need to.
Your devil's advocate idea is along those lines, I believe: placing the
small objects into cache and the larger objects onto the backing
storage device? I don't know the exact implementation of Open-CAS, but
caching to me implies temporary storage, and if it's the small objects
getting cached, they'll eventually get evicted/flushed out to the QLC
pool, which brings back our small-object problem of wasting QLC space.
One of the questions I had for the opencas guys a while back was
whether or not you could provide hints to pin specific data to the fast
device (say rocksdb L0, if we wanted opencas to handle the entire
device instead of having bluefs use separate partitions for DB/WAL). I
believe Intel said this is possible, potentially by tagging a specific
region of the block device as a pinned area. Maybe one of the opencas
folks from Intel can chime in here. So potentially the way this could
work is: the OSD looks at the size of the object (or gets a hint from
RGW or something), bluestore sees that, and via bluefs writes the
object to specific regions of the opencas-backed storage so that it
stays pinned on optane or pdimm just like the associated rocksdb
metadata.
Alternately we could teach bluefs itself to write this kind of data
directly to the fast device, and dm-cache/opencas would not strictly be
necessary. One of the things I like about this approach is that bluefs
already has to think about how to manage space between fast and slow
devices (i.e. what to do when there's not enough space on the fast
device to store SST files). It seems to me that the problem of small
objects on QLC is very similar. There may be cases where you want to
prioritize pinning small objects on the fast device over
seldom-accessed SST files (say for L2+). We can prioritize that better
by handling it at the bluefs layer rather than having bluefs and RGW
use different approaches and fight over who gets to put data on the
fast devices.
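To make that concrete, here's a minimal sketch of the size-threshold
placement decision I have in mind. None of these names correspond to
real BlueStore/bluefs interfaces - it's just the decision logic,
assuming a 64 KiB IU on the slow device:

```python
# Hypothetical sketch of size-based placement at the bluefs layer.
# FAST/SLOW and place_write are illustrative names, not actual Ceph APIs.
FAST = "fast (TLC/Optane, small min_alloc)"
SLOW = "slow (QLC, 64 KiB IU)"

def place_write(size: int, fast_free: int,
                slow_min_alloc: int = 64 * 1024) -> str:
    """Pin writes smaller than the slow device's IU to the fast device,
    spilling to the slow device when the fast device is out of space."""
    if size < slow_min_alloc and fast_free >= size:
        return FAST
    return SLOW

print(place_write(4 * 1024, fast_free=1 << 30))   # small object -> FAST
print(place_write(1 << 20, fast_free=1 << 30))    # 1 MiB object -> SLOW
print(place_write(4 * 1024, fast_free=0))         # fast device full -> SLOW
```

A real implementation would also need the eviction/priority piece -
deciding when pinned small objects outrank cold L2+ SSTs - but the
entry point is the same size check bluefs would do anyway when
allocating space.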
There was also some talk in the meeting about using the Lua framework
to steer objects without any user intervention - that's something
interesting too that I'll look at.
Any other suggestions or pointers are appreciated!
Thanks,
- Curt
On Wed, Aug 18, 2021 at 2:10 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
On 8/18/21 3:52 PM, Casey Bodley wrote:
> On Wed, Aug 18, 2021 at 4:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> Hi Casey,
>>
>>
>> A while back Igor refactored the code in bluestore to allow us to
>> have small min_alloc sizes on HDDs without a significant performance
>> penalty (this was really great work btw Igor!). The default now is a
>> 4k min_alloc_size on both NVMe and HDD:
>>
>> https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
>>
>>
>> There was a bug causing part of this change to increase write
>> amplification dramatically on the DB/WAL device, but this has been
>> (mostly) fixed as of last week. It will still likely be somewhat
>> higher than in Nautilus (not clear yet how much this is due to more
>> metadata vs unnecessary deferred write flushing/compaction), but the
>> space amplification benefit is very much worth it.
>>
>>
>> Mark
> thanks Mark! sorry i didn't capture much of the background here. we've
> been working on this with Anthony from Intel (cc'ed), who summarized
> it this way:
>
> * Coarse IU QLC SSDs are an appealing alternative to HDDs for Ceph,
> notably for RGW bucket data
> * BlueStore’s min_alloc_size is best aligned to the IU for performance
> and endurance; today that means 16KB or 64KB depending on the drive
> model
> * That means that small RGW objects can waste a significant amount of
> space, especially when EC is used
> * Multiple bucket data pools with appropriate media can house small vs
> large objects via StorageClasses, but today this requires consistent
> user action, which is often infeasible.
>
> so the goal isn't to reduce the alloc size for small objects, but to
> increase it for the large objects
Ah! That makes sense. So to play the devil's advocate: if you have some
combination of bulk QLC and a smaller amount of fast high-endurance
storage for WAL/DB, could something like dm-cache or opencas (or, if
necessary, modifications to bluefs) potentially serve the same purpose
without doubling the number of pools required?
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx