Re: rgw: matching small objects to pools with small min_alloc_size

Hi Mark / all,

I'm working with Anthony on this QLC optimization idea at Intel and plan on doing the testing/dev work, if needed.  Thanks for playing devil's advocate - we're brainstorming all ideas right now.

The problem as I understand it: the data pool sits on QLC drives with a large Indirection Unit (e.g. 64k), and when a user PUTs a lot of small objects (<< 64k) into that pool, it leads to a lot of wasted space.  Ideally we would know an object's size at upload time so we could steer objects >= the IU to the QLC pool and smaller objects to a TLC pool.  Since we don't know the object size early enough, the suggestion is to put HEAD objects into a TLC-backed pool, which can absorb the small objects, and put tails, which imply a larger object, into the QLC-backed pool.  Ideally this happens without the end S3 user having to configure anything.
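For reference, here's roughly what the two-pool layout looks like today if it's done with storage classes and explicit user action (the thing we're trying to avoid).  The placement, pool, and class names below are just made up for illustration:

    # Keep the QLC-backed pool as the STANDARD storage class, and add a
    # TLC-backed pool as a second storage class under the same placement:
    radosgw-admin zonegroup placement add --rgw-zonegroup default \
        --placement-id default-placement --storage-class TLC-SMALL
    radosgw-admin zone placement add --rgw-zone default \
        --placement-id default-placement --storage-class TLC-SMALL \
        --data-pool default.rgw.tlc-small.data

The catch is that the client then has to opt in per object (e.g. send an "x-amz-storage-class: TLC-SMALL" header on the PUT), which is exactly the consistent user action Anthony called out as infeasible.  The HEAD/tail split above would get a similar effect transparently.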

Your devil's advocate idea is along those lines, I believe: place the small objects into the cache and the larger objects onto the backing storage device?  I don't know the exact implementation of Open-CAS, but caching to me implies temporary storage, and if it's the small objects getting cached, they'll eventually get evicted/flushed out to the QLC pool, which brings us right back to our small-object problem of wasting QLC space?

There was also some talk in the meeting about using the Lua scripting framework to steer objects without any user intervention - that's interesting too, and something I'll look at (rough sketch below).
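For what it's worth, here is a very rough sketch of what such a Lua script might look like, assuming the preRequest context exposes the request's content length and a writable storage-class field - I haven't checked which of these the current framework actually provides, so the field names may well be off:

    -- preRequest sketch: steer small PUTs to a TLC-backed storage class.
    -- Assumes Request.ContentLength and a writable Request.HTTP.StorageClass
    -- exist; "TLC-SMALL" is a hypothetical storage class name.
    local IU = 64 * 1024  -- indirection unit of the QLC drives
    if Request.HTTP.Method == "PUT" and Request.ContentLength > 0
       and Request.ContentLength < IU then
      Request.HTTP.StorageClass = "TLC-SMALL"
    end

The obvious catch is that this only helps when the object size is known up front from the request, which loops back to the "we don't know the size early enough" problem for some upload paths.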

Any other suggestions or pointers are appreciated!

Thanks,
- Curt

On Wed, Aug 18, 2021 at 2:10 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 8/18/21 3:52 PM, Casey Bodley wrote:
> On Wed, Aug 18, 2021 at 4:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> Hi Casey,
>>
>>
>> A while back Igor refactored the code in bluestore to allow us to have
>> small min_alloc sizes on HDDs without a significant performance penalty
>> (this was really great work btw Igor!).  The default now is a 4k
>> min_alloc_size on both NVMe and HDD:
>>
>> https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
>>
>>
>> There was a bug causing part of this change to increase write
>> amplification dramatically on the DB/WAL device, but this has been
>> (mostly) fixed as of last week.  It will still likely be somewhat higher
>> than in Nautilus (not clear yet how much this is due to more metadata vs
>> unnecessary deferred write flushing/compaction), but the space
>> amplification benefit is very much worth it.
>>
>>
>> Mark
> thanks Mark! sorry i didn't capture much of the background here. we've
> been working on this with Anthony from Intel (cc'ed), who summarized
> it this way:
>
> * Coarse IU QLC SSDs are an appealing alternative to HDDs for Ceph,
> notably for RGW bucket data
> * BlueStore’s min_alloc_size is best aligned to the IU for performance
> and endurance; today that means 16KB or 64KB depending on the drive
> model
> * That means that small RGW objects can waste a significant amount of
> space, especially when EC is used
> * Multiple bucket data pools with appropriate media can house small vs
> large objects via StorageClasses, but today this requires consistent
> user action, which is often infeasible.
>
> so the goal isn't to reduce the alloc size for small objects, but to
> increase it for the large objects


Ah!  That makes sense. So to play the devil's advocate: if you have some
combination of bulk QLC and a smaller amount of fast, high-endurance
storage for WAL/DB, could something like dm-cache or Open CAS (or, if
necessary, modifications to BlueFS) potentially serve the same purpose
without doubling the number of pools required?

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
