Re: rgw: matching small objects to pools with small min_alloc_size

On 8/18/21 3:52 PM, Casey Bodley wrote:
On Wed, Aug 18, 2021 at 4:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Hi Casey,


A while back Igor refactored the bluestore code to allow small
min_alloc_size values on HDDs without a significant performance
penalty (this was really great work btw Igor!).  The default is now a
4k min_alloc_size on both NVMe and HDD:

https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
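
FWIW, for anyone who wants to override the defaults, something like
this should work (these are the option names from the file linked
above; note that min_alloc_size is baked in when an OSD is created,
so it only affects OSDs deployed after the change):

    # only applies to newly created OSDs; existing OSDs keep the
    # min_alloc_size they were built with
    ceph config set osd bluestore_min_alloc_size_hdd 4096
    ceph config set osd bluestore_min_alloc_size_ssd 4096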


There was a bug causing part of this change to dramatically increase
write amplification on the DB/WAL device, but that has been (mostly)
fixed as of last week.  Write amplification will likely still be
somewhat higher than in Nautilus (it's not yet clear how much of that
is due to more metadata vs unnecessary deferred write
flushing/compaction), but the space amplification benefit is very
much worth it.


Mark
thanks Mark! sorry i didn't capture much of the background here. we've
been working on this with Anthony from Intel (cc'ed), who summarized
it this way:

* Coarse IU QLC SSDs are an appealing alternative to HDDs for Ceph,
notably for RGW bucket data
* BlueStore’s min_alloc_size is best aligned to the IU for performance
and endurance; today that means 16KB or 64KB depending on the drive
model
* That means that small RGW objects can waste a significant amount of
space, especially when EC is used
* Multiple bucket data pools on appropriate media can house small vs
large objects via StorageClasses, but today this requires consistent
user action, which is often infeasible (see the sketch below)
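
roughly the setup we have in mind (the pool and class names here are
just examples): add a second storage class under the default
placement target and point it at a pool on the coarse-IU drives, e.g.

    radosgw-admin zonegroup placement add \
          --rgw-zonegroup default \
          --placement-id default-placement \
          --storage-class QLC
    radosgw-admin zone placement add \
          --rgw-zone default \
          --placement-id default-placement \
          --storage-class QLC \
          --data-pool default.rgw.qlc.data
    # (multisite setups also need a 'radosgw-admin period update --commit')

the "consistent user action" problem is that clients then have to opt
in on every upload, e.g. by sending an x-amz-storage-class: QLC
header on each PUT, or admins have to configure lifecycle transitions
per bucket.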

so the goal isn't to reduce the alloc size for small objects, but to
increase it for the large objects
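
to put rough numbers on it: with a 64KB min_alloc_size and a 4+2 EC
pool, even a 4KB object allocates at least 64KB on each of the 6
shards, i.e. ~384KB of raw space for 4KB of data (ignoring metadata).
with a 4KB min_alloc_size the same object takes ~24KB raw, which is
close to the unavoidable EC overhead plus padding.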


Ah!  That makes sense.  So to play devil's advocate: if you have some
combination of bulk QLC and a smaller amount of fast, high-endurance
storage for WAL/DB, could something like dm-cache or opencas (or, if
necessary, modifications to bluefs) potentially serve the same
purpose without doubling the number of pools required?

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



