Re: Bluestore min_alloc size space amplification cheatsheet

Question: lower alloc sizes seem like a work-around. Would it be
better to have an option to store small objects entirely in rocksdb?

I've seen a lot of setups that would really benefit from such a
feature. It's usually a case of mixed object sizes with a lot of very
small (< 10 KB) objects and some large objects that unfortunately can't
be mapped to different directories/buckets/pools on a higher layer.

(I think I did see some pull request that did something like that some
time ago?)
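
To put rough numbers on the small-object case, here is a minimal sketch
of the rounding arithmetic. It assumes only that each object's data is
rounded up to a whole number of allocation units; it ignores onode and
rocksdb metadata, replication, EC and compression, and the helper names
are mine, not BlueStore's.

```python
KIB = 1024


def allocated_bytes(object_size: int, alloc_unit: int) -> int:
    """Round object_size up to a whole number of allocation units."""
    if object_size == 0:
        return 0
    units = -(-object_size // alloc_unit)  # ceiling division
    return units * alloc_unit


def amplification(object_size: int, alloc_unit: int) -> float:
    """Ratio of allocated bytes to logical bytes for one object."""
    return allocated_bytes(object_size, alloc_unit) / object_size


# Compare a 4K and a 64K allocation unit for a few object sizes.
for size in (1 * KIB, 10 * KIB, 100 * KIB):
    for unit in (4 * KIB, 64 * KIB):
        print(f"{size // KIB:>4} KiB object, {unit // KIB:>2} KiB unit: "
              f"{allocated_bytes(size, unit) // KIB:>4} KiB allocated, "
              f"x{amplification(size, unit):.2f}")
```

Under these assumptions a 10 KiB object occupies 64 KiB with a 64K
unit (x6.40) and 12 KiB with a 4K unit (x1.20).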


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 22, 2019 at 3:08 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
>
> On 11/22/2019 2:01 PM, Igor Fedotov wrote:
>
>
> On 11/22/2019 12:50 AM, Sage Weil wrote:
>
> The key difference (at the bluestore level) between RGW and RBD/CephFS
> writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
> CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints.  The immutable one in
> particular is what we really care about, since overwrites of mutable
> objects are what cause (most) fragmentation.  We should use this hint
> to decide whether to create minimal (min_alloc_size) blobs or whether we
> should keep the blobs larger to limit fragmentation.
>
> I'm not sure what we would call the config option that isn't super
> confusing... maybe bluestore_mutable_min_blob_size?
> bluestore_baseline_min_blob_size?
>
> I'm thinking about introducing the following behavior for spinner-based stores (preliminary notes for now):
>
> 1) Unconditionally pin the allocator's granularity to 4K (the block size), i.e. decouple it from the min_alloc_size setting.
>
> This will probably add overhead in the allocator when it needs to allocate contiguous 64K blobs (see 3) below).
>
> But it looks like the new avl allocator is great at dealing with that.
>
> 2) Allocate space using the 4K block size for small (<=32K or <=48K or <60K?) objects tagged with the IMMUTABLE flag.
>
> 3) Use the existing min_alloc_size default (64K) for all(?) other objects.
>
> 4) Tag each object with the alloc size it's using.
>
> ------- The above is enough to deal with RGW space amplification for small objects, which is the worst case for now IMO. -------
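
Points 1) to 4) above can be sketched roughly as follows. This is only
an illustration of the proposed decision, not BlueStore code: the Onode
class, choose_alloc_size() and place() are hypothetical names, the flag
bit values are assumed for the example, and the 32K cutoff is just one
of the candidate thresholds listed in 2).

```python
from dataclasses import dataclass

# Assumed bit values, for illustration only.
FLAG_APPEND_ONLY = 1 << 4
FLAG_IMMUTABLE = 1 << 5

BLOCK_SIZE = 4 * 1024            # allocator granularity, point 1)
MIN_ALLOC_SIZE = 64 * 1024       # existing default for spinners, point 3)
SMALL_OBJECT_LIMIT = 32 * 1024   # undecided: <=32K, <=48K or <60K


@dataclass
class Onode:
    """Hypothetical stand-in for the per-object metadata."""
    size: int
    alloc_hint_flags: int
    alloc_size: int = 0  # point 4): the alloc size the object uses


def choose_alloc_size(size: int, flags: int,
                      small_limit: int = SMALL_OBJECT_LIMIT) -> int:
    """Point 2): 4K for small IMMUTABLE objects; point 3): 64K otherwise."""
    immutable = bool(flags & FLAG_IMMUTABLE)
    if immutable and size <= small_limit:
        return BLOCK_SIZE
    return MIN_ALLOC_SIZE


def place(size: int, flags: int) -> Onode:
    """Create an onode tagged with the alloc size chosen for it."""
    return Onode(size, flags, choose_alloc_size(size, flags))


rgw_small = place(10 * 1024, FLAG_IMMUTABLE | FLAG_APPEND_ONLY)
rbd_small = place(10 * 1024, 0)
print(rgw_small.alloc_size, rbd_small.alloc_size)
```

With these assumptions a 10 KiB RGW-style object gets a 4K alloc size,
while an unhinted object of the same size stays at 64K.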
>
> Additional points to consider:
>
> 5) Add a per-object hint (more flexible but more complex) or a per-store setting to optimize onode keeping for size vs. for speed. Maybe even split the latter into optimize for reads vs. optimize for overwrites?
>
> One can read 'per-pool' instead of 'per-store' above.
>
>
> EC and/or other concerned entries to indicate "OPTIMIZE_FOR_SPACE". Unconditionally or using some logic?
>
> 6) Apply the appropriate alloc size depending on the strategy determined at 5).
>
> 7) An even more versatile approach would be to vary the alloc size at the onode's blob level, i.e. tag the blob rather than the object with the applied alloc size (this needs just a single bit, in fact).
>
> This may be useful for dealing with space amplification for tail blobs or sparse objects.
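
For the tail-blob case in 7), a rough sketch of what a per-blob tag
could buy. The layout below is a deliberate simplification (full 64K
blobs followed by one 4K-granular tail blob); real blob layout is more
involved, and layout() and its boolean "small alloc" bit are
hypothetical names, not existing code.

```python
BLOCK_SIZE = 4 * 1024
MIN_ALLOC_SIZE = 64 * 1024


def round_up(n: int, unit: int) -> int:
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit


def layout(object_size: int) -> list[tuple[int, bool]]:
    """Return (allocated_bytes, small_alloc_bit) for each blob.

    Full blobs use MIN_ALLOC_SIZE; only the tail blob is tagged with
    the single bit that selects BLOCK_SIZE granularity.
    """
    full, tail = divmod(object_size, MIN_ALLOC_SIZE)
    blobs = [(MIN_ALLOC_SIZE, False) for _ in range(full)]
    if tail:
        blobs.append((round_up(tail, BLOCK_SIZE), True))
    return blobs


def allocated_per_blob(object_size: int) -> int:
    """Total allocation when the tail blob may use 4K granularity."""
    return sum(size for size, _ in layout(object_size))


def allocated_per_object(object_size: int) -> int:
    """Total allocation when the whole object uses MIN_ALLOC_SIZE."""
    return round_up(object_size, MIN_ALLOC_SIZE)


size = 70 * 1024
print(allocated_per_object(size) // 1024, "KiB vs",
      allocated_per_blob(size) // 1024, "KiB")
```

Under this simplified model a 70 KiB object takes 128 KiB with a
uniform 64K alloc size and 72 KiB when only the tail blob drops to 4K.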
>
>
> Thoughts?
>
>
> Thanks,
>
> Igor
>
> sage
>
>
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx