Question: lower alloc sizes seem like a work-around. Would it be better to have an option to store small objects entirely in RocksDB? I've seen a lot of setups that would really benefit from such a feature. It's usually a case of mixed object sizes with a lot of very small (< 10 KB) objects and some large objects that unfortunately can't be mapped to different directories/buckets/pools on a higher layer. (I think I did see some pull request that did something like that some time ago?)

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 22, 2019 at 3:08 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
> On 11/22/2019 2:01 PM, Igor Fedotov wrote:
>
> On 11/22/2019 12:50 AM, Sage Weil wrote:
>
> The key difference (at the bluestore level) between RGW and RBD/CephFS writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE | CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints. The immutable one in particular is what we really care about, since it's the mutable objects that get overwrites that lead to (most) fragmentation. We should use this to decide whether to create minimal (min_alloc_size) blobs or whether we should keep the blobs larger to limit fragmentation.
>
> I'm not sure what we would call the config option that isn't super confusing... maybe bluestore_mutable_min_blob_size? bluestore_baseline_min_blob_size?
>
> I'm thinking about introducing the following behavior for spinner-based stores (preliminary notes for now):
>
> 1) Unconditionally pin the allocator's granularity to 4K (the block size), i.e. untie it from the min_alloc_size setting.
>
> This will probably result in additional overhead in the allocator when one needs to allocate contiguous 64K blobs (see 3) below). But it looks like the new AVL allocator is great at dealing with that.
>
> 2) Allocate space using a 4K block size for small (<=32K, <=48K, or <60K?) objects tagged with the IMMUTABLE flag.
>
> 3) Use the existing min_alloc_size defaults (64K) for all(?) other objects.
>
> 4) Tag each object with the alloc size it's using.
>
> ------- The above is enough to deal with RGW space amplification for small objects, which is the worst case for now IMO. -------
>
> Additional points to consider:
>
> 5) Add a per-object hint (more flexible and complex) or a per-store setting to optimize onode keeping for size vs. for speed. Maybe even split the latter into optimize for reads vs. optimize for overwrites?
>
> One can read 'per-pool' instead of 'per-store' above.
>
> EC and/or other concerned entities to indicate "OPTIMIZE_FOR_SPACE". Unconditionally or using some logic?
>
> 6) Apply the appropriate alloc size depending on the strategy determined in 5).
>
> 7) An even more versatile approach would be to vary the alloc size at the onode's blob level, hence tag the blob rather than the object with the applied alloc size (a single bit is enough, in fact). This may be useful to deal with space amplification for tail blobs or sparse objects.
>
> Thoughts?
>
> Thanks,
> Igor
>
> sage
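To make points 1)-4) of the quoted proposal concrete, here is a minimal sketch (not actual BlueStore code; the flag values, the 32K cutoff, and the choose_alloc_size helper are illustrative assumptions only) of how an allocation unit could be picked per object from the alloc hints:

    #include <cstdint>

    // Illustrative stand-ins for the CEPH_OSD_ALLOC_HINT_FLAG_* values;
    // the real constants live in the Ceph headers.
    constexpr uint32_t HINT_IMMUTABLE   = 1u << 5;
    constexpr uint32_t HINT_APPEND_ONLY = 1u << 4;

    constexpr uint64_t BLOCK_SIZE       = 4  * 1024;  // 1) allocator granularity
    constexpr uint64_t HDD_MIN_ALLOC    = 64 * 1024;  // 3) existing HDD default
    constexpr uint64_t SMALL_OBJ_CUTOFF = 32 * 1024;  // 2) "<=32K, <=48K, or <60K?"

    // Choose the allocation unit for a new object, to be recorded on the
    // onode per point 4). Small write-once objects get 4K units to avoid
    // space amplification; everything else keeps the large default so that
    // overwrites of mutable objects don't fragment.
    uint64_t choose_alloc_size(uint64_t expected_object_size, uint32_t alloc_hints)
    {
      const bool immutable = alloc_hints & HINT_IMMUTABLE;
      if (immutable && expected_object_size <= SMALL_OBJ_CUTOFF) {
        return BLOCK_SIZE;      // e.g. a small RGW object
      }
      return HDD_MIN_ALLOC;     // mutable or large object
    }

With the chosen size tagged per object (or, per point 7, per blob), later reads and overwrites know which granularity the extents were allocated with.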
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx