Re: Bluestore min_alloc size space amplification cheatsheet

On 11/22/2019 12:50 AM, Sage Weil wrote:
On Thu, 21 Nov 2019, Mark Nelson wrote:
Hi Folks,


We're discussing changing the minimum allocation size in bluestore to 4K.  For
flash devices this appears to be a no-brainer: we've made the write path fast
enough in bluestore that we typically see the same or faster performance with
a 4K min_alloc size, and the space savings for small objects easily outweigh
the increase in metadata for large fragmented objects.

For HDDs there are tradeoffs.  A smaller allocation size means more
fragmentation when there are small overwrites (as in RBD), which can mean a
lot more seeks.  Igor was showing some fairly steep RBD performance drops for
medium-large reads/writes once the OSDs started to become fragmented.  For RGW
this isn't nearly as big of a deal, though, since typically the objects
shouldn't become fragmented.  A small (4K) allocation size does, however, mean
that we can write out 4K random writes sequentially and gain a big IOPS win,
which theoretically should benefit both RBD and RGW.

Regarding space amplification, Josh pointed out that our current 64K
allocation size has huge ramifications for overall space-amp when writing out
medium sized objects to EC pools.  In an attempt to actually quantify this, I
made a spreadsheet with some graphs showing a couple of examples of how the
min_alloc size and replication/EC interact with each other at different object
sizes.  The gist of it is that with our current default HDD min_alloc size
(64K), erasure coding can actually have worse space amplification than 3X
replication, even with moderately large (128K) object sizes.  How much this
factors into the decision vs fragmentation is a tough call, but I wanted to at
least showcase the behavior as we work through deciding what our default HDD
behavior should be.


https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
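
To make the numbers concrete, here is a back-of-the-envelope sketch (mine, not
taken from the spreadsheet) of how min_alloc size interacts with replication
and EC.  The only assumption is that every replica and every EC chunk gets
rounded up to a multiple of min_alloc_size on disk; the EC profile (6+3) and
the object sizes are just examples:

    // Back-of-the-envelope space amplification: every replica/chunk is
    // assumed to be rounded up to a multiple of min_alloc_size on disk.
    #include <cstdint>
    #include <cstdio>

    static uint64_t round_up(uint64_t x, uint64_t align) {
        return (x + align - 1) / align * align;
    }

    // Bytes allocated for one object under n-way replication.
    static uint64_t replicated_bytes(uint64_t obj, uint64_t min_alloc, unsigned n) {
        return n * round_up(obj, min_alloc);
    }

    // Bytes allocated under EC k+m: the object is striped into k data
    // chunks; each data and coding chunk is padded to min_alloc.
    static uint64_t ec_bytes(uint64_t obj, uint64_t min_alloc, unsigned k, unsigned m) {
        uint64_t chunk = round_up(obj, k) / k;
        return (k + m) * round_up(chunk, min_alloc);
    }

    int main() {
        const uint64_t KiB = 1024;
        for (uint64_t obj : {16 * KiB, 64 * KiB, 128 * KiB, 1024 * KiB}) {
            for (uint64_t ma : {4 * KiB, 64 * KiB}) {
                printf("obj=%5lluK min_alloc=%2lluK  3x-rep=%.2fx  EC6+3=%.2fx\n",
                       (unsigned long long)(obj / KiB),
                       (unsigned long long)(ma / KiB),
                       (double)replicated_bytes(obj, ma, 3) / obj,
                       (double)ec_bytes(obj, ma, 6, 3) / obj);
            }
        }
        return 0;
    }

With a 64K min_alloc, a 128K object on EC 6+3 allocates 9 x 64K = 576K (4.5x)
versus 384K (3.0x) for 3X replication, which is the crossover the spreadsheet
illustrates; at a 4K min_alloc the EC figure drops to about 1.7x.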
The key difference (at the bluestore level) between RGW and RBD/CephFS 
writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE | 
CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints.  The immutable one in 
particular is what we really care about, since it's the mutable objects 
that get overwrites that lead to (most) fragmentation.  We should use this 
to decide whether to create minimal (min_alloc_size) blobs or whether we 
should keep the blobs larger to limit fragmentation.

I'm not sure what we would call the config option that isn't super 
confusing... maybe bluestore_mutable_min_blob_size? 
bluestore_baseline_min_blob_size?
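
As a strawman, the check could look something like this (a hypothetical
sketch, not actual BlueStore code; the flag value and the
bluestore_mutable_min_blob_size option are placeholders, the latter being one
of the names floated above):

    #include <cstdint>

    // Placeholder value for this sketch; the real flag is defined in
    // Ceph's rados headers.
    constexpr uint32_t CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE = 1u << 5;

    struct Config {
        uint64_t min_alloc_size = 4096;                   // 4K
        uint64_t bluestore_mutable_min_blob_size = 65536; // 64K, candidate name
    };

    // Immutable objects (e.g. RGW writes) never see overwrites, so small
    // blobs cost nothing in future fragmentation; mutable objects keep
    // the larger blob size to limit fragmentation from overwrites.
    uint64_t min_blob_size(const Config& cfg, uint32_t alloc_hint_flags) {
        if (alloc_hint_flags & CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE)
            return cfg.min_alloc_size;
        return cfg.bluestore_mutable_min_blob_size;
    }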

I'm thinking about introducing the following behavior for spinner-based stores (preliminary notes for now):

1) Unconditionally pin the allocator's granularity to 4K (the block size), i.e. untie it from the min_alloc_size setting.

This will probably result in additional overhead in the allocator when one needs to allocate contiguous 64K blobs (see 3) below), but it looks like the new AVL allocator handles that well.

2) Allocate space using a 4K block size for small (<=32K or <=48K or <60K?) objects tagged with the IMMUTABLE flag.

3) Use the existing min_alloc_size default (64K) for all(?) other objects.

4) Tag each object with the alloc size it's using.

------- The above is enough to deal with RGW space amplification for small objects, which is the worst case for now IMO; a rough sketch of this policy follows below. -------
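
Putting 1)-4) together, my reading of the proposal boils down to something
like this (a rough sketch only; the cutoff, the constants, and the helper
names are placeholders, not an implementation):

    #include <cstdint>

    constexpr uint64_t BLOCK_SIZE = 4096;               // 1) allocator granularity pinned to 4K
    constexpr uint64_t HDD_MIN_ALLOC = 64 * 1024;       // 3) existing 64K default
    constexpr uint64_t SMALL_IMMUTABLE_MAX = 32 * 1024; // 2) one candidate cutoff

    // 2)+3): pick the alloc size at write time ...
    uint64_t choose_alloc_size(uint64_t object_size, bool immutable_hint) {
        if (immutable_hint && object_size <= SMALL_IMMUTABLE_MAX)
            return BLOCK_SIZE;    // small immutable (e.g. RGW) objects
        return HDD_MIN_ALLOC;     // everything else keeps 64K
    }

    // 4): the chosen size is recorded on the object so later reads and
    // overwrites know which granularity it was written with.
    struct OnodeAllocTag {
        uint64_t alloc_size;
    };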

Additional points to consider:

5) Add a per-object hint (more flexible and complex) or a per-store setting to optimize onode storage for size vs. for speed. Maybe even split the latter into optimize-for-reads vs. optimize-for-overwrites?

EC and/or other concerned entities would indicate "OPTIMIZE_FOR_SPACE". Unconditionally or using some logic?

6) Apply the appropriate alloc size depending on the strategy determined at 5).

7) An even more versatile approach would be to vary the alloc size at the onode's blob level, i.e. tag the blob rather than the object with the applied alloc size (this needs just a single bit, in fact).

This may be useful to deal with space amplification for tail blobs or sparse objects; a sketch of such a per-blob tag follows below.
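
For 7), since only two alloc sizes are in play, the per-blob tag really can be
a single flag bit, e.g. (names and bit position purely illustrative):

    #include <cstdint>

    constexpr uint32_t BLOB_FLAG_FINE_ALLOC = 1u << 0; // blob written with 4K units

    struct BlobAllocInfo {
        uint32_t flags = 0;
        bool fine_alloc() const { return flags & BLOB_FLAG_FINE_ALLOC; }
        uint64_t alloc_unit() const { return fine_alloc() ? 4096 : 65536; }
    };

    // A tail blob of an otherwise 64K-allocated object could set
    // BLOB_FLAG_FINE_ALLOC so its last partial chunk only costs 4K-sized
    // padding instead of a full 64K unit.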


Thoughts?


Thanks,

Igor

sage

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
