I'd support a smaller default, if only for immutable objects. We did
some testing of "small" 64KB objects on a 4+2 pool, and the amp was
clearly a huge issue.

On Thu, Nov 21, 2019 at 1:50 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 21 Nov 2019, Mark Nelson wrote:
> > Hi Folks,
> >
> > We're discussing changing the minimum allocation size in bluestore
> > to 4k. For flash devices this appears to be a no-brainer. We've made
> > the write path fast enough in bluestore that we're typically seeing
> > either the same or faster performance with a 4K min_alloc size, and
> > the space savings for small objects easily outweigh the increase in
> > metadata for large fragmented objects.
> >
> > For HDDs there are tradeoffs. A smaller allocation size means more
> > fragmentation when there are small overwrites (as in RBD), which can
> > mean a lot more seeks. Igor was showing some fairly steep RBD
> > performance drops for medium-large reads/writes once the OSDs
> > started to become fragmented. For RGW this isn't nearly as big a
> > deal, though, since typically the objects shouldn't become
> > fragmented. A small (4K) allocation size does mean, however, that we
> > can write out 4K random writes sequentially and gain a big IOPS win,
> > which theoretically should benefit both RBD and RGW.
> >
> > Regarding space amplification, Josh pointed out that our current 64K
> > allocation size has huge ramifications for overall space-amp when
> > writing out medium-sized objects to EC pools. In an attempt to
> > actually quantify this, I made a spreadsheet with some graphs
> > showing a couple of examples of how the min_alloc size and
> > replication/EC interact with each other at different object sizes.
> > The gist of it is that with our current default HDD min_alloc size
> > (64K), erasure coding can actually have worse space amplification
> > than 3X replication, even with moderately large (128K) object sizes.
> > How much this factors into the decision vs fragmentation is a tough
> > call, but I wanted to at least showcase the behavior as we work
> > through deciding what our default HDD behavior should be.
> >
> > https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
>
> The key difference (at the bluestore level) between RGW and RBD/CephFS
> writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
> CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints. The immutable one in
> particular is what we really care about, since it's the mutable objects
> that get overwrites that lead to (most) fragmentation. We should use
> this to decide whether to create minimal (min_alloc_size) blobs or
> whether we should keep the blobs larger to limit fragmentation.
>
> I'm not sure what we would call the config option that isn't super
> confusing... maybe bluestore_mutable_min_blob_size?
> bluestore_baseline_min_blob_size?
>
> sage
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
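The space-amplification interaction Mark describes can be sketched with a toy model: assume each of the k data shards of an EC object holds ceil(object_size / k) bytes, and every shard (data and parity) is rounded up to min_alloc_size on disk. This ignores bluestore blob and stripe-unit details, so treat the numbers as illustrative rather than a reproduction of the spreadsheet:

```python
KB = 1024

def rounded(size, alloc):
    """Round a shard up to the allocation unit (min_alloc_size)."""
    return -(-size // alloc) * alloc  # ceiling division

def rep_footprint(obj, alloc, copies=3):
    """On-disk bytes for a replicated object."""
    return copies * rounded(obj, alloc)

def ec_footprint(obj, alloc, k=4, m=2):
    """On-disk bytes for a k+m EC object; each shard rounds up separately."""
    shard = -(-obj // k)  # data bytes per shard
    return (k + m) * rounded(shard, alloc)

if __name__ == "__main__":
    for obj in (16 * KB, 64 * KB, 128 * KB, 1024 * KB):
        rep = rep_footprint(obj, 64 * KB) / obj
        ec64 = ec_footprint(obj, 64 * KB) / obj
        ec4 = ec_footprint(obj, 4 * KB) / obj
        print(f"{obj // KB:5d} KB  3x-rep amp {rep:.2f}  "
              f"EC 4+2 @64K amp {ec64:.2f}  EC 4+2 @4K amp {ec4:.2f}")
```

Under this model a 64K object on a 4+2 pool with 64K min_alloc occupies six full 64K allocation units (384K, 6.0x amplification) versus 192K (3.0x) for 3X replication, while dropping min_alloc to 4K brings the same EC write down to 1.5x.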
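Sage's suggestion of branching on the immutable hint when picking a blob-size floor might look roughly like the following sketch. The flag values, the 64K mutable floor, and the option name bluestore_mutable_min_blob_size are all placeholders taken from the thread's speculation, not actual Ceph constants or an actual implementation:

```python
# Illustrative flag values -- not the real CEPH_OSD_ALLOC_HINT_* constants.
FLAG_IMMUTABLE = 1 << 0
FLAG_APPEND_ONLY = 1 << 1

MIN_ALLOC_SIZE = 4 * 1024             # the proposed 4K default
MUTABLE_MIN_BLOB_SIZE = 64 * 1024     # hypothetical bluestore_mutable_min_blob_size

def min_blob_size(alloc_hint_flags):
    """Pick the smallest blob size for a write based on client alloc hints.

    Immutable objects (e.g. RGW data) never see overwrites, so
    fragmentation isn't a concern and min_alloc_size granularity is safe.
    Mutable objects (RBD/CephFS) keep a larger floor to limit HDD seeks.
    """
    if alloc_hint_flags & FLAG_IMMUTABLE:
        return MIN_ALLOC_SIZE
    return MUTABLE_MIN_BLOB_SIZE
```

The design point is that the policy keys off the hint the client already sends, so RGW gets the space-efficient behavior automatically while RBD keeps the fragmentation-resistant default.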