Hi Mark,

I can confirm that EC+HDD space amplification with the 64k min_alloc_size is quite real. On one of our first bluestore clusters we measured amplification at ~30% of the used space. This is also problematic for filestore-to-bluestore repaves on failures. Below are two OSDs with the same number of PGs, one on bluestore with a 64k min_alloc_size, one on filestore:

ID  CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
137 hdd   7.26599 1.00000  7440G 4193G 3246G 56.36 1.14 126
218 hdd   7.26599 1.00000  7451G 6170G 1281G 82.81 1.68 126

Since then, we changed min_alloc_size to 16k and saw some good results. A 24h benchmark that wrote about 1TB of small objects to a single-OSD pool didn't show a significant performance difference between 64k and 16k.

min_alloc_size 16k vs 64k on the same cluster (a different cluster from the one above):

ID  CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
221 hdd   7.27699 1.00000  7451G 3193G 4258G 42.85 0.78 108
579 hdd   7.27698 1.00000  7451G 4272G 3179G 57.33 1.05 108
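For reference, the change amounts to something like the following minimal ceph.conf sketch (assuming the HDD-specific option; min_alloc_size is baked in at mkfs time, so it only takes effect on OSDs provisioned after the change):

[osd]
# only applies to OSDs created after this is set; existing OSDs
# keep the min_alloc_size they were built with
bluestore_min_alloc_size_hdd = 16384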
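To put rough numbers on the EC interaction Mark quantifies in the spreadsheet below, here is a back-of-the-envelope sketch (a simplified model, and the 6+2 profile is just an assumption: each EC shard is ceil(object_size / k), every shard or replica rounds up to min_alloc_size, and stripe_unit, omap and other metadata overhead are ignored):

import math

def round_up(n, align):
    return math.ceil(n / align) * align

def replicated_amp(obj_size, copies, min_alloc):
    # total bytes allocated for `copies` full replicas / logical size
    return copies * round_up(obj_size, min_alloc) / obj_size

def ec_amp(obj_size, k, m, min_alloc):
    # object is striped into k data shards plus m coding shards,
    # each stored as its own bluestore object
    shard = math.ceil(obj_size / k)
    return (k + m) * round_up(shard, min_alloc) / obj_size

KiB = 1024
for min_alloc in (64 * KiB, 16 * KiB, 4 * KiB):
    print("min_alloc=%2dK  3x-rep=%.2fx  EC 6+2=%.2fx" % (
        min_alloc // KiB,
        replicated_amp(128 * KiB, 3, min_alloc),
        ec_amp(128 * KiB, 6, 2, min_alloc)))

Under that model a 128K object in a 6+2 pool with the 64K default allocates 8 x 64K = 512K (4.0x), worse than the 384K (3.0x) of 3x replication; at 16K it drops to 2.0x and at 4K to 1.5x.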
On Thu, Nov 21, 2019 at 1:50 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 21 Nov 2019, Mark Nelson wrote:
> > Hi Folks,
> >
> > We're discussing changing the minimum allocation size in bluestore to 4k.
> > For flash devices this appears to be a no-brainer. We've made the write
> > path fast enough in bluestore that we're typically seeing either the same
> > or faster performance with a 4K min_alloc size and the space savings for
> > small objects easily outweigh the increase in metadata for large
> > fragmented objects.
> >
> > For HDDs there are tradeoffs. A smaller allocation size means more
> > fragmentation when there are small overwrites (like in RBD) which can
> > mean a lot more seeks. Igor was showing some fairly steep RBD performance
> > drops for medium-large reads/writes once the OSDs started to become
> > fragmented. For RGW this isn't nearly as big of a deal though since
> > typically the objects shouldn't become fragmented. A small (4K) allocation
> > size does mean however that we can write out 4K random writes sequentially
> > and gain a big IOPS win which theoretically should benefit both RBD and
> > RGW.
> >
> > Regarding space-amplification, Josh pointed out that our current 64K
> > allocation size has huge ramifications for overall space-amp when writing
> > out medium sized objects to EC pools. In an attempt to actually quantify
> > this, I made a spreadsheet with some graphs showing a couple of examples
> > of how the min_alloc size and replication/EC interact with each other at
> > different object sizes. The gist of it is that with our current default
> > HDD min_alloc size (64K), erasure coding can actually have worse space
> > amplification than 3X replication, even with moderately large (128K)
> > object sizes. How much this factors into the decision vs fragmentation is
> > a tough call, but I wanted to at least showcase the behavior as we work
> > through deciding what our default HDD behavior should be.
> >
> > https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
>
> The key difference (at the bluestore level) between RGW and RBD/CephFS
> writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
> CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints. The immutable one in
> particular is what we really care about, since it's the mutable objects
> that get overwrites that lead to (most) fragmentation. We should use this
> to decide whether to create minimal (min_alloc_size) blobs or whether we
> should keep the blobs larger to limit fragmentation.
>
> I'm not sure what we would call the config option that isn't super
> confusing... maybe bluestore_mutable_min_blob_size?
> bluestore_baseline_min_blob_size?
>
> sage
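For discussion, the hint-gated policy Sage describes could look roughly like the sketch below (illustrative Python pseudocode, not BlueStore code; the flag values and the mutable_min_blob_size knob are placeholders for whatever the option ends up being called):

CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY = 1 << 4   # illustrative values
CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE   = 1 << 5

def min_blob_size(alloc_hint_flags, min_alloc_size, mutable_min_blob_size):
    # Immutable objects (e.g. RGW data) never see overwrites, so
    # fragmentation isn't a concern: allocate as tightly as possible.
    if alloc_hint_flags & CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE:
        return min_alloc_size
    # Mutable objects (RBD/CephFS) accumulate overwrites; keep blobs
    # larger (e.g. 64K on HDD) to limit fragmentation and seeks.
    return max(min_alloc_size, mutable_min_blob_size)

Something along those lines would let RGW-heavy pools get the tight 4K space behavior while RBD/CephFS keep larger blobs on HDD.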