Bluestore min_alloc size space amplification cheatsheet

Hi Folks,


We're discussing changing the minimum allocation size (min_alloc size) in bluestore to 4K.  For flash devices this appears to be a no-brainer.  We've made the write path in bluestore fast enough that we typically see the same or better performance with a 4K min_alloc size, and the space savings for small objects easily outweigh the increase in metadata for large fragmented objects.

For HDDs there are tradeoffs.  A smaller allocation size means more fragmentation when there are small overwrites (as in RBD), which can mean a lot more seeks.  Igor was showing some fairly steep RBD performance drops for medium-to-large reads/writes once the OSDs started to become fragmented.  For RGW this isn't nearly as big a deal, since objects typically shouldn't become fragmented.  A small (4K) allocation size does mean, however, that we can write out 4K random writes sequentially and gain a big IOPS win, which in theory should benefit both RBD and RGW.

Regarding space amplification, Josh pointed out that our current 64K allocation size has huge ramifications for overall space-amp when writing medium-sized objects to EC pools.  To actually quantify this, I made a spreadsheet with some graphs showing a couple of examples of how min_alloc size and replication/EC interact at different object sizes.  The gist is that with our current default HDD min_alloc size (64K), erasure coding can actually have worse space amplification than 3X replication, even with moderately large (128K) object sizes.  How much this factors into the decision versus fragmentation is a tough call, but I wanted to at least showcase the behavior as we work through deciding what our default HDD behavior should be.
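
For anyone who wants to poke at the numbers without opening the spreadsheet, here is a rough Python sketch of the kind of arithmetic involved.  It is a simplified model and not necessarily what the spreadsheet does: it assumes each replica or EC chunk is rounded up to a whole multiple of min_alloc size on its OSD, that an EC object is split evenly into k data chunks plus m equally sized coding chunks, and it ignores stripe_unit, metadata and compression.  The function names and the 8+3 profile are illustrative choices only.

```python
KiB = 1024

def round_up(size, min_alloc):
    """Round size up to the next multiple of min_alloc (integer math)."""
    return -(-size // min_alloc) * min_alloc

def replicated_amp(object_size, min_alloc, replicas=3):
    """Allocated bytes / logical bytes for a replicated pool."""
    return replicas * round_up(object_size, min_alloc) / object_size

def ec_amp(object_size, min_alloc, k, m):
    """Allocated bytes / logical bytes for a k+m EC pool.

    Assumption: the object is split into k equal data chunks, with m
    coding chunks of the same size, each stored on its own OSD.
    """
    chunk = -(-object_size // k)  # ceil(object_size / k)
    return (k + m) * round_up(chunk, min_alloc) / object_size

if __name__ == "__main__":
    for min_alloc in (64 * KiB, 4 * KiB):
        for obj in (4 * KiB, 64 * KiB, 128 * KiB, 1024 * KiB):
            print(
                f"min_alloc={min_alloc // KiB:>2}K object={obj // KiB:>4}K "
                f"3x={replicated_amp(obj, min_alloc):6.2f} "
                f"EC8+3={ec_amp(obj, min_alloc, 8, 3):6.2f}"
            )
```

Under these assumptions a 128K object in an 8+3 pool splits into 16K chunks, each of which occupies a full 64K allocation unit, so 11 x 64K = 704K is allocated for 128K of data (5.5X), versus 3X for replication.  With a 4K min_alloc size the same object allocates 11 x 16K = 176K (about 1.4X).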


https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing


Thanks,

Mark