Hi Mark,

I can confirm that EC+HDD space amplification with the 64k min_alloc_size is quite real. On one of our first bluestore clusters we measured amplification at ~30% of the used space. This is also problematic for filestore-to-bluestore repaves on failures. Below are two OSDs with the same number of PGs, one on bluestore with a 64k min_alloc_size, one on filestore:

ID  CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
137 hdd   7.26599 1.00000  7440G 4193G 3246G 56.36 1.14 126
218 hdd   7.26599 1.00000  7451G 6170G 1281G 82.81 1.68 126

Since then, we changed min_alloc_size to 16k and saw some good results. A 24h benchmark that wrote about 1TB of small objects to a single-OSD pool didn't show a significant performance difference between 64k and 16k.

min_alloc_size 16k vs 64k on the same cluster (a different cluster from the one above):

ID  CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
221 hdd   7.27699 1.00000  7451G 3193G 4258G 42.85 0.78 108
579 hdd   7.27698 1.00000  7451G 4272G 3179G 57.33 1.05 108
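For reference, the change amounts to something like the following minimal ceph.conf sketch (assuming the HDD-specific option; min_alloc_size is baked in at mkfs time, so it only takes effect on OSDs provisioned after the change):

[osd]
# only applies to OSDs created after this is set; existing OSDs
# keep the min_alloc_size they were built with
bluestore_min_alloc_size_hdd = 16384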
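To put rough numbers on the EC interaction Mark quantifies in the spreadsheet below, here is a back-of-the-envelope sketch (a simplified model, and the 6+2 profile is just an assumption: each EC shard is ceil(object_size / k), every shard or replica rounds up to min_alloc_size, and stripe_unit, omap and other metadata overhead are ignored):

import math

def round_up(n, align):
    return math.ceil(n / align) * align

def replicated_amp(obj_size, copies, min_alloc):
    # total bytes allocated for `copies` full replicas / logical size
    return copies * round_up(obj_size, min_alloc) / obj_size

def ec_amp(obj_size, k, m, min_alloc):
    # object is striped into k data shards plus m coding shards,
    # each stored as its own bluestore object
    shard = math.ceil(obj_size / k)
    return (k + m) * round_up(shard, min_alloc) / obj_size

KiB = 1024
for min_alloc in (64 * KiB, 16 * KiB, 4 * KiB):
    print("min_alloc=%2dK  3x-rep=%.2fx  EC 6+2=%.2fx" % (
        min_alloc // KiB,
        replicated_amp(128 * KiB, 3, min_alloc),
        ec_amp(128 * KiB, 6, 2, min_alloc)))

Under that model a 128K object in a 6+2 pool with the 64K default allocates 8 x 64K = 512K (4.0x), worse than the 384K (3.0x) of 3x replication; at 16K it drops to 2.0x and at 4K to 1.5x.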
On Thu, Nov 21, 2019 at 1:50 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 21 Nov 2019, Mark Nelson wrote:
> > Hi Folks,
> >
> > We're discussing changing the minimum allocation size in bluestore to 4k.
> > For flash devices this appears to be a no-brainer. We've made the write
> > path fast enough in bluestore that we're typically seeing either the same
> > or faster performance with a 4K min_alloc size and the space savings for
> > small objects easily outweigh the increase in metadata for large
> > fragmented objects.
> >
> > For HDDs there are tradeoffs. A smaller allocation size means more
> > fragmentation when there are small overwrites (like in RBD) which can
> > mean a lot more seeks. Igor was showing some fairly steep RBD performance
> > drops for medium-large reads/writes once the OSDs started to become
> > fragmented. For RGW this isn't nearly as big of a deal though since
> > typically the objects shouldn't become fragmented. A small (4K) allocation
> > size does mean however that we can write out 4K random writes sequentially
> > and gain a big IOPS win which theoretically should benefit both RBD and
> > RGW.
> >
> > Regarding space-amplification, Josh pointed out that our current 64K
> > allocation size has huge ramifications for overall space-amp when writing
> > out medium sized objects to EC pools. In an attempt to actually quantify
> > this, I made a spreadsheet with some graphs showing a couple of examples
> > of how the min_alloc size and replication/EC interact with each other at
> > different object sizes. The gist of it is that with our current default
> > HDD min_alloc size (64K), erasure coding can actually have worse space
> > amplification than 3X replication, even with moderately large (128K)
> > object sizes. How much this factors into the decision vs fragmentation is
> > a tough call, but I wanted to at least showcase the behavior as we work
> > through deciding what our default HDD behavior should be.
> >
> > https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
>
> The key difference (at the bluestore level) between RGW and RBD/CephFS
> writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
> CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints. The immutable one in
> particular is what we really care about, since it's the mutable objects
> that get overwrites that lead to (most) fragmentation. We should use this
> to decide whether to create minimal (min_alloc_size) blobs or whether we
> should keep the blobs larger to limit fragmentation.
>
> I'm not sure what we would call the config option that isn't super
> confusing... maybe bluestore_mutable_min_blob_size?
> bluestore_baseline_min_blob_size?
>
> sage
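For discussion, the hint-gated policy Sage describes could look roughly like the sketch below (illustrative Python pseudocode, not BlueStore code; the flag values and the mutable_min_blob_size knob are placeholders for whatever the option ends up being called):

CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY = 1 << 4   # illustrative values
CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE   = 1 << 5

def min_blob_size(alloc_hint_flags, min_alloc_size, mutable_min_blob_size):
    # Immutable objects (e.g. RGW data) never see overwrites, so
    # fragmentation isn't a concern: allocate as tightly as possible.
    if alloc_hint_flags & CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE:
        return min_alloc_size
    # Mutable objects (RBD/CephFS) accumulate overwrites; keep blobs
    # larger (e.g. 64K on HDD) to limit fragmentation and seeks.
    return max(min_alloc_size, mutable_min_blob_size)

Something along those lines would let RGW-heavy pools get the tight 4K space behavior while RBD/CephFS keep larger blobs on HDD.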