I'd support a smaller default, if only for immutable objects. We did
some testing of "small" 64KB objects on a 4+2 pool, and the amp was
clearly a huge issue.

On Thu, Nov 21, 2019 at 1:50 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 21 Nov 2019, Mark Nelson wrote:
> > Hi Folks,
> >
> > We're discussing changing the minimum allocation size in bluestore
> > to 4k. For flash devices this appears to be a no-brainer. We've made
> > the write path fast enough in bluestore that we're typically seeing
> > either the same or faster performance with a 4K min_alloc size, and
> > the space savings for small objects easily outweigh the increase in
> > metadata for large fragmented objects.
> >
> > For HDDs there are tradeoffs. A smaller allocation size means more
> > fragmentation when there are small overwrites (as in RBD), which can
> > mean a lot more seeks. Igor was showing some fairly steep RBD
> > performance drops for medium-large reads/writes once the OSDs
> > started to become fragmented. For RGW this isn't nearly as big a
> > deal, though, since typically the objects shouldn't become
> > fragmented. A small (4K) allocation size does mean, however, that we
> > can write out 4K random writes sequentially and gain a big IOPS win,
> > which theoretically should benefit both RBD and RGW.
> >
> > Regarding space amplification, Josh pointed out that our current 64K
> > allocation size has huge ramifications for overall space-amp when
> > writing out medium-sized objects to EC pools. In an attempt to
> > actually quantify this, I made a spreadsheet with some graphs
> > showing a couple of examples of how the min_alloc size and
> > replication/EC interact with each other at different object sizes.
> > The gist of it is that with our current default HDD min_alloc size
> > (64K), erasure coding can actually have worse space amplification
> > than 3X replication, even with moderately large (128K) object sizes.
> > How much this factors into the decision vs fragmentation is a tough
> > call, but I wanted to at least showcase the behavior as we work
> > through deciding what our default HDD behavior should be.
> >
> > https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
>
> The key difference (at the bluestore level) between RGW and RBD/CephFS
> writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
> CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints. The immutable one in
> particular is what we really care about, since it's the mutable objects
> that get overwrites that lead to (most) fragmentation. We should use
> this to decide whether to create minimal (min_alloc_size) blobs or
> whether we should keep the blobs larger to limit fragmentation.
>
> I'm not sure what we would call the config option that isn't super
> confusing... maybe bluestore_mutable_min_blob_size?
> bluestore_baseline_min_blob_size?
>
> sage
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
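The space-amplification interaction Mark describes can be sketched with a toy model: assume each of the k data shards of an EC object holds ceil(object_size / k) bytes, and every shard (data and parity) is rounded up to min_alloc_size on disk. This ignores bluestore blob and stripe-unit details, so treat the numbers as illustrative rather than a reproduction of the spreadsheet:

```python
KB = 1024

def rounded(size, alloc):
    """Round a shard up to the allocation unit (min_alloc_size)."""
    return -(-size // alloc) * alloc  # ceiling division

def rep_footprint(obj, alloc, copies=3):
    """On-disk bytes for a replicated object."""
    return copies * rounded(obj, alloc)

def ec_footprint(obj, alloc, k=4, m=2):
    """On-disk bytes for a k+m EC object; each shard rounds up separately."""
    shard = -(-obj // k)  # data bytes per shard
    return (k + m) * rounded(shard, alloc)

if __name__ == "__main__":
    for obj in (16 * KB, 64 * KB, 128 * KB, 1024 * KB):
        rep = rep_footprint(obj, 64 * KB) / obj
        ec64 = ec_footprint(obj, 64 * KB) / obj
        ec4 = ec_footprint(obj, 4 * KB) / obj
        print(f"{obj // KB:5d} KB  3x-rep amp {rep:.2f}  "
              f"EC 4+2 @64K amp {ec64:.2f}  EC 4+2 @4K amp {ec4:.2f}")
```

Under this model a 64K object on a 4+2 pool with 64K min_alloc occupies six full 64K allocation units (384K, 6.0x amplification) versus 192K (3.0x) for 3X replication, while dropping min_alloc to 4K brings the same EC write down to 1.5x.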
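Sage's suggestion of branching on the immutable hint when picking a blob-size floor might look roughly like the following sketch. The flag values, the 64K mutable floor, and the option name bluestore_mutable_min_blob_size are all placeholders taken from the thread's speculation, not actual Ceph constants or an actual implementation:

```python
# Illustrative flag values -- not the real CEPH_OSD_ALLOC_HINT_* constants.
FLAG_IMMUTABLE = 1 << 0
FLAG_APPEND_ONLY = 1 << 1

MIN_ALLOC_SIZE = 4 * 1024             # the proposed 4K default
MUTABLE_MIN_BLOB_SIZE = 64 * 1024     # hypothetical bluestore_mutable_min_blob_size

def min_blob_size(alloc_hint_flags):
    """Pick the smallest blob size for a write based on client alloc hints.

    Immutable objects (e.g. RGW data) never see overwrites, so
    fragmentation isn't a concern and min_alloc_size granularity is safe.
    Mutable objects (RBD/CephFS) keep a larger floor to limit HDD seeks.
    """
    if alloc_hint_flags & FLAG_IMMUTABLE:
        return MIN_ALLOC_SIZE
    return MUTABLE_MIN_BLOB_SIZE
```

The design point is that the policy keys off the hint the client already sends, so RGW gets the space-efficient behavior automatically while RBD keeps the fragmentation-resistant default.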