Hi Folks,
We're discussing changing the minimum allocation size in bluestore to
4K. For flash devices this appears to be a no-brainer. We've made the
write path fast enough in bluestore that we're typically seeing either
the same or faster performance with a 4K min_alloc size, and the space
savings for small objects easily outweigh the increase in metadata for
large fragmented objects.
For HDDs there are tradeoffs. A smaller allocation size means more
fragmentation when there are small overwrites (as with RBD), which can
mean a lot more seeks. Igor was showing some fairly steep RBD
performance drops for medium-to-large reads/writes once the OSDs started
to become fragmented. For RGW this isn't nearly as big a deal, though,
since the objects typically shouldn't become fragmented. A small (4K)
allocation size does mean, however, that we can write out 4K random
writes sequentially and gain a big IOPS win, which theoretically should
benefit both RBD and RGW.
Regarding space amplification, Josh pointed out that our current 64K
allocation size has huge ramifications for overall space-amp when
writing out medium-sized objects to EC pools. In an attempt to actually
quantify this, I made a spreadsheet with some graphs showing a couple of
examples of how the min_alloc size and replication/EC interact with each
other at different object sizes. The gist of it is that with our
current default HDD min_alloc size (64K), erasure coding can actually
have worse space amplification than 3X replication, even with moderately
large (128K) object sizes. How much this factors into the decision vs
fragmentation is a tough call, but I wanted to at least showcase the
behavior as we work through deciding what our default HDD behavior
should be.
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
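The interaction between min_alloc size and replication/EC can be
approximated with a rough Python sketch. This is a simplified model
under stated assumptions, not the spreadsheet's actual formulas: it
assumes every replica or EC shard is rounded up to a whole number of
min_alloc units, that an object is split evenly across the k data
shards, and it ignores EC stripe-unit padding, metadata overhead, and
compression. The function names and the 8+3 EC profile are hypothetical
and chosen only for illustration.

```python
KIB = 1024


def round_up(n, unit):
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit


def replicated_alloc(obj_size, min_alloc, copies=3):
    """Bytes allocated across all replicas (assumed model)."""
    return copies * round_up(obj_size, min_alloc)


def ec_alloc(obj_size, min_alloc, k, m):
    """Bytes allocated across all k+m shards (assumed model).

    Each data shard holds ceil(obj_size / k) bytes, and every shard
    (data and coding) is rounded up to min_alloc.
    """
    shard = -(-obj_size // k)
    return (k + m) * round_up(shard, min_alloc)


def space_amp(allocated, obj_size):
    """Allocated bytes divided by logical object bytes."""
    return allocated / obj_size


if __name__ == "__main__":
    # Illustrative sweep: 3X replication vs. a hypothetical 8+3 profile.
    for min_alloc in (4 * KIB, 64 * KIB):
        for obj in (4 * KIB, 16 * KIB, 64 * KIB, 128 * KIB, 1024 * KIB):
            rep = space_amp(replicated_alloc(obj, min_alloc), obj)
            ec = space_amp(ec_alloc(obj, min_alloc, 8, 3), obj)
            print(f"min_alloc={min_alloc // KIB}K obj={obj // KIB}K "
                  f"3x-rep={rep:.3f} ec8+3={ec:.3f}")
```

Under these assumptions, a 128K object with a 64K min_alloc size comes
out at 3.0x for 3X replication and 5.5x for 8+3 EC, because each 16K
shard is padded to 64K. With a 4K min_alloc size the same EC case drops
to 1.375x.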
Thanks,
Mark