On Tue, 14 Jun 2016, Allen Samuels wrote:
> For flash what we want to do is leave min_alloc_size at 4K and figure
> out how to shrink the oNode so that the KV commit fits into a minimal
> number of writes.
>
> There are two obvious things to do w.r.t. shrinking the oNode size:
>
> (1) sophisticated encode/decode function. I've talked about this before,
> hopefully I'll have more time to dig into this shortly.
>
> (2) Reducing the stripe size. A larger stripe size tends to improve
> sequential read/write speeds when the application is doing large I/O
> operations (less I/O fracturing). It will also reduce metadata size by
> amortizing the fixed size of an oNode (i.e., the stuff in an oNode that
> doesn't scale with the object size) across fewer oNodes. Both of these
> phenomena provide decreasing benefits as the stripe size increases.
> However, larger oNodes cost more to read/write for random I/O
> operations. I believe that for flash, the current default stripe size of
> 4MB is too large in that the gains for sequential operations are minimal
> and the penalty on random operations is too large... This belief should
> be subjected to experimental verification AFTER we've shrunk the oNode
> using (1). It's also possible that the optimal stripe size (for flash)
> is HW dependent -- since the variance in performance characteristics
> between different flash devices can be rather large.

Agreed on both of these.

Not mutually exclusive with (3), though: increase blob size via larger
min_alloc_size. 4K random write benchmark write-amp aside, I still think
we may end up with an onode size where the lower write latency and the
half- to quarter-size lextent/blob map reduce metadata compaction
overhead enough to offset the larger initial txn sizes. We'll see when
we benchmark.

sage
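
For a rough feel of the tradeoff discussed above, here is a minimal
back-of-the-envelope sketch. It is not BlueStore code: the function
names, the 200-byte fixed onode cost, and the 30-byte per-extent cost
are all hypothetical placeholders, and it assumes the worst case where
every min_alloc_size unit in a stripe gets its own lextent/blob entry.

```python
# Toy model only. All constants are made-up placeholders, not measured
# BlueStore encodings.

def onode_bytes(stripe_size, min_alloc_size,
                fixed_bytes=200, per_extent_bytes=30):
    """Worst-case encoded onode size: fixed part plus one map entry
    per min_alloc_size unit in the stripe (fully fragmented object)."""
    extents = stripe_size // min_alloc_size
    return fixed_bytes + extents * per_extent_bytes


def total_metadata_bytes(data_size, stripe_size, min_alloc_size,
                         fixed_bytes=200, per_extent_bytes=30):
    """Total onode bytes needed to describe data_size bytes of data."""
    objects = -(-data_size // stripe_size)  # ceiling division
    return objects * onode_bytes(stripe_size, min_alloc_size,
                                 fixed_bytes, per_extent_bytes)


if __name__ == "__main__":
    KiB, MiB, GiB = 1 << 10, 1 << 20, 1 << 30
    for stripe in (1 * MiB, 4 * MiB):
        for alloc in (4 * KiB, 8 * KiB, 16 * KiB):
            print(f"stripe={stripe // MiB}MiB alloc={alloc // KiB}KiB "
                  f"onode={onode_bytes(stripe, alloc)}B "
                  f"total_for_1GiB={total_metadata_bytes(GiB, stripe, alloc)}B")
```

Under these assumptions, a smaller stripe shrinks each onode (cheaper
random I/O) but slightly raises total metadata because the fixed part is
paid more often, while a larger min_alloc_size shrinks the map itself by
roughly the half-to-quarter factor mentioned above. Real numbers depend
on the actual encoding and would need to come from benchmarks.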