> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Tuesday, June 14, 2016 2:09 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Igor Fedotov <ifedotov@xxxxxxxxxxxx>;
> Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Manavalan Krishnan
> <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
> <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: RocksDB tuning
>
> On Tue, 14 Jun 2016, Allen Samuels wrote:
> > For flash what we want to do is leave min_alloc_size at 4K and figure
> > out how to shrink the oNode so that the KV commit fits into a minimal
> > number of writes.
> >
> > There are two obvious things to do w.r.t. shrinking the oNode size:
> >
> > (1) A sophisticated encode/decode function. I've talked about this
> > before; hopefully I'll have more time to dig into it shortly.
> >
> > (2) Reducing the stripe size. A larger stripe size tends to improve
> > sequential read/write speeds when the application is doing large I/O
> > operations (less I/O fracturing). It also reduces metadata size by
> > amortizing the fixed cost of an oNode (i.e., the parts of an oNode
> > that don't scale with the object size) across fewer oNodes. Both of
> > these phenomena provide diminishing benefits as the stripe size
> > increases. However, larger oNodes cost more to read and write for
> > random I/O operations. I believe that for flash, the current default
> > stripe size of 4MB is too large: the gains for sequential operations
> > are minimal and the penalty on random operations is too large. This
> > belief should be subjected to experimental verification AFTER we've
> > shrunk the oNode using (1). It's also possible that the optimal stripe
> > size (for flash) is HW dependent -- since the variance in performance
> > characteristics between different flash devices can be rather large.
>
> Agreed on both of these.
>
> Not mutually exclusive with (3), though: increase blob size via larger
> min_alloc_size.
> 4K random write benchmark write-amp aside, I still think we may end up
> with an onode size where the lower write latency and half- to
> quarter-size lextent/blob map reduces metadata compaction overhead
> enough to offset the larger initial txn sizes. We'll see when we
> benchmark.

Yes, we will see. I believe that on flash you'll still lose because
you're writing the data twice. But let the benchmarking proceed! (I'm
assuming an oNode diet has already happened.)

> sage
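The stripe-size trade-off discussed above can be sketched numerically. This is a back-of-envelope model only: the fixed per-oNode cost and per-extent map-entry cost below are assumed round numbers for illustration, not measured BlueStore values. It shows why amortizing the fixed oNode cost gives diminishing returns as the stripe grows, while the oNode itself keeps getting bigger (and thus costlier to read/write on random I/O).

```python
# Illustrative model of the oNode amortization argument in the thread.
# FIXED_ONODE_BYTES and PER_EXTENT_BYTES are assumptions, not measured values.

FIXED_ONODE_BYTES = 300      # per-oNode overhead that doesn't scale with object size (assumed)
PER_EXTENT_BYTES = 40        # cost of one lextent/blob map entry (assumed)
MIN_ALLOC_SIZE = 4 * 1024    # 4K allocation unit, as discussed for flash

def onode_bytes(stripe_size):
    """Approximate encoded oNode size for one stripe-sized object."""
    extents = stripe_size // MIN_ALLOC_SIZE
    return FIXED_ONODE_BYTES + PER_EXTENT_BYTES * extents

def metadata_per_mb(stripe_size):
    """Approximate metadata bytes per MB of object data at a given stripe size."""
    return onode_bytes(stripe_size) * (1024 * 1024 / stripe_size)

for stripe in (64 * 1024, 256 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    print(f"stripe {stripe // 1024:5d} KB: "
          f"oNode ~{onode_bytes(stripe):6d} B, "
          f"~{metadata_per_mb(stripe):8.0f} metadata B per MB of data")
```

Under these assumed constants, total metadata per MB shrinks as the stripe grows (the fixed cost is amortized), but most of the saving is captured well below 4MB, while the per-oNode size grows linearly with the stripe, which is the random-I/O penalty the thread is weighing.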